STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)


Page 1:

STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)

Page 2:

AGENDA

Learning a discrete probability distribution from data

Maximum likelihood estimation (MLE)
Maximum a posteriori (MAP) estimation

Page 3:

MOTIVATION

Previously studied how to infer characteristics of a distribution, given a fully-specified Bayes net

This lecture: where does the specification of the Bayes net come from?

Specifically: estimating the parameters of the CPTs from fully-specified data

Page 4:

LEARNING IN GENERAL

Agent has made observations (data)
Now must make sense of them (hypotheses)
Why?

Hypotheses alone may be important (e.g., in basic science)

For inference (e.g., forecasting)
To take sensible actions (decision making)

A basic component of economics, social and hard sciences, engineering, …

Page 5:

MACHINE LEARNING VS. STATISTICS

Machine learning ≈ automated statistics

This lecture:
Statistical learning (aka Bayesian learning)
Maximum likelihood (ML) learning
Maximum a posteriori (MAP) learning
Learning Bayes Nets (R&N 20.1-3)

Future lectures: try to do more with even less data
Decision tree learning
Neural nets
Support vector machines
…

Page 6:

PUTTING GENERAL PRINCIPLES TO PRACTICE

Note: this lecture will cover general principles of statistical learning on toy problems

Grounded in some of the most theoretically principled approaches to learning
But the techniques are far more general
Practical application to larger problems requires a bit of mathematical savvy and "elbow grease"

Page 7:

BEAUTIFUL RESULTS IN MATH GIVE RISE TO SIMPLE IDEAS

OR

HOW TO JUMP THROUGH HOOPS IN ORDER TO JUSTIFY SIMPLE IDEAS …

Page 8:

CANDY EXAMPLE

• Candy comes in 2 flavors, cherry and lime, with identical wrappers

• Manufacturer makes 5 indistinguishable bags

• Suppose we draw …

• What bag are we holding? What flavor will we draw next?

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 9:

BAYESIAN LEARNING

Main idea: Compute the probability of each hypothesis, given the data

Data d: …
Hypotheses: h1, …, h5

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 10:

BAYESIAN LEARNING

Main idea: Compute the probability of each hypothesis, given the data

Data d: …
Hypotheses: h1, …, h5

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

P(hi|d): this is what we want…
P(d|hi): but this is all we have!

Page 11:

USING BAYES’ RULE

P(hi|d) = α P(d|hi) P(hi) is the posterior

(Recall, 1/α = P(d) = Σi P(d|hi) P(hi))

P(d|hi) is the likelihood

P(hi) is the hypothesis prior

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 12:

LIKELIHOOD AND PRIOR

Likelihood is the probability of observing the data, given the hypothesis model

Hypothesis prior is the probability of a hypothesis, before having observed any data

Page 13:

COMPUTING THE POSTERIOR

Assume draws are independent
Let P(h1),…,P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
d = { 10 limes }

P(d|h1) = 0
P(d|h2) = 0.25^10
P(d|h3) = 0.5^10
P(d|h4) = 0.75^10
P(d|h5) = 1^10

P(d|h1)P(h1) = 0
P(d|h2)P(h2) ≈ 2e-7
P(d|h3)P(h3) ≈ 4e-4
P(d|h4)P(h4) ≈ 0.011
P(d|h5)P(h5) = 0.1

Sum = 1/α ≈ 0.112

P(h1|d) = 0
P(h2|d) ≈ 0.00
P(h3|d) ≈ 0.00
P(h4|d) ≈ 0.10
P(h5|d) ≈ 0.90
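To make the arithmetic above concrete, here is a minimal Python sketch (not from the slides; it assumes the priors and the 10-lime dataset stated above, and the variable names are illustrative):

```python
# Bayesian update for the candy example: priors over h1..h5 and ten lime draws.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | hi) for each bag hypothesis
n_limes = 10

# Likelihood of the data under each hypothesis: P(d | hi) = P(lime | hi)^10
likelihoods = [p ** n_limes for p in p_lime]

# Unnormalized posteriors P(d|hi) P(hi); their sum is P(d) = 1/alpha
unnormalized = [lik * pr for lik, pr in zip(likelihoods, priors)]
evidence = sum(unnormalized)                       # ~ 0.112
posteriors = [u / evidence for u in unnormalized]  # ~ [0, 0, 0.003, 0.101, 0.896]
print(evidence, [round(p, 3) for p in posteriors])
```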

Page 14:

POSTERIOR HYPOTHESES

Page 15:

PREDICTING THE NEXT DRAW

P(X|d) = Σi P(X|hi, d) P(hi|d)
       = Σi P(X|hi) P(hi|d)

P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90

[Bayes net: H is the parent of both D and X, so X is independent of D given H]

P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1

Probability that next candy drawn is a lime

P(X|d) = 0.975
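Continuing the earlier sketch (reusing the illustrative p_lime and posteriors computed above), the prediction is just a posterior-weighted average:

```python
# P(next is lime | d) = sum_i P(lime | hi) P(hi | d)
p_next_lime = sum(p * post for p, post in zip(p_lime, posteriors))
print(round(p_next_lime, 3))   # ~ 0.97 (the slide's rounded posteriors give 0.975)
```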

Page 16:

P(NEXT CANDY IS LIME | D)

Page 17:

PROPERTIES OF BAYESIAN LEARNING

If exactly one hypothesis is correct, then the posterior probability of the correct hypothesis will tend toward 1 as more data is observed

The effect of the prior distribution decreases as more data is observed

Page 18:

HYPOTHESIS SPACES OFTEN INTRACTABLE

To learn a probability distribution, a hypothesis would have to be a joint probability table over state variables
2^n entries ⇒ hypothesis space is (2^n - 1)-dimensional!
2^(2^n) deterministic hypotheses
6 boolean variables ⇒ 2^64 ≈ 2 × 10^19 deterministic hypotheses
And what the heck would a prior be?

Page 19:

LEARNING COIN FLIPS

Let the unknown fraction of cherries be θ (hypothesis)
Probability of drawing a cherry is θ
Suppose draws are independent and identically distributed (i.i.d.)
Observe that c out of N draws are cherries (data)

Page 20:

LEARNING COIN FLIPS

Let the unknown fraction of cherries be θ (hypothesis)
Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)

Page 21:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ:
P(d|θ) = Πj P(dj|θ) = θ^c (1-θ)^(N-c)

(i.i.d. assumption; gather the c cherry terms together, then the N-c lime terms)
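The likelihood curves plotted on the following slides can be reproduced with a few lines of Python; this is a sketch (NumPy assumed), not code from the lecture:

```python
import numpy as np

def likelihood(theta, c, N):
    """P(d | theta) = theta^c (1 - theta)^(N - c) for c cherries out of N i.i.d. draws."""
    return theta ** c * (1 - theta) ** (N - c)

thetas = np.linspace(0.0, 1.0, 101)
for c, N in [(1, 1), (2, 3), (50, 100)]:
    curve = likelihood(thetas, c, N)
    print(c, N, thetas[np.argmax(curve)])   # the peak sits at (or very near) c/N
```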

Page 22:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 1/1 cherry]

Page 23:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/2 cherry]

Page 24:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/3 cherry]

Page 25:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/4 cherry]

Page 26:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/5 cherry]

Page 27:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 10/20 cherry]

Page 28:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1,…,dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 50/100 cherry]

Page 29:

MAXIMUM LIKELIHOOD

Peaks of likelihood function seem to hover around the fraction of cherries…

Sharpness indicates some notion of certainty…

[Plot: P(data|θ) vs. θ for 50/100 cherry]

Page 30:

MAXIMUM LIKELIHOOD

Let P(d|θ) be the likelihood function
The quantity argmax_θ P(d|θ) is known as the maximum likelihood estimate (MLE)

Page 31:

MAXIMUM LIKELIHOOD

The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained, because log is a monotonically increasing function

So the MLE is the same as maximizing the log likelihood… but:
Multiplications turn into additions
We don't have to deal with such tiny numbers

Page 32:

MAXIMUM LIKELIHOOD

The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained, because log is a monotonically increasing function

l(θ) = log P(d|θ) = log[ θ^c (1-θ)^(N-c) ]

Page 33:

MAXIMUM LIKELIHOOD

The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained, because log is a monotonically increasing function

l(θ) = log P(d|θ) = log[ θ^c (1-θ)^(N-c) ] = log[ θ^c ] + log[ (1-θ)^(N-c) ]

Page 34:

MAXIMUM LIKELIHOOD

The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained, because log is a monotonically increasing function

l(θ) = log P(d|θ) = log[ θ^c ] + log[ (1-θ)^(N-c) ] = c log θ + (N-c) log(1-θ)

Page 35:

MAXIMUM LIKELIHOOD

The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained, because log is a monotonically increasing function

l(θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
At a maximum of a function, its derivative is 0
So dl/dθ(θ) = 0 at the maximum likelihood estimate

Page 36:

MAXIMUM LIKELIHOOD

The maximum of P(d|q) is obtained at the same place that the maximum of log P(d|q) is obtained Log is a monotonically increasing function

l(q) = log P(d|q) = c log q + (N-c) log (1-q) At a maximum of a function, its derivative is 0 So, dl/dq( )q = 0 at the maximum likelihood

estimate => 0 = c/q – (N-c)/(1-q)

=> q = c/N
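A quick numerical check of this result, as a sketch with made-up counts c = 7, N = 10 (NumPy assumed):

```python
import numpy as np

c, N = 7, 10
thetas = np.linspace(0.001, 0.999, 999)              # avoid log(0) at the endpoints
log_lik = c * np.log(thetas) + (N - c) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)], c / N)             # both ~ 0.7
```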

Page 37:

OTHER CLOSED-FORM MLE RESULTS

Multi-valued variables: take fraction of counts for each value

Continuous Gaussian distributions: take average value as mean, standard deviation of data as standard deviation

Page 38:

MAXIMUM LIKELIHOOD FOR BN

For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matching parent values

[Bayes net: Earthquake → Alarm ← Burglar]

N = 1000, E: 500, B: 200
P(E) = 0.5, P(B) = 0.2

A|E,B: 19/20
A|¬E,B: 188/200
A|E,¬B: 170/500
A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003

Page 39:

BAYES NET ML ALGORITHM

Input: Bayes net with nodes X1,…,Xn, dataset D = (d[1],…,d[N])
Each d[i] = (x1[i],…,xn[i]) is a sample of the full state of the world

For each node X with parents Y1,…,Yk:
For all y1 ∈ Val(Y1), …, yk ∈ Val(Yk):
For all x ∈ Val(X), count the number of times (X=x, Y1=y1,…, Yk=yk) is observed in D. Let this be mx
Count the number of times (Y1=y1,…, Yk=yk) is observed in D. Let this be m (note m = Σx mx)
Set P(x|y1,…,yk) = mx / m for all x ∈ Val(X)
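A minimal Python sketch of this counting procedure for a single node (the data representation and the helper name ml_cpt are assumptions for illustration, not the lecture's code):

```python
from collections import Counter
from itertools import product

def ml_cpt(data, child, parents, values):
    """Estimate P(child | parents) by relative frequency, following the algorithm above.

    data:    list of dicts, each a full assignment to all variables
    values:  dict mapping each variable name to its list of possible values
    """
    cpt = {}
    for parent_vals in product(*(values[y] for y in parents)):
        matching = [d for d in data if all(d[y] == v for y, v in zip(parents, parent_vals))]
        m = len(matching)                              # count of (Y1=y1, ..., Yk=yk)
        counts = Counter(d[child] for d in matching)   # counts m_x of (X=x, Y1=y1, ..., Yk=yk)
        for x in values[child]:
            cpt[(x, parent_vals)] = counts[x] / m if m > 0 else None  # undefined if never seen
    return cpt

# Tiny made-up dataset for the Alarm fragment:
data = [{"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 0, "A": 0}, {"E": 1, "B": 0, "A": 0}]
print(ml_cpt(data, "A", ["E", "B"], {"A": [0, 1], "E": [0, 1], "B": [0, 1]}))
```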

Page 40:

MAXIMUM LIKELIHOOD PROPERTIES

As the number of data points approaches infinity, the MLE approaches the true parameter value

With little data, MLEs can vary wildly

Page 41:

MAXIMUM LIKELIHOOD IN CANDY BAG EXAMPLE

hML = argmax_hi P(d|hi)

hML: undefined before any data, then h5 once a lime has been drawn

[Plot comparing the Bayesian prediction P(X|d) with the ML prediction P(X|hML)]

Page 42:

BACK TO COIN FLIPS

The MLE is easy to compute… but what about those small sample sizes?

Motivation: you hand me a coin from your pocket
1 flip, turns up tails
What's the MLE?

A particularly acute problem for BN nodes with many parents!

(the data fragmentation problem)

Page 43:

MAXIMUM A POSTERIORI ESTIMATION

Maximum a posteriori (MAP) estimation
Idea: use the hypothesis prior to get a better initial estimate than ML, without resorting to full Bayesian estimation

"Most coins I've seen have been fair coins, so I won't let the first few tails sway my estimate much"

“Now that I’ve seen 100 tails in a row, I’m pretty sure it’s not a fair coin anymore”

Page 44:

MAXIMUM A POSTERIORI

P(θ|d) is the posterior probability of the hypothesis, given the data

argmax_θ P(θ|d) is known as the maximum a posteriori (MAP) estimate

Posterior of hypothesis θ given data d = {d1,…,dN}: P(θ|d) = α P(d|θ) P(θ)
Max over θ doesn't affect α
So the MAP estimate is argmax_θ P(d|θ) P(θ)
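As a purely illustrative sketch, the MAP estimate for the one-tail coin example can be found by brute force over a grid of θ values, here with an assumed Beta(2, 2) prior that mildly favors fair coins (SciPy assumed):

```python
import numpy as np
from scipy.stats import beta

c, N = 0, 1                                   # one flip, came up tails: zero "cherries"/heads
thetas = np.linspace(0.001, 0.999, 999)
unnorm_posterior = thetas ** c * (1 - thetas) ** (N - c) * beta.pdf(thetas, 2, 2)
print(thetas[np.argmax(unnorm_posterior)])    # ~ 0.33, instead of the MLE c/N = 0
```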

Page 45:

MAXIMUM A POSTERIORI

hMAP = argmax_hi P(hi|d)

hMAP: h3 at first, then h4, then h5 as more limes are observed

[Plot comparing the Bayesian prediction P(X|d) with the MAP prediction P(X|hMAP)]

Page 46:

ADVANTAGES OF MAP AND MLE OVER BAYESIAN ESTIMATION

Involves an optimization rather than a large summation
Local search techniques

For some types of distributions, there are closed-form solutions that are easily computed

Page 47:

BACK TO COIN FLIPS

Need some prior distribution P(θ)
P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)

Define, for all θ, the probability that I believe in θ

[Plot: a prior density P(θ) over θ ∈ [0, 1]]

Page 48:

MAP ESTIMATE

Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization

Turns out for some families of P(θ), the MAP estimate is easy to compute

[Plot: Beta distribution priors P(θ) over θ ∈ [0, 1]]

Page 49:

BETA DISTRIBUTION

Beta_{α,β}(θ) = γ θ^(α-1) (1-θ)^(β-1)

α, β are hyperparameters > 0
γ is a normalization constant
α = β = 1 is the uniform distribution
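A small sanity check of this formula (a sketch; it assumes the standard normalization γ = Γ(α+β) / (Γ(α)Γ(β)), which the slide leaves unspecified):

```python
import math

def beta_pdf(theta, a, b):
    g = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))  # normalization constant gamma
    return g * theta ** (a - 1) * (1 - theta) ** (b - 1)

print(beta_pdf(0.3, 1, 1), beta_pdf(0.8, 1, 1))  # both 1.0: alpha = beta = 1 is uniform
print(beta_pdf(0.5, 5, 5), beta_pdf(0.1, 5, 5))  # larger alpha = beta concentrates mass near 0.5
```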

Page 50:

POSTERIOR WITH BETA PRIOR

Posterior: θ^c (1-θ)^(N-c) P(θ) = γ θ^(c+α-1) (1-θ)^(N-c+β-1) = Beta_{α+c, β+N-c}(θ)

Prediction = mean E[θ] = (c+α)/(N+α+β)
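The conjugate update is one line of arithmetic; here is a sketch with an assumed Beta(3, 3) prior and a single observed tail (the names are illustrative):

```python
def beta_posterior(alpha, beta, c, N):
    """Beta(alpha, beta) prior + c cherries/heads out of N draws -> Beta(alpha + c, beta + N - c)."""
    return alpha + c, beta + N - c

alpha, beta = 3.0, 3.0             # roughly "2 virtual heads and 2 virtual tails"
c, N = 0, 1                        # observe a single tail
a_post, b_post = beta_posterior(alpha, beta, c, N)
print(a_post / (a_post + b_post))  # predictive mean (c + alpha) / (N + alpha + beta) ~ 0.43
```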

Page 51:

POSTERIOR WITH BETA PRIOR

What does this mean?
The prior specifies a "virtual count" of a = α-1 heads, b = β-1 tails
See heads, increment a
See tails, increment b
The MAP estimate is a/(a+b)
The effect of the prior diminishes with more data

Page 52:

MAP FOR BN

For any BN, the MAP parameters of any CPT can be derived from the fraction of observed values in the data plus the prior virtual counts

[Bayes net: Earthquake → Alarm ← Burglar]

Real + virtual counts: N = 1000+1000, E: 500+100, B: 200+200
P(E) = 0.5 → 0.3, P(B) = 0.2 → 0.2

A|E,B: (19+100)/(20+100)
A|¬E,B: (188+180)/(200+200)
A|E,¬B: (170+100)/(500+400)
A|¬E,¬B: (1+3)/(380+300)

E B P(A|E,B)  (ML → MAP)
T T 0.95 → 0.99
F T 0.94 → 0.92
T F 0.34 → 0.3
F F 0.003 → 0.006
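The table's entries can be reproduced with a one-line estimator combining real and virtual counts (a sketch, with illustrative argument names):

```python
def map_estimate(real_pos, real_total, virtual_pos, virtual_total):
    """MAP CPT entry: (observed successes + virtual successes) / (observed total + virtual total)."""
    return (real_pos + virtual_pos) / (real_total + virtual_total)

print(map_estimate(19, 20, 100, 100))   # ~ 0.99, vs. the ML estimate 19/20 = 0.95
print(map_estimate(1, 380, 3, 300))     # ~ 0.006, vs. the ML estimate 1/380 ~ 0.003
```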

Page 53:

EXTENSIONS OF BETA PRIORS

Multiple discrete values: Dirichlet prior
The mathematical expression is more complex, but in practice it still takes the form of "virtual counts"
Mean, standard deviation for Gaussian distributions: Gamma prior
Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions

Page 54:

RECAP

Bayesian learning: infer the entire distribution over hypotheses
Robust, but expensive for large problems

Maximum Likelihood (ML): optimize likelihood
Good for large datasets, not robust for small ones

Maximum A Posteriori (MAP): weight the likelihood by a hypothesis prior during optimization
A compromise solution for small datasets
Like Bayesian learning, how to specify the prior?

Page 55:

HOMEWORK

Read R&N 18.1-3

Page 56:

APPLICATION TO STRUCTURE LEARNING

So far we’ve assumed a BN structure, which assumes we have enough domain knowledge to declare strict independence relationships

Problem: how to learn a sparse probabilistic model whose structure is unknown?

Page 57:

BAYESIAN STRUCTURE LEARNING

But what if the Bayesian network has unknown structure?

Are Earthquake and Burglar independent? P(E,B) = P(E)P(B)?

[Figure: Earthquake and Burglar nodes with a candidate edge marked "?"]

B E
0 0
1 1
0 1
1 0
… …

Page 58:

BAYESIAN STRUCTURE LEARNING

Hypothesis 1: dependent. P(E,B) as a probability table has 3 free parameters
Hypothesis 2: independent. The individual tables P(E), P(B) have 2 total free parameters

Maximum likelihood will always prefer hypothesis 1!


Page 59:

BAYESIAN STRUCTURE LEARNING

Occam’s razor: Prefer simpler explanations over complex ones

Penalize free parameters in probability tables


Page 60:

ITERATIVE GREEDY ALGORITHM

Fit a simple model (ML), compute likelihood
Repeat:

Page 61:

ITERATIVE GREEDY ALGORITHM

Fit a simple model (ML), compute likelihood
Repeat:
Pick a candidate edge to add to the BN

Page 62:

ITERATIVE GREEDY ALGORITHM

Fit a simple model (ML), compute likelihood
Repeat:
Pick a candidate edge to add to the BN
Compute new ML probabilities

Page 63:

ITERATIVE GREEDY ALGORITHM

Fit a simple model (ML), compute likelihood
Repeat:
Pick a candidate edge to add to the BN
Compute new ML probabilities
If the new likelihood exceeds the old likelihood + threshold, add the edge
Otherwise, repeat with a different edge

(Usually log likelihood)
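In code, the greedy loop might look roughly like the sketch below; fit_ml and log_likelihood are hypothetical helpers standing in for parameter fitting and scoring, not functions from the lecture:

```python
def greedy_structure_search(variables, data, candidate_edges, threshold, fit_ml, log_likelihood):
    edges = []                                      # start with the fully independent model
    params = fit_ml(variables, edges, data)
    best = log_likelihood(params, edges, data)
    improved = True
    while improved:
        improved = False
        for edge in candidate_edges:
            if edge in edges:
                continue
            trial = edges + [edge]                  # candidate edge to add to the BN
            trial_params = fit_ml(variables, trial, data)
            score = log_likelihood(trial_params, trial, data)
            if score > best + threshold:            # add the edge only if it helps "enough"
                edges, params, best = trial, trial_params, score
                improved = True
                break                               # otherwise the loop tries a different edge
    return edges, params
```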

Page 64:

OTHER TOPICS IN LEARNING DISTRIBUTIONS

Learning continuous distributions using kernel density estimation / nearest neighbor smoothing

Expectation Maximization for hidden variables:
Learning Bayes nets with hidden variables
Learning HMMs from observations
Learning Gaussian Mixture Models of continuous data

Page 65:

NEXT TIME

Introduction to machine learning R&N 18.1-2