
Clustering under Constraints with Genetic Algorithms
by Albert Ali Salah, Stanislav Redman, Gabriella Kovacs


Outline
• Definition of the problem
• Background on genetic algorithms
• Case study: Workgroup assignment
• Results

Clustering under Constraints
• N multi-dimensional data items
• A bunch of soft constraints
• (A bunch of hard constraints)
• The problem: Clustering the data points so that the hard constraints are satisfied, and the soft constraints are optimized.

Constrained Clustering
• Constrained clustering is an unsupervised learning technique, where some data items are known to be in the same cluster, and some are known to be in different clusters.
• Clustering under constraints is an optimization problem (I saw Karp in the elevator, and he said it’s probably NP-complete)

Genetic Algorithms
• A GA is essentially a heuristic random search tool
• Has no fully rigorous mathematical foundation; why it works is not well understood
• Used frequently in soft constraint optimization, rarely in clustering

Details You All Know
• Solutions are ‘coded’ into simple, DNA-like structures called chromosomes
• A fitness function is supplied to evaluate the quality of solutions
• The algorithm works on a population of individuals
• There is a Genetic Algorithm package written for the object-oriented Dolphin Smalltalk environment

Genetic Algorithm Flowchart
Initial Population → End Criteria Reached?
• No: Selection → Cross-over → Mutation → New Population → End Criteria Reached?
• Yes: Output Best Individual
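The loop in the flowchart can be sketched in a few lines of Python. This is only an illustration, not the Dolphin Smalltalk package mentioned earlier: the function name `run_ga`, the roulette-wheel selection and the fixed-generation stopping rule are assumptions, while the default numbers are the ones listed on the GA parameters slide.

```python
import random


def run_ga(fitness, chromosome_length, population_size=100, generations=30,
           crossover_prob=0.4, mutation_prob=0.001, seed=None):
    """Return the fittest bit list seen during a simple generational GA."""
    if population_size % 2 or chromosome_length < 2:
        raise ValueError("need an even population and at least 2 bits")
    rng = random.Random(seed)
    # Initial population: random bit lists.
    population = [[rng.randint(0, 1) for _ in range(chromosome_length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    # End criterion (assumed): a fixed number of generations.
    for _ in range(generations):
        # Selection (assumed): fitness-proportional sampling with replacement.
        weights = [fitness(ind) + 1e-9 for ind in population]
        parents = rng.choices(population, weights=weights, k=population_size)
        population = []
        for mother, father in zip(parents[0::2], parents[1::2]):
            child_a, child_b = mother[:], father[:]
            # Cross-over: single point, applied with probability crossover_prob.
            if rng.random() < crossover_prob:
                cut = rng.randrange(1, chromosome_length)
                child_a = mother[:cut] + father[cut:]
                child_b = father[:cut] + mother[cut:]
            population += [child_a, child_b]
        # Mutation: flip each bit independently with probability mutation_prob.
        for individual in population:
            for position in range(chromosome_length):
                if rng.random() < mutation_prob:
                    individual[position] ^= 1
        # Remember the best individual for output (it is not re-inserted).
        best = max(population + [best], key=fitness)
    return best
```

For example, `run_ga(lambda bits: sum(bits) / len(bits), 12, seed=1)` searches for a 12-bit string with as many ones as possible; the fitness must be non-negative for the selection weights to make sense.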

Case Study: Santa Fe
• Aim: Cluster people such that:

– Groups are balanced in number of students

– Each group consists of people with similar interests

– Each group has some people with basic skills

– Each group possesses enough knowledge in its areas of interest

Problem 1: Representation
• A good GA representation is:
– unambiguous
– short (k bits means 2^k search space)
– smooth with respect to fitness landscape
– robust to mutations
– free of preferential bias
– simple to decode

Representation
• 01101001010010101001010…
• Three bits code the group number
• The position indicates the student number (1, 2, 3, 4, …)
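A small decoder makes the representation concrete. The sketch below assumes that each three-bit field is read most-significant bit first, that groups are numbered 0 to 7, and that the first field belongs to student 1; the slide does not state these details, and the names `decode` and `encode` are made up for the example.

```python
def decode(chromosome, bits_per_student=3):
    """Turn a bit string into a list of group numbers, one per student."""
    if len(chromosome) % bits_per_student:
        raise ValueError("chromosome length must be a multiple of the field width")
    return [int(chromosome[start:start + bits_per_student], 2)
            for start in range(0, len(chromosome), bits_per_student)]


def encode(groups, bits_per_student=3):
    """Inverse of decode: pack group numbers back into a bit string."""
    return "".join(format(group, f"0{bits_per_student}b") for group in groups)


# First twelve bits of the string shown on the slide: students 1 to 4.
print(decode("011010010100"))  # [3, 2, 2, 4]
```

With three bits per student, every bit pattern is a valid assignment to one of eight groups, so mutation and cross-over never produce an undecodable chromosome.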

Problem 2: Fitness
• A good fitness function is:
– between 0 (awful) and 1 (optimal)
– a correct ordering of individuals with respect to their closeness to the optimal solution
– informative, and indicative of relative fitness
– pragmatic about the boundary conditions
– simple and fast to calculate

Composite Fitness
• Assume there are n different, possibly independent fitness criteria. Let f_1, f_2, …, f_n be the individual fitness functions that order the solutions according to individual criteria. The total fitness function is

$$F = \sum_{i=1}^{n} \alpha_i f_i$$

where α_i are coefficients to be determined
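As a rough illustration, the combination step might look like the following, assuming the total fitness is a weighted sum of the individual terms. The function name `composite_fitness` is hypothetical, and the choice to default to equal coefficients that sum to 1 (so the total stays between 0 and 1) is an assumption.

```python
def composite_fitness(terms, coefficients=None):
    """Weighted sum of individual fitness terms f_1 .. f_n."""
    if coefficients is None:
        # Assumed default: equal coefficients that sum to 1.
        coefficients = [1.0 / len(terms)] * len(terms)
    if len(coefficients) != len(terms):
        raise ValueError("one coefficient per fitness term is required")
    return sum(a * f for a, f in zip(coefficients, terms))


print(composite_fitness([0.5, 1.0]))                # 0.75
print(composite_fitness([0.5, 1.0], [0.25, 0.75]))  # 0.875
```

Keeping the coefficients normalized means the composite value still satisfies the "between 0 and 1" requirement from the fitness slide whenever every term does.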

f_1 : Interest Term
N : number of students
M : number of groups
S : number of interests
p_i : interest vector of student i
g_j : mean interest vector of group j
δ_ij : Kronecker delta

$$f_1 = \frac{9SN - \sum_{i=1}^{N}\sum_{j=1}^{M} \delta_{ij}\,(p_i - g_j)^2}{9SN}$$
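A sketch of how the interest term could be evaluated for one assignment. It assumes that δ_ij is 1 exactly when student i is placed in group j, that (p_i − g_j)^2 means the squared Euclidean distance between the two vectors, and that the factor 9 is the largest squared difference on one interest scale; the function name and the toy data are invented for the example.

```python
def interest_term(interests, assignment, num_groups, max_sq_diff=9.0):
    """f_1: 1 minus the normalized within-group scatter of interest vectors."""
    n = len(interests)       # N, number of students
    s = len(interests[0])    # S, number of interests
    members = [[] for _ in range(num_groups)]
    for vector, group in zip(interests, assignment):
        members[group].append(vector)
    scatter = 0.0
    for group in members:
        if not group:
            continue  # an empty group contributes nothing to the double sum
        mean = [sum(column) / len(group) for column in zip(*group)]  # g_j
        for vector in group:
            scatter += sum((p - g) ** 2 for p, g in zip(vector, mean))
    norm = max_sq_diff * s * n  # 9SN with the default
    return (norm - scatter) / norm


# Three students, two interests, groups {0, 1} and {2}.
toy = [[1, 4], [3, 4], [2, 2]]
print(interest_term(toy, [0, 0, 1], num_groups=2))  # 52/54
```

In the toy case the only scatter comes from the first group (1 + 1 = 2), so the term is (54 − 2) / 54.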

Problem with f_1
• 9SN is too large a normalization factor: all decent individuals (with small distances from the mean) will have f_1 very close to 1.
• General solution: a replacement expressed in terms of dist, max_dist and average_dist.
[Equations garbled in the source: a quantity z built from dist, max_dist and average_dist, and a revised f_1 that keeps the double sum \sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}(p_i - g_j)^2 and the factor SN, together with the constant 8.0.]

f_2 : Balance Term
N : number of students
M : number of groups
n_j : number of students in group j

$$f_2 = \frac{N^2 - \sum_{j=1}^{M}\left(n_j - \frac{N}{M}\right)^2}{N^2}$$
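The balance term is easy to check by hand. The sketch below assumes the form f_2 = (N² − Σ_j (n_j − N/M)²) / N²; with the group sizes read off the results slides later in the deck (52 students, 8 groups) it reproduces the first fitness value reported there for the nearest-neighbor and GA groupings. The function name is illustrative.

```python
def balance_term(group_sizes):
    """f_2: penalize squared deviations of group sizes from N / M."""
    n = sum(group_sizes)   # N, number of students
    m = len(group_sizes)   # M, number of groups
    ideal = n / m
    spread = sum((size - ideal) ** 2 for size in group_sizes)
    return (n * n - spread) / (n * n)


# Nearest neighbor result: one group of 45 and seven singletons.
print(round(balance_term([45, 1, 1, 1, 1, 1, 1, 1]), 8))  # 0.37352071
# GA result: group sizes 5, 9, 10, 7, 6, 6, 4, 5.
print(round(balance_term([5, 9, 10, 7, 6, 6, 4, 5]), 9))  # 0.988905325
```

Note that the term only reaches exactly 1 when N is divisible by M; with 52 students in 8 groups the ideal size is 6.5, so some penalty is unavoidable.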

f_3 : Basic Skills Term
M : number of groups
B : number of basic skills
b_ik : kth skill of student i
δ_ij : Kronecker delta

$$f_3 = \frac{9MB - \sum_{j=1}^{M}\sum_{k=1}^{B}\left(4 - \max_i(\delta_{ij}\, b_{ik})\right)^2}{9MB}$$
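One possible reading of the basic skills term in code: for every group and every skill, take the best skill level among the group's members and penalize its squared shortfall from the top level 4. The sketch assumes δ_ij marks membership of student i in group j and that skill levels run from 1 to 4; the function name and toy data are hypothetical.

```python
def basic_skills_term(skills, assignment, num_groups, top_level=4):
    """f_3: reward groups whose best member is strong in every basic skill."""
    b = len(skills[0])  # B, number of basic skills
    # best[j][k] = max over members i of group j of b_ik (0 for an empty group)
    best = [[0] * b for _ in range(num_groups)]
    for vector, group in zip(skills, assignment):
        for k, level in enumerate(vector):
            best[group][k] = max(best[group][k], level)
    shortfall = sum((top_level - best[j][k]) ** 2
                    for j in range(num_groups) for k in range(b))
    norm = 9 * num_groups * b  # 9MB
    return (norm - shortfall) / norm


# Group 0 has an expert in skill 0 and a level-3 member in skill 1;
# group 1 consists of a single beginner.
toy = [[4, 1], [2, 3], [1, 1]]
print(basic_skills_term(toy, [0, 0, 1], num_groups=2))  # 17/36
```

Because the sum runs over group and skill pairs rather than students, a single strong member per skill is enough to satisfy the term for a whole group.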

f_4 : Knowledge Term
M : number of groups
S : number of interests
h_ik : kth knowledge term of student i
δ_ij : Kronecker delta
λ_jk : 1 if kth interest term is among the first three interests of group j, 0 otherwise.

$$f_4 = \frac{27M - \sum_{j=1}^{M}\sum_{k=1}^{S}\lambda_{jk}\left(4 - \max_i(\delta_{ij}\, h_{ik})\right)^2}{27M}$$
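The knowledge term differs from the basic skills term mainly in the mask that selects each group's top three interests. A sketch under several assumptions: the top interests are taken from the group's mean interest vector, ties are broken by the lower index, and knowledge levels top out at 4. None of these details is spelled out on the slide, and the function name is invented.

```python
def knowledge_term(interests, knowledge, assignment, num_groups,
                   top_level=4, top_n=3):
    """f_4: reward groups that contain expertise in their own top interests."""
    s = len(interests[0])  # S, number of interests
    shortfall = 0.0
    for j in range(num_groups):
        members = [i for i, group in enumerate(assignment) if group == j]
        # Mean interest vector of group j (all zeros for an empty group).
        mean = [sum(interests[i][k] for i in members) / len(members)
                if members else 0.0 for k in range(s)]
        # Indices of the top_n interests; ties go to the lower index (assumed).
        top = sorted(range(s), key=lambda k: (-mean[k], k))[:top_n]
        for k in top:
            best = max((knowledge[i][k] for i in members), default=0)
            shortfall += (top_level - best) ** 2
    norm = 9 * top_n * num_groups  # 27M when top_n is 3
    return (norm - shortfall) / norm


toy_interests = [[4, 3, 2, 1], [4, 3, 2, 1]]
toy_knowledge = [[4, 1, 2, 4], [1, 3, 2, 4]]
print(knowledge_term(toy_interests, toy_knowledge, [0, 0], num_groups=1))  # 22/27
```

In the toy case the group's top interests are the first three columns, where the best knowledge levels are 4, 3 and 2, giving a shortfall of 0 + 1 + 4 = 5 out of 27.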

GA parameters
• Population size: 100
• Generations: 30
• Crossover probability: 0.4 (single point)
• Mutation probability: 0.001
• Equal coefficients
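A quick back-of-the-envelope calculation shows what these settings imply. The student count of 52 is an assumption obtained by counting the names on the results slides; the three bits per student come from the representation slide.

```python
population_size = 100
generations = 30
mutation_prob = 0.001
num_students = 52      # assumption: names counted on the results slides
bits_per_student = 3   # from the representation slide

# Length of one chromosome in bits.
chromosome_bits = num_students * bits_per_student
# Roughly one fitness evaluation per individual per generation.
fitness_evaluations = population_size * generations
# Expected number of flipped bits across the whole population per generation.
expected_flips_per_generation = population_size * chromosome_bits * mutation_prob
# Number of distinct chromosomes the search could visit in principle.
search_space = 2 ** chromosome_bits

print(chromosome_bits, fitness_evaluations)
print(round(expected_flips_per_generation, 1))
```

Under these assumptions the run samples about 3000 candidates out of 2^156, with only around 16 bit flips per generation, so most of the exploration comes from selection and cross-over.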

Some entertaining facts about the dataset

Basic skills
Skill        Average  Experts  Beginners
Mathematics  2.83     9        4
Programming  2.75     14       11
English      3.10     19       1
Statistics   2.87     8        1

Interests
[Bar chart; horizontal axis from 0.0 to 3.5. Categories in slide order: Self-organization, Computer Science, Multi-Agent Systems, Evolution, Biology, Neural Nets & Simulation, Information Theory, Economics, Optimization, Cognitive Science, Physics, Social Networks, Psychology, Neuroscience, Philosophy, Anthropology, Quantum Consciousness]

Knowledge
[Bar chart; horizontal axis from 0.0 to 3.0. Categories in slide order: Computer Science, Evolution, Physics, Optimization, Multi-Agent Systems, Neural Nets & Simulation, Biology, Self-organization, Information Theory, Economics, Philosophy, Cognitive Science, Psychology, Social Networks, Neuroscience, Anthropology, Quantum Consciousness]

TOP 10 knowledge-seeking people

Irina

Anton

Mourad

Zoltan

Anukool

Angel

Lyudmila

Mianlai

Aaron

Arthur

TOP 10 knowledgeable people

Anton

Louise

Arndt

Angel

Suzanne

Mark

Nilanjana

Wojciech

Albert

Aaron

Some serious results

Clustering of interest vectors with
• Nearest neighbor
• Furthest neighbor
• Average linkage
• Ward linkage
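For reference, the four between-cluster distances can be written compactly. This is a sketch using the usual textbook definitions with Euclidean distance as dist, and the Ward criterion in its squared-centroid-distance form; it is not the code used to produce the results, and the function names are chosen for the example.

```python
import math


def nearest_neighbor(r, s):
    """Single linkage: smallest distance between any pair of points."""
    return min(math.dist(x, y) for x in r for y in s)


def furthest_neighbor(r, s):
    """Complete linkage: largest distance between any pair of points."""
    return max(math.dist(x, y) for x in r for y in s)


def average_linkage(r, s):
    """Mean distance over all pairs, one point from each cluster."""
    return sum(math.dist(x, y) for x in r for y in s) / (len(r) * len(s))


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]


def ward_linkage(r, s):
    """n_r * n_s * squared centroid distance / (n_r + n_s)."""
    n_r, n_s = len(r), len(s)
    return n_r * n_s * math.dist(centroid(r), centroid(s)) ** 2 / (n_r + n_s)


r = [(0, 0), (0, 2)]
s = [(3, 0), (4, 0)]
print(nearest_neighbor(r, s))  # 3.0
print(ward_linkage(r, s))      # 13.25
```

Nearest neighbor tends to chain points into one large cluster, which is consistent with the very unbalanced grouping it produces on this dataset, while Ward linkage favors compact clusters of similar size.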

Nearest neighbor

$$d(r,s) = \min\big(\mathrm{dist}(x_{ri}, x_{sj})\big), \quad i = 1,\dots,n_r,\; j = 1,\dots,n_s$$

FITNESS TERMS: 0.37352071 0.847012823 0.722222222 0.916006652

GROUP 1: Natalia, Nilanjana, Angel, Arndt, Alexander, Wojciech, Frederic, Jason, Gerard, Ferenc, Sergey, Milica, Zoltan, Bartlomiej, Aaron, Pau, Sergey, Jasper, Matthew, Mark, Eva, Volodymyr, Victor, Oleksiy, Anukool, Hilary, Lyudmila, Alex, Vaclav, Anton, Mourad, Nicholas, Arthur, Carolyn, Stanislav, Denis, Suzanne, Albert, Lisa, Vadim, Pavel, Sergiy, Valentin, Mianlai, Gordan
Interests: Self-organization (2.98), Evolution (2.8), Computer Science (2.78)

GROUP 2: Louise
Interests: Anthropology (4), Biology (4), Cognitive Science (4)

GROUP 3: Tatyana
Interests: Cognitive Science (4), Computer Science (4), Information Theory (4)

GROUP 4: Gabriella
Interests: Computer Science (4), Information Theory (4), Optimization (4)

GROUP 5: Ana-Maria
Interests: Social Networks (4), Cognitive Science (3), Multi-Agent Systems (3)

GROUP 6: Angelica
Interests: Cognitive Science (4), Computer Science (4), Multi-Agent Systems (4)

GROUP 7: Christophe
Interests: Cognitive Science (4), Neural Nets & Simulation (4), Psychology (4)

GROUP 8: Irina
Interests: Cognitive Science (4), Computer Science (4), Information Theory (4)

Furthest neighbor

$$d(r,s) = \max\big(\mathrm{dist}(x_{ri}, x_{sj})\big), \quad i = 1,\dots,n_r,\; j = 1,\dots,n_s$$

FITNESS TERMS: 0.926035503 0.887127441 0.958333333 0.964728892

GROUP 1: Hilary, Angel, Mark, Mourad, Jason
Interests: Psychology (3.8), Evolution (3.6), Anthropology (3.2)

GROUP 2: Bartlomiej, Louise, Alexander, Matthew, Valentin, Angelica, Victor
Interests: Evolution (3.43), Multi-Agent Systems (3.29), Social Networks (3.29)

GROUP 3: Suzanne, Aaron, Alex, Arndt, Wojciech
Interests: Evolution (3.57), Biology (3.2929), Self-organization (3.14285714)

GROUP 4: Lisa, Gerard
Interests: Social Networks (4), Cognitive Science (3), Multi-Agent Systems (3)

GROUP 5: Sergiy, Albert, Christophe
Interests: Information Theory (2.625), Physics (2.625), Self-organization (2.625)

GROUP 6: Natalia, Nilanjana, Lyudmila, Vaclav, Anton, Frederic, Arthur, Ferenc, Stanislav, Milica, Denis, Sergey, Jasper, Pavel, Mianlai, Volodymyr, Gabriella, Oleksiy, Anukool
Interests: Cognitive Science (4), Computer Science (4), Multi-Agent Systems (4)

GROUP 7: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gordan
Interests: Cognitive Science (3.33), Neural Nets & Simulation (3.33), Biology (3)

GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3.75), Cognitive Science (3.5), Computer Science (3.5)

Average linkage

$$d(r,s) = \frac{1}{n_r n_s}\sum_{i=1}^{n_r}\sum_{j=1}^{n_s}\mathrm{dist}(x_{ri}, x_{sj})$$

FITNESS TERMS: 0.821745562 0.879219281 0.902777778 0.951247491

GROUP 1: Natalia, Nilanjana, Angel, Wojciech, Frederic, Jason, Ferenc, Milica, Aaron, Sergey, Jasper, Mark, Volodymyr, Gabriella, Oleksiy, Hilary, Lyudmila, Vaclav, Anton, Mourad, Arthur, Stanislav, Denis, Suzanne, Pavel, Mianlai
Interests: Self-organization (3.15), Multi-Agent Systems (3.04), Computer Science (3)

GROUP 2: Anukool
Interests: Computer Science (4), Neuroscience (4), Optimization (4)

GROUP 3: Bartlomiej, Lisa, Alexander, Matthew, Valentin, Gerard, Victor
Interests: Evolution (3.57), Biology (3.29), Self-organization (3.14)

GROUP 4: Ana-Maria
Interests: Social Networks (4), Cognitive Science (3), Multi-Agent Systems (3)

GROUP 5: Pau, Alex, Arndt, Vadim, Eva, Nicholas, Sergey, Gordan
Interests: Information Theory (2.625), Physics (2.625), Self-organization (2.625)

GROUP 6: Angelica, Louise
Interests: Cognitive Science (4), Computer Science (4), Multi-Agent Systems (4)

GROUP 7: Sergiy, Albert, Christophe
Interests: Cognitive Science (3.33), Neural Nets & Simulation (3.333), Biology (3)

GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3.75), Cognitive Science (3.5), Computer Science (3.5)

Ward linkage

$$d(r,s) = \frac{n_r n_s\,\mathrm{dist}(\bar{x}_r, \bar{x}_s)^2}{n_r + n_s}$$

FITNESS TERMS: 0.968195266 0.891630074 0.972222222 0.965034915

GROUP 1: Lisa, Alex, Arndt, Frederic, Gerard
Interests: Self-organization (3.6), Biology (3.4), Evolution (3.4)

GROUP 2: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gabriella, Gordan
Interests: Physics (2.625), Self-organization (2.625), Computer Science (2.5)

GROUP 3: Bartlomiej, Matthew, Valentin, Alexander
Interests: Economics (3.25), Evolution (3.25), Biology (3)

GROUP 4: Louise, Mianlai, Volodymyr, Victor, Angelica
Interests: Computer Science (4), Multi-Agent Systems (4), Self-organization (3.8)

GROUP 5: Sergiy, Albert, Christophe
Interests: Cognitive Science (3.33), Neural Nets & Simulation (3.33), Biology (3)

GROUP 6: Stanislav, Natalia, Denis, Sergey, Vaclav, Anton, Pavel, Ferenc, Milica, Oleksiy
Interests: Computer Science (3.4), Neural Nets & Simulation (3.4), Economics (3.3)

GROUP 7: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3.75), Cognitive Science (3.5), Computer Science (3.5)

GROUP 8: Hilary, Lyudmila, Nilanjana, Angel, Wojciech, Mourad, Jason, Arthur, Suzanne, Aaron, Jasper, Mark, Anukool
Interests: Biology (3.38), Evolution (3.38), Self-organization (3.23)

Genetic algorithm

FITNESS TERMS: 0.988905325 0.845403674 0.989583333 0.981469795

GROUP 1: Arndt, Tatyana, Mianlai, Sergey, Zoltan
Interests: Self-organization (4), Neural Nets & Simulation (3.6), Physics (3.4)

GROUP 2: Denis, Pau, Alex, Ana-Maria, Lisa, Vadim, Sergiy, Eva, Milica
Interests: Computer Science (2.56), Neural Nets & Simulation (2.56), Evolution (2.44)

GROUP 3: Stanislav, Natalia, Nilanjana, Gordan, Mourad, Gerard, Ferenc, Victor, Valentin, Oleksiy
Interests: Computer Science (3.1), Multi-Agent Systems (3.1), Self-organization (2.9)

GROUP 4: Suzanne, Lyudmila, Angel, Wojciech, Mark, Anton, Nicholas
Interests: Self-organization (3.43), Evolution (3.14), Psychology (3)

GROUP 5: Christophe, Aaron, Hilary, Albert, Alexander, Frederic
Interests: Cognitive Science (3), Biology (2.83), Evolution (2.67)

GROUP 6: Bartlomiej, Sergey, Jasper, Vaclav, Pavel, Gabriella
Interests: Economics (3.33), Self-organization (3), Computer Science (2.67)

GROUP 7: Matthew, Angelica, Louise, Arthur
Interests: Biology (3.75), Evolution (3.5), Self-organization (3.5)

GROUP 8: Anukool, Irina, Jason, Volodymyr, Carolyn
Interests: Computer Science (3.2), Information Theory (3.2), Philosophy (3.2)

Comparison of results
              Nearest Neighbour  Furthest Neighbour  Average Linkage  Ward Linkage  GA
Balance       0.37               0.93                0.82             0.97          0.99
Interests     0.85               0.89                0.88             0.89          0.85
Basic Skills  0.72               0.96                0.90             0.97          0.99
Knowledge     0.92               0.96                0.95             0.97          0.98

GOOD BYE, CSSS 2002