
Berendt: Knowledge and the Web, 2015, http://www.cs.kuleuven.be/~berendt/teaching/

Knowledge and the Web –

Inferring new knowledge from data(bases):

Knowledge Discovery in Databases

Bettina Berendt

KU Leuven, Department of Computer Science

http://people.cs.kuleuven.be/~bettina.berendt/teaching

Last update: 25 November 2015

Where are we?

Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering

What should we recommend to a customer/user?

What’s spam and what isn’t?

Classification / prediction: how is that done?

In which weather will someone play (tennis etc.)?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Classification / prediction: What makes people happy?

“Classification along a numerical scale”: other forms of sentiment analysis

When we don’t know the classes yet, but need to discover them: What “news stories” are there today?

What “circles” of friends do you have?

Topic detection: What topics exist in a collection of texts, and how do they evolve?

News texts, scientific publications, speeches, …

From your questions to the speakers

These days you hear a lot about Big Data. Nobody seems to have a really good definition for it, though. Do you see linked data as a part of Big Data or more as something separate?

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (1)

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (2)

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (3)

Forms of data analysis

• Confirmatory
  • Hypothesis testing
  • Experimental procedure, data gathered for this purpose
  • Inferential statistics
  • Causality

• Exploratory
  • Data mining
  • Already-existing data
  • Data mining & machine learning models
  • “Correlation” (in a wide sense)

• Different basic assumptions, different evaluation methodologies, even when they use the same models (e.g. regression)!

Styles of reasoning

• Descriptive vs. predictive

• Deductive vs. inductive inference

• Data mining prediction is always inductive inference!

From your questions

Are there any economic indicators, related to the (country of representation of a) speaker that influence how many speeches are given by a certain country in the European parliament?

Are economically more powerful countries more influential in the European parliament?

Why does Germany have so much influence on European politics or is this a false statement?

Empiricism and apophenia

Empiricism and apophenia: correlation, causation, and instrumentality

“Correlation replaces causation”: Business logic and prediction vs. explanation ...

A related issue: number of data points / From your questions

Does the weather in Finland during the European Parliament elections affect the voting behaviour of the Finnish people?

Berendt: Advanced databases, first semester 2011, http://people.cs.kuleuven.be/~bettina.berendt/teaching

The KDD process: The output

“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

non-trivial process: multiple process
valid: justified patterns/models
novel: previously unknown
useful: can be used
understandable: by human and machine

The process part of knowledge discovery

CRISP-DM

• CRoss Industry Standard Process for Data Mining

• a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

Knowledge discovery, machine learning, data mining

Knowledge discovery = the whole process

Machine learning = the application of induction algorithms and other algorithms that can be said to “learn” = the “modeling” phase

Data mining: sometimes = KD, sometimes = ML

How much time will you actually spend modelling?

Standard data mining algorithms work on single tables

Important Q for data preparation: How to get from an RDF graph to a table?
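One possible answer to this question, sketched in Python under simplifying assumptions: each subject becomes a row and each predicate becomes a column. The triples, the `ex:` names and the helper `triples_to_table` are invented for illustration; a real project would query the graph (for example with SPARQL) and would need a deliberate policy for multi-valued and missing properties.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples, invented for this sketch.
triples = [
    ("ex:speech1", "ex:speaker", "ex:alice"),
    ("ex:speech1", "ex:topic", "environment"),
    ("ex:speech2", "ex:speaker", "ex:bob"),
    ("ex:speech2", "ex:topic", "economy"),
    ("ex:speech2", "ex:language", "en"),
]


def triples_to_table(triples):
    """Pivot triples into one row per subject and one column per predicate."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        # A predicate may occur several times for one subject, so collect a list.
        rows[s].setdefault(p, []).append(o)
    columns = sorted({p for _, p, _ in triples})
    table = []
    for s in sorted(rows):
        row = {"subject": s}
        for c in columns:
            values = rows[s].get(c)
            # Join multiple values into one cell; None marks a missing value.
            row[c] = "|".join(values) if values else None
        table.append(row)
    return columns, table


columns, table = triples_to_table(triples)
for row in table:
    print(row)
```

The resulting single table has one attribute per predicate, which is the shape that standard data mining algorithms expect. Note how the missing `ex:language` of the first subject has to be represented explicitly.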

Descriptive and predictive modelling / learning

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economically powerful countries can be based on different factors, including

Gross Domestic Product per Capita

...

A simple descriptive statistic: Correlation

[Figure: four scatter plots with panels y1, y2, y3 and y4, each with a horizontal axis running from 0 to 25]

“Truly numerical data”: Pearson correlation
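A minimal sketch of the Pearson correlation coefficient in plain Python, using the usual definition (covariance divided by the product of the standard deviations). The function name `pearson` and the two toy series are assumptions made for illustration, not data from the lecture.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]        # perfectly linear in x
y_quadratic = [(v - 3) ** 2 for v in x]  # fully determined by x, but not linearly

print(pearson(x, y_linear))
print(pearson(x, y_quadratic))
```

The second series depends entirely on `x`, yet its coefficient is zero because Pearson correlation only captures linear association. This is one reason to look at a plot before interpreting the number.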

From your questions

Is there a correlation between the countries of the speakers who give speeches about the environment and the countries that have the best environmental policies? (pollution, renewable energy, waste generation, etc.)

Rank data: Spearman’s rank correlation coefficient
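A minimal sketch of Spearman's coefficient, computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The helper names `average_ranks` and `spearman` and the toy data are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]  # monotone but not linear in x
print(average_ranks([10, 20, 20, 30]))
print(spearman(x, y_cubic))
```

Because only the order of the values enters the computation, any strictly increasing relationship yields a coefficient of 1, which makes the measure suitable for ordinal (rank) data.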

Unclear to me / From your questions

Is there a correlation between BBC coverage and the topic of the talks given at the European Parliament?

Is there a correlation between the government type of a country and how much its members talk about democracy?

Understand your data (1): Understand your concepts and how your variables measure them

Attributes

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

What’s in an attribute?

Each instance is described by a fixed predefined set of features, its “attributes”

But: number of attributes may vary in practice
  Possible solution: “irrelevant value” flag

Related problem: existence of an attribute may depend on value of another one

Possible attribute types (“levels of measurement”, aka “scales of measurement”):
  Nominal, ordinal, interval and ratio

Task: align example measures, scale of measurement, and allowed operations

Examples:
  Temperature (Celsius)
  Grades at school/university
  Pass or no pass (exam)
  Metres
  Temperature (“warm”, “cold”, ...)
  Weather (“good”, “bad”)
  Weather (“sunny”, “windy”, “cold crisp day”, ...)
  Likert-scale values (“on a scale of 1-7, ...”)
  Duration of work tasks (in minutes)
  ECTS credits

Scale levels: nominal, ordinal, interval, ratio

Operations: =, ≠; <, >; +, -; *, /; %; mode; median; arithmetic mean; geom. mean

Nominal quantities

Values are distinct symbols
  Values themselves serve only as labels or names
  Nominal comes from the Latin word for name

Example: attribute “outlook” from weather data
  Values: “sunny”, “overcast”, and “rainy”

No relation is implied among nominal values (no ordering or distance measure)

Only equality tests can be performed

Ordinal quantities

Impose order on values

But: no distance between values defined

Example: attribute “temperature” in weather data
  Values: “hot” > “mild” > “cool”

Note: addition and subtraction don’t make sense

Example rule: temperature < hot ⇒ play = yes

Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

Interval quantities

Interval quantities are not only ordered but measured in fixed and equal units

Example 1: attribute “temperature” expressed in degrees Fahrenheit

Example 2: attribute “year”

Difference of two values makes sense

Sum or product doesn’t make sense
  Zero point is not defined!
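A small numeric illustration of why ratios are not meaningful on an interval scale while differences are. It compares the same three temperatures in Celsius and Fahrenheit; the chosen values (10, 20 and 40 °C) are arbitrary assumptions for the sketch.

```python
def celsius_to_fahrenheit(c):
    """Convert between two interval scales with different zero points."""
    return c * 9 / 5 + 32


a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (celsius_to_fahrenheit(t) for t in (a_c, b_c, c_c))

# "Twice as warm" depends on the arbitrary zero point of the scale.
ratio_c = b_c / a_c  # 2.0
ratio_f = b_f / a_f  # 68 / 50 = 1.36

# Ratios of differences do not depend on the zero point.
gap_ratio_c = (c_c - b_c) / (b_c - a_c)  # 20 / 10 = 2.0
gap_ratio_f = (c_f - b_f) / (b_f - a_f)  # 36 / 18 = 2.0

print(ratio_c, ratio_f)
print(gap_ratio_c, gap_ratio_f)
```

The statement "20 degrees is twice as warm as 10 degrees" changes with the unit, whereas "the second gap is twice the first gap" holds in both units.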

Ratio quantities

Ratio quantities are ones for which the measurement scheme defines a zero point

Example: attribute “distance”
  Distance between an object and itself is zero

Ratio quantities are treated as real numbers
  All mathematical operations are allowed

But: is there an “inherently” defined zero point?
  Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)

Understanding your data (2): Visualize!

[Figure: four scatter plots with panels y1, y2, y3 and y4, each with a horizontal axis running from 0 to 25]

Understanding your data (3): How to visualize non-numerical data?

Is there a correlation between the government type of a country and how much its members talk about democracy?

How could you visualize data on this to avoid drawing wrong conclusions already at the outset?

Supervised and unsupervised learning and examples dealt with here

• Supervised learning

• Classification / classifier learning

• regression

• Unsupervised learning

• Association rule mining

• Clustering

A question to the speakers that I don’t quite understand

A lot of hierarchies in RDF specifications are built using some human compromise between the properties of a concept and the hierarchy in which the concept is classified. Unsupervised learners already outperform humans in some classification tasks.

How does this automatisation influence the availability of linked open data?

How to: our proposal

• Basic KDD techniques: frame your research question in terms of one of these tasks, use software to analyse your data (e.g. RapidMiner)

• Advanced KDD techniques (topic detection, sentiment analysis): use 3rd-party software (Sebastijan will provide a list)

• More advanced ideas? Ask / consult with us!

From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

For the sake of the argument, let us rephrase this a bit to give a typical classification task (see later for a more appropriate formalization):

People with what features (feature values) get a Nobel Prize?

Constructing decision trees

Strategy: top down, in recursive divide-and-conquer fashion

First: select attribute for root node
  Create branch for each possible attribute value

Then: split instances into subsets
  One for each branch extending from the node

Finally: repeat recursively for each branch, using only instances that reach the branch

Stop if all instances have the same class

Will illustrate key ideas with ID3, a very simple decision-tree learning algorithm

Which attribute to select?

Criterion for attribute selection

Which is the best attribute?
  Want to get the smallest tree
  Heuristic: choose the attribute that produces the “purest” nodes

Popular impurity criterion: information gain
  Information gain increases with the average purity of the subsets

Strategy: choose attribute that gives greatest information gain

Computing information

Measure information in bits
  Given a probability distribution, the info required to predict an event is the distribution’s entropy
  Entropy gives the information required in bits (can involve fractions of bits!)

Formula for computing the entropy:
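In the notation of the worked examples on the following slides, the standard (Shannon) entropy of a distribution can be written as below. The base-2 logarithm is assumed because information is measured in bits, and the convention for a zero probability is stated explicitly.

$$\mathrm{entropy}(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad \text{with } 0 \log_2 0 := 0$$

For a node with class counts $[a, b]$, the information is the entropy of the relative frequencies:

$$\mathrm{info}([a, b]) = \mathrm{entropy}\!\left(\frac{a}{a+b}, \frac{b}{a+b}\right)$$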

Example: attribute Outlook

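$$\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1 \log_2 1 - 0 \log_2 0 = 0 \text{ bits}$$

$$\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\tfrac{2}{5} \log_2 \tfrac{2}{5} - \tfrac{3}{5} \log_2 \tfrac{3}{5} = 0.971 \text{ bits}$$

$$\mathrm{info}([2,3],[4,0],[3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693 \text{ bits}$$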
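The three values above can be reproduced with a few lines of Python. This is a sketch: the function names `entropy`, `info` and `info_split` are chosen here to mirror the slide notation and are not part of any library.

```python
import math


def entropy(*probs):
    """Entropy in bits; terms with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def info(counts):
    """Information of a node given its class counts, e.g. [2, 3]."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))


def info_split(subsets):
    """Weighted average information of the subsets produced by a split."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)


print(round(info([4, 0]), 3))
print(round(info([2, 3]), 3))
print(round(info_split([[2, 3], [4, 0], [3, 2]]), 3))
```

Each subset is weighted by the fraction of instances that reach it, which is where the factors 5/14, 4/14 and 5/14 come from.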

Computing information gain

Information gain: information before splitting – information after splitting

Information gain for attributes from weather data:

gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits

gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits
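A sketch that recomputes the four gains from class counts. The [yes, no] counts per attribute value in `SPLITS` were tallied from the 14-instance weather table used earlier in these slides; the names `info`, `gain` and `SPLITS` are assumptions of this example.

```python
import math


def info(counts):
    """Entropy in bits of a node with the given class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)


def gain(before, subsets):
    """Information before the split minus weighted information after it."""
    total = sum(before)
    after = sum(sum(s) / total * info(s) for s in subsets)
    return info(before) - after


# [yes, no] counts for each value of each attribute (tallied from the table).
SPLITS = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],      # sunny, overcast, rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool
    "Humidity": [[3, 4], [6, 1]],             # high, normal
    "Windy": [[6, 2], [3, 3]],                # false, true
}

gains = {attr: gain([9, 5], subsets) for attr, subsets in SPLITS.items()}
best = max(gains, key=gains.get)

for attr, value in gains.items():
    print(f"gain({attr}) = {value:.3f} bits")
print("best attribute:", best)
```

Outlook has the largest gain, so it is the attribute selected for the root node.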

Continuing to split

gain(Temperature) = 0.571 bits

gain(Humidity) = 0.971 bits

gain(Windy) = 0.020 bits

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further
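The whole procedure (select the attribute with the greatest information gain, split, recurse, stop at pure nodes) can be sketched as a compact ID3-style learner. This is an illustrative simplification for nominal attributes only, not Quinlan's original code. The data are the 14 weather instances from the table earlier in the slides, and all function names are assumptions of this example.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Windy", "Play")
RAW = """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
"""
# All values, including Windy, are kept as strings.
ROWS = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]


def info(rows):
    """Entropy in bits of the class attribute Play within rows."""
    n = len(rows)
    counts = Counter(r["Play"] for r in rows)
    return sum(-c / n * math.log2(c / n) for c in counts.values())


def gain(rows, attr):
    """Information gain obtained by splitting rows on attr."""
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * info(subset)
    return info(rows) - after


def id3(rows, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    classes = Counter(r["Play"] for r in rows)
    # Stop when the node is pure or no attributes are left (majority class).
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    remaining = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


def classify(tree, instance):
    """Follow the branches matching the instance until a leaf is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][instance[attr]]
    return tree


tree = id3(ROWS, ["Outlook", "Temp", "Humidity", "Windy"])
print(tree)
```

On this data the learner splits on Outlook first, then on Humidity in the sunny branch and on Windy in the rainy branch. An attribute value never seen during training would raise a `KeyError` in `classify`; handling that case is left out of the sketch.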

Wishlist for a purity measure

Properties we require from a purity measure:

When node is pure, measure should be zero

When impurity is maximal (i.e. all classes equally likely), measure should be maximal

Measure should obey multistage property (i.e. decisions can be made in several stages)

Entropy is the only function that satisfies all three properties!

Properties of the entropy

The multistage property:

Simplification of computation:

Note: instead of maximizing info gain we could just minimize information
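For reference, a standard way to write the two properties named on this slide, using the entropy and info notation of the earlier slides. The three-class counts [2, 3, 4] are only an illustrative assumption.

The multistage property says that a three-way decision can be made in two stages, first separating $p$ from the rest and then splitting the rest:

$$\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p, q + r) + (q + r) \cdot \mathrm{entropy}\!\left(\frac{q}{q+r}, \frac{r}{q+r}\right)$$

The computation simplifies because the common denominator can be pulled out of the logarithms:

$$\mathrm{info}([2,3,4]) = -\tfrac{2}{9} \log_2 \tfrac{2}{9} - \tfrac{3}{9} \log_2 \tfrac{3}{9} - \tfrac{4}{9} \log_2 \tfrac{4}{9} = \frac{-2 \log_2 2 - 3 \log_2 3 - 4 \log_2 4 + 9 \log_2 9}{9}$$

Since the information before a split is the same for every candidate attribute, choosing the attribute with the smallest information after the split selects the same attribute as maximizing the gain.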

Variants

Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan

Various improvements, e.g. C4.5: deals with numeric attributes, missing values, noisy data

Other measures instead of information gain (details see exercise session / individual)

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Classification rules

Popular alternative to decision trees

Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)

Tests are usually logically ANDed together (but may also be general logical expressions)

Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule

Individual rules are often logically ORed together
  Conflicts arise if different conclusions apply

An example

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
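The rule set above can be written as a small Python function. The sketch assumes that the rules form an ordered list in which the first matching rule wins; that is one common way to resolve conflicts, and it is an assumption here rather than something the slide states. The function name `classify` is likewise invented.

```python
def classify(outlook, humidity, windy):
    """Apply the five rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above


# Rule 2 and rule 4 both match this instance and disagree;
# the ordering resolves the conflict in favour of rule 2.
print(classify("rainy", "normal", True))
print(classify("sunny", "normal", False))
```

Without a fixed order, the rainy, windy day with normal humidity would receive two contradictory conclusions, which is exactly the kind of conflict a rule set has to handle.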

Transition: Trees for numeric prediction

Regression: the process of computing an expression that predicts a numeric quantity

Regression tree: “decision tree” where each leaf predicts a numeric quantity

Predicted value is average value of training instances that reach the leaf

Model tree: “regression tree” with linear regression models at the leaf nodes

Linear patches approximate continuous function


An example

Outlook    Temperature  Humidity  Windy  Play-time
Sunny      Hot          High      False  5
Sunny      Hot          High      True   0
Overcast   Hot          High      False  55
Rainy      Mild         Normal    False  40
…          …            …         …      …
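A minimal sketch of the regression-tree idea on the four rows visible in this example: a tree with a single split whose leaves predict the average play-time of the training instances that reach them. Splitting on Outlook is an assumption made for illustration, not the result of an attribute-selection step.

```python
from collections import defaultdict
from statistics import mean

# The four visible rows: (outlook, temperature, humidity, windy, play_time)
rows = [
    ("sunny", "hot", "high", False, 5),
    ("sunny", "hot", "high", True, 0),
    ("overcast", "hot", "high", False, 55),
    ("rainy", "mild", "normal", False, 40),
]

def fit_stump(rows, attribute_index):
    """One-level regression tree: one leaf per attribute value,
    each leaf predicting the mean target of its training instances."""
    leaves = defaultdict(list)
    for row in rows:
        leaves[row[attribute_index]].append(row[-1])
    return {value: mean(targets) for value, targets in leaves.items()}

stump = fit_stump(rows, attribute_index=0)  # assumed split: outlook
print(stump["sunny"])  # mean of 5 and 0: 2.5
```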


Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering


From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economic power can be measured by different factors, including

Gross Domestic Product per Capita

...


Lead question

“How does the dependent variable depend on the independent one?”

“Can we predict the likely value of the dependent variable for a new data instance (with a given value of the independent variable)?”


Introduction to Linear Regression (the statistical approach)

The Pearson correlation measures the degree to which a set of data points forms a straight-line relationship.

Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.

Slides 44-49: slightly adapted from https://home.ubalt.edu/tmitch/631/PowerPoint_Lectures/chapter17/chapter17.ppt


Introduction to Linear Regression (cont.)

Any straight line can be represented by an equation of the form Y = bX + a, where b and a are constants.

The value of b is called the slope constant and determines the direction and degree to which the line is tilted.

The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.


Introduction to Linear Regression (cont.)

How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.

The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.

The regression equation is designed to produce the minimum sum of squared errors.


Introduction to Linear Regression (cont.)

The equation for the regression line is Ŷ = bX + a, where b and a are the constants that produce the minimum sum of squared errors.
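A minimal sketch of how the slope b and the Y-intercept a of the least-squares line can be computed, using the standard closed-form solution b = SP / SS_X and a = mean(Y) - b * mean(X), where SP is the sum of products of deviations and SS_X the sum of squared deviations of X. The data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = bX + a; returns (b, a)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
b, a = fit_line(xs, ys)
print(round(b, 3), round(a, 3))  # 0.6 2.2
```

Any other line, for example one with a slightly larger slope, yields a larger value of `sum_squared_errors` on the same points.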


From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economic power can be measured by different factors, including

Gross Domestic Product per Capita

Human Development Index

...

Multiple regression

(details: see exercise session)

From your questions

Is there a correlation between the government type of a country and how much its members talk about democracy?

This has (assumed) categorical predictors, which can be modelled by dummy variables in a linear regression.

Dummy variables
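A minimal sketch of dummy coding: a categorical predictor with k levels is turned into k - 1 columns of 0/1 values, with one level serving as the reference category (all zeros). The government types listed here are made-up example values, and the function name is illustrative.

```python
def dummy_code(values, reference=None):
    """Return (column_names, rows) with one 0/1 column per non-reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    columns = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in columns] for value in values]
    return columns, rows

government_types = ["republic", "monarchy", "republic", "federation"]  # made up
columns, rows = dummy_code(government_types, reference="republic")
print(columns)  # ['federation', 'monarchy']
print(rows)     # [[0, 0], [0, 1], [0, 0], [1, 0]]
```

The resulting 0/1 columns can then be used as numeric predictors in a linear regression.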


From your questions

Which European politicians have a high chance of receiving a Nobel Prize?


Logistic regression – input data


Logistic regression – fitting a curve


Logistic regression – prediction
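The figures for the three logistic-regression slides are not part of this transcript. As a minimal sketch of the prediction step, assuming the standard logistic (sigmoid) curve with one predictor: the fitted curve maps a predictor value to a probability, and a threshold turns that probability into a class. The coefficients below are hypothetical placeholders, not fitted values.

```python
import math

def logistic(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Probability of the positive class for predictor value x."""
    return logistic(intercept + slope * x)

def predict_class(x, intercept, slope, threshold=0.5):
    """Positive class if the predicted probability reaches the threshold."""
    return predict_probability(x, intercept, slope) >= threshold

INTERCEPT, SLOPE = -4.0, 0.8  # hypothetical coefficients

p = predict_probability(5.0, INTERCEPT, SLOPE)
print(p)  # intercept + slope * x = 0, so the probability is 0.5
```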


From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

Note: Logistic regression also exists in multivariate form (= with multiple predictor variables)


Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering


From your questions

To what extent are a politician's topics of choice influenced by* their field of study during higher education?

* Phrasing: see the remark on “correlation vs. causation” above!

Are speeches in the European Parliament related to what the public thinks or searches for online?

Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)

Where to put: spaghetti, butter?


Data

"Market basket data": attributes with boolean domains

In a table each row is a basket (aka transaction)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce
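A minimal sketch of how this market basket table could be represented: each transaction is a set of items, which is equivalent to one boolean attribute per item. The variable and function names are illustrative.

```python
# Transaction ID -> basket items, as in the table above
transactions = {
    1: {"spaghetti", "tomato sauce"},
    2: {"spaghetti", "bread"},
    3: {"spaghetti", "tomato sauce", "bread"},
    4: {"bread", "butter"},
    5: {"bread", "tomato sauce"},
}

# All items that occur in any basket, in a fixed order
ITEMS = sorted(set().union(*transactions.values()))

def as_boolean_row(basket):
    """Boolean-attribute view of a basket: one True/False per item."""
    return [item in basket for item in ITEMS]

def support_count(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for basket in transactions.values() if set(itemset) <= basket)

print(ITEMS)                            # ['bread', 'butter', 'spaghetti', 'tomato sauce']
print(as_boolean_row(transactions[4]))  # [True, True, False, False]
print(support_count({"bread"}))         # 4
```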

Solution approach: The apriori principle and the pruning of the search tree (1)

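Search tree (itemset lattice) over the four items:

1-itemsets: {spaghetti}, {tomato sauce}, {bread}, {butter}

2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}

3-itemsets: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}

4-itemset: {spaghetti, tomato sauce, bread, butter}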

Solution approach: The apriori principle and the pruning of the search tree (2)

[Figure: itemset lattice over spaghetti, tomato sauce, bread, butter]

Solution approach: The apriori principle and the pruning of the search tree (3)

[Figure: itemset lattice over spaghetti, tomato sauce, bread, butter]

Solution approach: The apriori principle and the pruning of the search tree (4)

[Figure: itemset lattice over spaghetti, tomato sauce, bread, butter]
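A minimal sketch of the apriori principle behind the pruning: the support of an itemset can never exceed the support of any of its subsets, so once an itemset is infrequent, all of its supersets can be skipped. The minimum support of 40% is an assumption taken from the formal walk-through of the same example; the function names are illustrative.

```python
from itertools import combinations

baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]
items = sorted(set().union(*baskets))
MIN_SUPPORT = 0.4  # assumption: 40%, as in the formal example

def support(itemset):
    """Fraction of baskets that contain the whole itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def supersets_of(itemset, size):
    """All itemsets of the given size that contain the given itemset."""
    return [frozenset(c) for c in combinations(items, size)
            if itemset <= frozenset(c)]

butter = frozenset({"butter"})
skipped = [s for size in (2, 3, 4) for s in supersets_of(butter, size)]

print(support(butter) < MIN_SUPPORT)                          # True: {butter} is infrequent
print(len(skipped))                                           # 7 supersets need no counting
print(all(support(s) <= support(butter) for s in skipped))    # True
```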


More formally: Generating large k-itemsets with Apriori

Min. support = 40%

step 1: candidate 1-itemsets

Spaghetti: support = 3 (60%)

tomato sauce: support = 3 (60%)

bread: support = 4 (80%)

butter: support = 1 (20%)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce


Contd.

step 2: large 1-itemsets

Spaghetti

tomato sauce

bread

candidate 2-itemsets

{Spaghetti, tomato sauce}: support = 2 (40%)

{Spaghetti, bread}: support = 2 (40%)

{tomato sauce, bread}: support = 2 (40%)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce


step 3: large 2-itemsets

{Spaghetti, tomato sauce}

{Spaghetti, bread}

{tomato sauce, bread}

candidate 3-itemsets

{Spaghetti, tomato sauce, bread}: support = 1 (20%)

step 4: large 3-itemsets { }

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Contd.
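The four steps can be sketched as a level-wise loop. This is a minimal, unoptimized illustration of the Apriori idea, not a reference implementation: count the candidates, keep the large ones, join them into candidates one item larger, and prune every candidate that has a subset which is not large. The function name and data layout are assumptions.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {large itemset: support count} for all large k-itemsets."""
    n = len(baskets)

    def count(itemset):
        return sum(1 for basket in baskets if itemset <= basket)

    items = sorted(set().union(*baskets))
    candidates = [frozenset([item]) for item in items]
    large = {}
    k = 1
    while candidates:
        # Count the candidates and keep those that reach minimum support
        counts = {c: count(c) for c in candidates}
        level = {c: s for c, s in counts.items() if s / n >= min_support}
        large.update(level)
        k += 1
        # Join step: unions of two large itemsets that have exactly k items
        unions = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large
        candidates = [c for c in unions
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))]
    return large

baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]
large = apriori(baskets, min_support=0.4)
print(len(large))  # 6: three large 1-itemsets and three large 2-itemsets
```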


From itemsets to association rules

Schema: If subset then large k-itemset with support s and confidence c

s = (support of large k-itemset) / # tuples

c = (support of large k-itemset) / (support of subset)

Example:

If {spaghetti} then {spaghetti, tomato sauce}

Support: s = 2 / 5 (40%)

Confidence: c = 2 / 3 (≈ 67%)
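A minimal sketch of the schema above, computing support and confidence of a rule from the five example baskets. The function names are illustrative.

```python
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset):
    """Number of baskets that contain the whole itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def rule_measures(subset, itemset):
    """Support s and confidence c of the rule 'if subset then itemset'."""
    s = support_count(itemset) / len(baskets)
    c = support_count(itemset) / support_count(subset)
    return s, c

s, c = rule_measures({"spaghetti"}, {"spaghetti", "tomato sauce"})
print(s)            # 0.4
print(round(c, 3))  # 0.667
```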


From local associations to global models: clustering

To what extent are a politician's topics of choice influenced by their field of study during higher education?

Can we find clusters of educational background and topics?


Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering


The basic idea of clustering: group similar things

[Figure: scatter plot with axes Attribute 1 and Attribute 2; the points form Group 1 and Group 2]

Concepts in Clustering

Defining distance between points

Euclidean distance

any other distance (city-block metric, Levenshtein, Jaccard similarity, ...)

A good clustering is one where

(Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,

(Inter-cluster distance) while the distances between different clusters are maximized

Objective to minimize: F(Intra, Inter)

Clusters can be evaluated with “internal” as well as “external” measures

Internal measures are related to the inter/intra-cluster distance

External measures are related to how well the current clusters represent the “true” classes
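A minimal sketch of some of the distance notions named above, plus an intra-cluster distance computed as the sum of pairwise distances within one cluster. The function names are illustrative, and turning Jaccard similarity into a distance as 1 minus the similarity is one common convention, assumed here.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cityblock(p, q):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets (0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

def intra_cluster_distance(cluster, dist=euclidean):
    """Sum of distances over all pairs of objects in one cluster."""
    return sum(dist(p, q)
               for i, p in enumerate(cluster)
               for q in cluster[i + 1:])

print(euclidean((0, 0), (3, 4)))  # 5.0
print(cityblock((0, 0), (3, 4)))  # 7
```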



K Means Example (K=2)

[Figure sequence: data points with the two centroids marked as x]

Pick seeds

Reassign clusters

Compute centroids

Reassign clusters

Compute centroids

Reassign clusters

Converged!

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt


K-means algorithm
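The algorithm listing of this slide is not part of the transcript. A minimal sketch, assuming the standard iteration that matches the steps of the K = 2 example (pick seeds, reassign clusters, compute centroids, repeat until the assignment no longer changes), with Euclidean distance. The data points and the choice of seeds are made up for illustration.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Return (centroids, assignment) after iterating to convergence."""
    centroids = [tuple(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassign clusters: each point goes to its nearest centroid
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: math.dist(point, centroids[j]))
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Compute centroids: mean of the points assigned to each cluster
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coords) / len(members)
                                     for coords in zip(*members))
    return centroids, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # made-up data
centroids, assignment = kmeans(points, seeds=[points[0], points[1]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the seeds: a different initial choice can lead to a different final clustering.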


From local associations to global models: clustering

To what extent are a politician's topics of choice influenced by their field of study during higher education?

Can we find clusters of educational background and topics?


Clustering non-numerical data

(to follow)


Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering


Next lecture

More on KDD concepts and methods for your projects

Supervised and unsupervised learning and examples dealt with here

• Supervised learning

• Classification / classifier learning

• Regression

• Unsupervised learning

• Association rule mining

• Clustering

What's the human input in both types?


References / background reading; acknowledgements

The slides are based on Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

In particular, pp. 8-57 are based on the instructor slides for that book (chapters 1-4), available at http://books.elsevier.com/companions/9780120884070/:

http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or

http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)

Scales (aka levels) of measurement are explained well here:

http://en.wikipedia.org/wiki/Level_of_measurement [15 Nov 2014]