
Page 1:

Magister Teknologi Informasi
Universitas Indonesia
2012

Data Mining

Page 2:

More data is generated:
- Bank, telecom, and other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce

More data is captured:
- Storage technology is faster and cheaper
- DBMSs are capable of handling bigger databases

Page 3:

We have large data stored in one or more databases.

We are starving to find new information within those data (for research use, competitive edge, etc.).

We want to identify patterns or rules (trends and relationships) in those data.

We know that certain data exist inside a database, but what are the consequences of that data's existence?

Page 4:

There is often information “hidden” in the data that is not readily evident

Human analysts may take weeks to discover useful information

Much of the data is never analyzed at all

[Chart: "The Data Gap" — total new disk storage (TB) since 1995 grows far faster than the number of analysts, 1995–1999.]

From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"

Page 5:

Data Mining is a process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions (Cabena et al. 1998).

Data mining: discovering interesting patterns from large amounts of data (Han and Kamber, 2001).

Page 6:

Definition from [Connolly, 2005]: The process of extracting valid, previously unknown, comprehensive, and actionable information from large databases and using it to make crucial business decisions.

The common thread of data mining: it is all about finding patterns or rules by extracting data from large databases in order to find new information that could lead to new knowledge.

Page 7:

What is Data Mining?

– Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly ... in the Boston area)

– Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)

What is not Data Mining?

– Looking up a phone number in a phone directory

– Querying a Web search engine for information about "Amazon"

Page 8:

Data Mining in the Knowledge Discovery Process

[Diagram: Raw Data —(Selection & Cleaning)→ Target Data —(Integration)→ Data Warehouse —(Transformation)→ Transformed Data —(Data Mining)→ Patterns and Rules —(Interpretation & Evaluation)→ Knowledge; the whole pipeline rests on understanding the data.]

Page 9:

Data mining methods:
- Clustering
- Classification
- Association Rules
- Other methods: outlier detection, sequential patterns, prediction, trends and analysis of changes, methods for special data types (e.g., spatial data mining, web mining, ...)

Page 10:

Association rules try to find associations between items in a set of transactions.

For example, in the case of associations between items bought by customers in a supermarket:

90% of transactions that purchase bread and butter also purchase milk.

Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%

Page 11:

A transaction is a set of items: T = {i_a, i_b, ..., i_t}, T ⊆ I, where I = {i_1, i_2, ..., i_n} is the set of all possible items.

D, the task-relevant data, is a set of transactions (a database of transactions).

Example:
- Items sold by a supermarket (the itemset I): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}
- A transaction by a customer (T): T1 = {sugar, onion, salt}
- A database (D): {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, ...}

Page 12:

An association rule has the form P → Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅.

Examples:
{bread} → {butter, cheese}
{onion, tomato} → {salt}

Page 13:

Support of a rule P → Q = support of (P ∪ Q) in D:
s_D(P → Q) = s_D(P ∪ Q) = the percentage of transactions in D containing both P and Q, i.e., the number of transactions containing P and Q divided by the cardinality of D.

Confidence of a rule P → Q:
c_D(P → Q) = s_D(P ∪ Q) / s_D(P) = the percentage of transactions that contain both P and Q among the transactions that already contain P.
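To make these formulas concrete, here is a minimal Python sketch (not from the slides; the toy transactions are invented for illustration) that computes support and confidence for a rule P → Q:

```python
# Minimal sketch (not from the slides): support and confidence of P -> Q
# over a toy list of transactions.

def support(D, itemset):
    """Fraction of transactions in D that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in D if itemset <= set(t)) / len(D)

def confidence(D, P, Q):
    """s_D(P u Q) / s_D(P): how often Q occurs among transactions containing P."""
    return support(D, set(P) | set(Q)) / support(D, P)

D = [{"bread", "butter", "milk"},
     {"bread", "butter"},
     {"milk"},
     {"bread", "butter", "milk", "cheese"}]

print(support(D, {"bread", "butter"}))               # 0.75
print(confidence(D, {"bread", "butter"}, {"milk"}))  # 0.666...
```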

Page 14:

Thresholds:
- minimum support: minsup
- minimum confidence: minconf

Frequent itemset P: the support of P is larger than the minimum support.

Strong rule P → Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence.

Page 15:

Min. support: 50%. Min. confidence: 50%.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule {A} → {C}:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

For rule {C} → {A}:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({C}) = 100.0%

Page 16:

Input: a database of transactions; each transaction is a list of items (e.g., items purchased by a customer in a visit).

Find all strong rules that associate the presence of one set of items with that of another set of items.

Example: 98% of people who purchase tires and auto accessories also get automotive services done.

There are no restrictions on the number of items in the head or body of the rule.

The most famous algorithm is APRIORI.

Page 17:

Find the frequent itemsets: the sets of items that have minimum support.
- A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

Use the frequent itemsets to generate association rules (a sketch follows below).

Source: [Sunysb, 2009]
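A compact sketch of this level-wise search (illustrative Python, not the lecture's own code; `apriori` and its variable names are invented). It generates (k+1)-item candidates from the frequent k-itemsets, prunes by the subset property, and counts support with a scan of D:

```python
from itertools import combinations

def apriori(D, min_sup_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    D = [set(t) for t in D]
    items = {i for t in D for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in D) >= min_sup_count}
    frequent, k = {}, 1
    while Lk:
        for c in Lk:
            frequent[c] = sum(c <= t for t in D)
        # join step: combine frequent k-itemsets into (k+1)-item candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step (Apriori property): every k-subset must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # scan D once per level to keep only candidates with enough support
        Lk = {c for c in Ck if sum(c <= t for t in D) >= min_sup_count}
        k += 1
    return frequent

# The 9-transaction database used on the next slides, min. support count 2:
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
for itemset, count in apriori(D, 2).items():
    print(sorted(itemset), count)
```

Run with a minimum support count of 2, this reproduces the L1 through L3 tables derived step by step on the following slides.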

Page 18:

Consider a database, D, consisting of 9 transactions.

Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%).

Let the minimum confidence required be 70%.

We first have to find the frequent itemsets using the Apriori algorithm.

Then, association rules will be generated using min. support and min. confidence.

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Page 19:

Step 1: Generating the 1-itemset Frequent Pattern

Scan D for the count of each candidate (C1):

Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate's support count with the minimum support count (L1):

Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
• In the first iteration of the algorithm, each item is a member of the set of candidates, C1.

Page 20:

Step 2: Generating the 2-itemset Frequent Pattern

Generate candidates from L1 (C2):

{I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}

Scan D for the count of each candidate (C2):

Itemset    Sup. Count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

Compare each candidate's support count with the minimum support count (L2):

Itemset    Sup. Count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Page 21:

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.

Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table).

The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Note: we haven't used the Apriori Property yet.

Page 22:

Step 3: Generating the 3-itemset Frequent Pattern

Generate candidates from L2 (C3):

Itemset
{I1, I2, I3}
{I1, I2, I5}

Scan D for the count of each candidate (C3):

Itemset        Sup. Count
{I1, I2, I3}   2
{I1, I2, I5}   2

Compare each candidate's support count with the minimum support count (L3):

Itemset        Sup. Count
{I1, I2, I3}   2
{I1, I2, I5}   2

Page 23:

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.

In order to find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Page 24:

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.

Page 25:

BUT {I3, I5} is not a member of L2, and hence it is not frequent, violating the Apriori Property. Thus we have to remove {I2, I3, I5} from C3.

Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning.

Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Page 26:

Step 4: Generating the 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.

Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.

What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Page 27:

Step 5: Generating Association Rules from Frequent Itemsets

Procedure: For each frequent itemset I, generate all nonempty subsets of I. For every nonempty subset s of I, output the rule "s → (I − s)" if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
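A sketch of this rule-generation step (illustrative; it assumes a `frequent` mapping from itemsets to support counts, such as the one produced by the Apriori sketch above):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (antecedent, consequent, confidence) for every strong rule."""
    for I, sc_I in frequent.items():
        if len(I) < 2:
            continue
        # every nonempty proper subset s of I gives a candidate rule s -> I - s
        for r in range(1, len(I)):
            for s in map(frozenset, combinations(I, r)):
                conf = sc_I / frequent[s]  # subsets of a frequent itemset are frequent
                if conf >= min_conf:
                    yield sorted(s), sorted(I - s), conf

# Applied to I = {I1, I2, I5} with min_conf = 0.7, this yields R2, R3, and R6
# of the following slides at 100% confidence (other frequent itemsets contribute
# further strong rules, e.g. {I4} -> {I2}).
```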

Page 28:

In our example, we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.

▪ Let's take I = {I1, I2, I5}.
▪ All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.

Page 29:

Let the minimum confidence threshold be, say, 70%.

The resulting association rules are shown below, each listed with its confidence.

R1: {I1, I2} → {I5}
▪ Confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%
▪ R1 is rejected.

R2: {I1, I5} → {I2}
▪ Confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%
▪ R2 is selected.

R3: {I2, I5} → {I1}
▪ Confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%
▪ R3 is selected.

Page 30:

R4: {I1} → {I2, I5}
▪ Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%
▪ R4 is rejected.

R5: {I2} → {I1, I5}
▪ Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%
▪ R5 is rejected.

R6: {I5} → {I1, I2}
▪ Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%
▪ R6 is selected.

In this way, we have found three strong association rules.

Page 31:

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...

Page 32:

Prepare a collection of records (the training set). Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.).

Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Once happy with the accuracy, use your model to classify new instances.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
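As an illustrative sketch of this train/validate workflow, assuming scikit-learn is available (this is not the lecture's code; the toy data loosely mirrors the table on the next slide):

```python
# Illustrative sketch, assuming scikit-learn.
# Features: (Refund encoded as 1 = No / 0 = Yes, Taxable Income in K);
# Marital Status is dropped here to keep the toy example short.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[0, 125], [1, 100], [1, 70], [0, 120], [1, 95],
     [1, 60], [0, 220], [1, 85], [1, 75], [1, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# learn a model for the class attribute from the other attributes
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# validate on the test set; if the accuracy is acceptable, classify new instances
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new instance:", model.predict([[1, 80]]))  # Refund = No, income = 80K
```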

Page 33:

Training Set (Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set:

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

Learn a model (classifier) from the training set, then apply it to the test set.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Page 34:

Training data: the same 10-record table as on the previous page.

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

Page 35:

Training data: the same table again. An alternative model:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!

Page 36:

Applying the model to test data. Start from the root of the tree:

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

Test data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Page 37:

Direct Marketing

Goal: reduce the cost of mailing by targeting the set of consumers likely to buy a new cell-phone product.

Approach:
▪ Use the data for a similar product introduced before.
▪ We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
▪ Collect various demographic, lifestyle, and company-interaction information about all such customers: type of business, where they stay, how much they earn, etc.
▪ Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 38:

Fraud Detection

Goal: predict fraudulent cases in credit card transactions.

Approach:
▪ Use credit card transactions and the information on the account holder as attributes: when does the customer buy, what does he buy, how often does he pay on time, etc.
▪ Label past transactions as fraudulent or fair. This forms the class attribute.
▪ Learn a model for the class of the transactions.
▪ Use this model to detect fraud by observing credit card transactions on an account.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Page 39:

Customer Attrition/Churn

Goal: predict whether a customer is likely to be lost to a competitor.

Approach:
▪ Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
▪ Label the customers as loyal or disloyal.
▪ Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 40:

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

It helps users understand the natural grouping or structure in a data set.

Cluster: a collection of data objects that are "similar" to one another and thus can be treated collectively as one group.

Clustering is unsupervised classification: there are no predefined classes.

Page 41:

Find the "natural" grouping of instances given unlabeled data.

Page 42:

A good clustering method will produce high-quality clusters in which:
- the intra-class similarity (that is, within a cluster) is high;
- the inter-class similarity (that is, between clusters) is low.

The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

The quality of a clustering result also depends on the definition and representation of the clusters chosen.

Page 43:

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.

Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.

Page 44:

Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).

Finding the global optimum by exhaustively enumerating all partitions is too expensive! Instead, heuristic methods are based on iterative refinement of an initial partition (see the k-means sketch below).
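k-means is the best-known heuristic of this kind. A minimal sketch of its iterative refinement (illustrative code, not the lecture's; the data points are invented):

```python
# Minimal k-means sketch: iteratively refine an initial partition
# until the cluster centers stop moving.
import random

def kmeans(points, k, iters=100):
    random.seed(0)                      # reproducible toy run
    centers = random.sample(points, k)  # initial partition: k random points
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:      # converged
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(points, k=2)[0])
```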

Page 45:

Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.

The result is represented by a so-called dendrogram:
- Nodes in the dendrogram represent possible clusters.
- It can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).

A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Page 46:

Single link: cluster similarity = similarity of the two most similar members.

− Potentially long and skinny clusters
+ Fast

Page 47:

Start from the distance matrix for five points:

      1    2    3    4    5
1     0
2     2    0
3     6    3    0
4    10    9    7    0
5     9    8    5    4    0

The closest pair is {1, 2} (distance 2), so merge 1 and 2. Single-link distances from the new cluster:

d_{(1,2),3} = min{d_{1,3}, d_{2,3}} = min{6, 3} = 3
d_{(1,2),4} = min{d_{1,4}, d_{2,4}} = min{10, 9} = 9
d_{(1,2),5} = min{d_{1,5}, d_{2,5}} = min{9, 8} = 8

Updated distance matrix:

         (1,2)   3    4    5
(1,2)      0
3          3     0
4          9     7    0
5          8     5    4    0

Page 48:

The closest pair is now {(1,2), 3} (distance 3), so merge them:

d_{(1,2,3),4} = min{d_{(1,2),4}, d_{3,4}} = min{9, 7} = 7
d_{(1,2,3),5} = min{d_{(1,2),5}, d_{3,5}} = min{8, 5} = 5

Updated distance matrix:

           (1,2,3)   4    5
(1,2,3)       0
4             7      0
5             5      4    0

Page 49:

The closest pair is now {4, 5} (distance 4), so merge them:

d_{(1,2,3),(4,5)} = min{d_{(1,2,3),4}, d_{(1,2,3),5}} = min{7, 5} = 5

The final merge joins (1,2,3) and (4,5) at distance 5, completing the single-link dendrogram.

Page 50:

Complete link: cluster similarity = similarity of the two least similar members.

+ Tight clusters
− Slow

Page 51:

Start again from the same distance matrix. The closest pair is {1, 2} (distance 2); merge, but now use complete-link (maximum) distances:

d_{(1,2),3} = max{d_{1,3}, d_{2,3}} = max{6, 3} = 6
d_{(1,2),4} = max{d_{1,4}, d_{2,4}} = max{10, 9} = 10
d_{(1,2),5} = max{d_{1,5}, d_{2,5}} = max{9, 8} = 9

Updated distance matrix:

         (1,2)   3    4    5
(1,2)      0
3          6     0
4         10     7    0
5          9     5    4    0

Page 52:

The closest pair is now {4, 5} (distance 4), so merge them:

d_{(1,2),(4,5)} = max{d_{(1,2),4}, d_{(1,2),5}} = max{10, 9} = 10
d_{3,(4,5)} = max{d_{3,4}, d_{3,5}} = max{7, 5} = 7

Updated distance matrix:

         (1,2)   3    (4,5)
(1,2)      0
3          6     0
(4,5)     10     7     0

Page 53:

The closest pair is now {(1,2), 3} (distance 6), so merge them:

d_{(1,2,3),(4,5)} = max{d_{(1,2),(4,5)}, d_{3,(4,5)}} = max{10, 7} = 10

The final merge joins (1,2,3) and (4,5) at distance 10, completing the complete-link dendrogram.

Page 54:

Dendrogram: Hierarchical Clustering

A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
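The worked single-link and complete-link examples above can be reproduced with an off-the-shelf routine; a sketch assuming SciPy is available, using the same 5-point distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# distance matrix for points 1..5 from the worked example
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)

condensed = squareform(D)                         # condensed form expected by linkage
single = linkage(condensed, method="single")      # merge distances: 2, 3, 4, 5
complete = linkage(condensed, method="complete")  # merge distances: 2, 4, 6, 10
print(single[:, 2], complete[:, 2])

# "cut" the dendrogram into two clusters: {1, 2, 3} and {4, 5}
print(fcluster(single, t=2, criterion="maxclust"))
```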

Page 55:

Understanding the data
Data cleaning: missing values, noisy values, outliers
Dates
Nominal/numeric attributes
Discretization
Normalization
Smoothing
Transformation
Attribute selection

Page 56:

We can't be expected to be experts in all fields, but understanding the data can be extremely useful for data mining:
- What data is available?
- What available data is actually relevant or useful?
- Can the data be enriched from other sources?
- Are there historical datasets available?
- Who is the real expert to ask questions of? (Are the results at all sensible? Or are they completely obvious?)

Answers to these questions before embarking on a data mining project are invaluable later on.

Page 57:

Number of instances available: 5000 or more for reliable results.

Number of attributes: depends on the data set, but any attribute with fewer than 10 instances is typically not worth including.

Number of instances per class: more than 100 per class. If very unbalanced, consider stratified sampling.

Page 58:

Goal: maximizing data quality.
- Assess data quality.
- Correct noted deficiencies.

Assessing data quality:
- Analyze the data distribution: is there any strange data distribution?
- Analyze the data elements: check for inconsistencies, redundancy, missing values, outliers, etc.
- Conduct a physical audit: ensure data is recorded properly, for example by cross-checking data with the customer.
- Analyze business rules: check whether data violates the business rules.

Page 59:

Handling missing values:
- Exclude the attribute for which data is frequently missing.
- Exclude records that have missing data.
- Extrapolate missing values from other known values.
- Use a predictive model to determine a value.
- For quantitative values, use a generic figure, such as the average (see the sketch below).
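As an illustration of two of these strategies, a small pandas sketch (the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"age":    [25, 40, None, 31],
                   "income": [40000, None, 52000, 61000]})

complete_only = df.dropna()                     # exclude records with missing data
filled = df.fillna(df.mean(numeric_only=True))  # fill with a generic figure (the average)
print(filled)
```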

Page 60:

We want all dates to be in the same format. YYYY-MM-DD is an ISO standard, BUT it has some issues for data mining:
- Year 10,000 AD! (we only have 4 digits)
- Dates BC[E]: e.g., -0300-02-03 is not a valid YYYY-MM-DD date.
- Most importantly: it does not preserve intervals.

Other representations:
- POSIX/Unix system date: number of seconds since 1970
- etc.

Page 61:

Nominal data – without ordering, e.g., sex, country, etc.

Some algorithms can't deal with nominal or with numeric attributes. E.g., decision trees deal best with nominal attributes, but neural networks and many clustering algorithms require only numeric attribute values.

In case the algorithm requires converting nominal to numeric:
- Binary field: one value is 0, the other is 1 (e.g., gender).
- Ordered fields: convert to numbers that preserve the order (e.g., an A vs. a C grade becomes 4 and 2, respectively).
- Few values: convert each value into a new binary attribute (one-hot encoding; see the sketch below). For example, if the possible values for attribute AT are A, B, C, D, create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1.
- Many values: convert into groups of values, each with its own (binary) attribute, e.g., group the states in the US into 5 groups of 10.
- Unique values: ignore identifier-like attributes (drop the attribute).
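A pandas sketch of the "few values" case (illustrative; attribute AT as in the slide — pandas names the new columns AT_A ... AT_D rather than ATa ... ATd):

```python
import pandas as pd

df = pd.DataFrame({"AT": ["A", "C", "B", "D", "A"]})

# one new binary attribute per value: AT_A, AT_B, AT_C, AT_D
onehot = pd.get_dummies(df["AT"], prefix="AT").astype(int)
print(onehot)
```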

Page 62:

Some algorithms require nominal or discrete values. How can we turn a numeric value into a nominal value, or into a numeric value with a smaller range of values?

Several discretization techniques, often called 'binning' (equal width and equal depth are sketched below):
- Equal width
- Equal depth
- Class-dependent: Entropy
- Fuzzy (allow some fuzziness as to the edges of the bins)
- Non-disjoint (allow overlapping intervals)
- ChiMerge (use the chi-squared test in the same way as entropy)
- Iterative (use some technique, then minimise classifier error)
- Lazy (only discretize during classification — VERY lazy!)
- Proportional k-interval (number of bins = square root of the number of instances)
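A pandas sketch of the two simplest techniques (illustrative data):

```python
# Equal-width vs. equal-depth binning with pandas.
import pandas as pd

values = pd.Series([4, 8, 15, 16, 21, 23, 24, 25, 26, 34])

# equal width: each bin spans the same range of values
equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# equal depth: each bin holds roughly the same number of values
equal_depth = pd.qcut(values, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"value": values,
                    "equal_width": equal_width,
                    "equal_depth": equal_depth}))
```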

Page 63:

We might want to normalise our data so that two numeric values are comparable, for example to compare age and income. Three common formulas (all checked in the sketch below):

Decimal scaling: v' = v / 10^k for the smallest k such that max(|v'|) < 1.
E.g., for -991 and 99, k is 3, and -991 becomes -0.991.

Min/max normalisation: v' = (v − min) / (max − min) * (newMax − newMin) + newMin.
E.g., 73600 in [12000, 98000] mapped to [0, 1] is 0.716.

Zero-mean normalization: v' = (v − mean) / stddev.
E.g., with mean 54000 and stddev 16000, v = 73600 gives v' = 1.225.
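The three formulas applied to the slide's own numbers, as a small Python check:

```python
# decimal scaling: v' = v / 10**k for the smallest k with max(|v'|) < 1
v, k = -991, 3
print(v / 10**k)                              # -0.991

# min/max normalisation of 73600 from [12000, 98000] into [0, 1]
v, lo, hi, new_lo, new_hi = 73600, 12000, 98000, 0, 1
print((v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo)  # ~0.716

# zero-mean (z-score) normalization
v, mean, stddev = 73600, 54000, 16000
print((v - mean) / stddev)                    # 1.225
```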

Page 64:

In the case of noisy data, we might want to smooth the data so that it is more uniform. Some possible techniques:

- Regression: find a function for the data, and move each value some amount closer to what the function predicts (see classification).
- Clustering: some clustering techniques remove outliers; we could cluster the data to remove these noisy values.
- Binning: we could use some technique to discretize the data, and then smooth based on those 'bins'.

Page 65:

Transform data to a more meaningful form. For example:
- Birth date is transformed to age.
- Date of the first transaction is transformed to the number of days since the customer became a member.
- Grades of each course are transformed to cumulative GPA.
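For example, deriving age and membership duration with pandas (an illustrative sketch; the column names and the 2012 reference date are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"birth_date":        ["1990-05-01", "1985-12-24"],
                   "first_transaction": ["2010-01-15", "2009-06-30"]})
today = pd.Timestamp("2012-01-01")

# birth date -> age in whole years
df["age"] = (today - pd.to_datetime(df["birth_date"])).dt.days // 365

# date of first transaction -> days since the customer became a member
df["member_days"] = (today - pd.to_datetime(df["first_transaction"])).dt.days
print(df)
```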

Page 66:

Before getting to the data mining, we may want to either remove instances or select only a portion of the complete data set to work with.

Why? Perhaps our algorithms don't scale well to the amount of data we have.

Techniques:
- Record selection:
  ▪ Partitioning: split the database into sections and work with each in turn. Often not appropriate unless the algorithm is designed to do it.
  ▪ Sampling: select a random subset of the data, which is hopefully representative, and use that.
- Attribute selection:
  ▪ Stepwise forward selection: find the best attribute and add it.
  ▪ Stepwise backward elimination: find the worst attribute and remove it.
  ▪ Genetic algorithms: use a 'survival of the fittest' approach along with random cross-breeding.
  ▪ etc.

Page 67:

What technique would you use to solve each of these problems?

- Given a set of applicant attributes (name, salary, age, etc.), you want to decide whether or not to approve a customer's credit card application.

- Given national examination scores, you want to group Kabupatens (regencies) into three educational levels: Good, Average, Poor.

- You want to suggest suitable pants to your customer given her/his choice of shirt.

- You want to estimate the economic growth of Indonesia given some data (GNP, GDP, etc.).

Page 68:

Pak Bedu is a doctor who wants to use IT to help him decide whether or not a patient has cancer. To make that decision, Pak Bedu has data for each patient covering the results of 5 laboratory tests, along with the decision of whether the patient has cancer or not. Here is a sample of the data:

ID   T1   T2   T3   T4   T5   Cancer?
P1   1    3    4    2    3    Yes
P2   0    5    1    1         No
P3   1    2    4    2    2    No
P4   2    2    3    1    2    Yes
P5   2    2    3    1    2    No

What problems can you find in Pak Bedu's data?

Page 69:

When Pak Bedu examined one attribute (say T1), he found 5 distinct values with the following counts:
- 1 with count 1234
- 2 with count 2037
- 3 with count 1659
- 4 with count 1901
- 11 with count 1

What can you conclude by looking at this data?

Page 70:

Pak Bedu has expanded his clinic business to 3 locations, and has left the management of patient data entirely to each clinic. Pak Bedu wants to learn the characteristics of his patients by collecting data from all three clinics. However, he is confused because each clinic has a different data schema. What should Pak Bedu do?

Clinic 1 schema: Pasien(Nama, TglLahir, Tinggi (meters), Berat (kg), JenisKelamin (L/P), Alamat, Provinsi)

Clinic 2 schema: Pasien(Nama, TglLahir, Tinggi (centimeters), Berat (kg), JenisKelamin (P/W), Alamat, Kota)

Clinic 3 schema: Pasien(Nama, Umur, Tinggi (meters), Berat (kg), JenisKelamin (L/P), Kota, Provinsi)

Page 71:

It turns out Pak Bedu also runs a health college (Sekolah Tinggi Kesehatan). Being kind-hearted, Pak Bedu wants to give scholarships to students who are working on their undergraduate theses. However, he wants to select students who have the potential to finish the thesis within one semester. Pak Bedu has the grade data of students who have already graduated, along with how long each took to complete the thesis and the degree. The schemas of Pak Bedu's data include: Tabel Mahasiswa(NPM, Nama, Umur, AsalDaerah, LamaSkripsi, LamaStudi), Tabel MataKuliah(Kode, Nama, Kelompok), Tabel Nilai(NPM, KodeMK, Nilai).

Discuss what you could do to help Pak Bedu.

Page 72:

Pak Bedu has expanded his clinics to 100 branches. He wants to see patterns in patient visits: which regions have many sick people, when many people get sick, and so on. Pak Bedu has data about each clinic and the total number of patient visits per month.

What can you do?