COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk Project 1: KDD 2009 Orange Challenge


Page 1: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

COMP 4332 Tutorial 3
Feb 2
Yin Zhu
yinz@cset.hk

Project 1: KDD 2009 Orange Challenge

Page 2: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

All information is on this website:
• http://www.kddcup-orange.com/

Page 3: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

KDD Cup Participation By Year

[Bar chart: number of teams per year, 1997 to 2009.]

Year   # Teams
1997   45
1998   57
1999   24
2000   31
2001   136
2002   18
2003   57
2004   102
2005   37
2006   68
2007   95
2008   128
2009   453

Record KDD Cup Participation

Page 4: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

The story behind the challenge
• French telecom company Orange.
• Task: predict the propensity of customers to
  • switch provider (churn),
  • buy new products or services (appetency), or
  • buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).
• Estimate the churn, appetency and up-selling probability of customers.

Page 5: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Data, constraints and requirements

Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

• Input data
  • Relational databases
  • Numerical or categorical
  • Noisy
  • Missing values
  • Heavily unbalanced distribution
• Train data
  • Hundreds of thousands of instances
  • Tens of thousands of variables
• Deployment
  • Tens of millions of instances

Page 6: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

In-house system: from raw data to scoring models

[Figure: entity-relationship diagram of the data warehouse (conceptual data model "MCD PAC_v4", diagram "Tiers Services", author claudebe, dated 14/06/2005), covering customers, services, products and call details.]

• Data warehouse
  • Relational database
• Data mart
  • Star schema
• Feature construction
  • PAC technology
  • Generates tens of thousands of variables
• Data preparation and modeling
  • Khiops technology

Example of constructed variables: Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month, weekday, hours, service, …

Pipeline: Data feeding → PAC → Khiops → scoring model

Page 7: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Design of the challenge
• Orange business objective
  • Benchmark the in-house system against state-of-the-art techniques
• Data
  • Data store
    • Not an option
  • Data warehouse
    • Confidentiality and scalability issues
    • Relational data requires domain knowledge and specialized skills
  • Tabular format
    • Standard format for the data mining community
    • Domain knowledge incorporated using feature construction (PAC)
    • Easy anonymization
• Tasks
  • Three representative marketing tasks
• Requirements
  • Fast data preparation and modeling (fully automatic)
  • Accurate
  • Fast deployment
  • Robust
  • Understandable

Page 8: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Data sets extraction and preparation
• Input data
  • 10 relational tables
  • A few hundred fields
  • One million customers
• Instance selection
  • Resampling given the three marketing tasks
  • Keep 100 000 instances, with less unbalanced target distributions
• Variable construction
  • Using PAC technology
  • 20 000 constructed variables to get a tabular representation
  • Keep 15 000 variables (discard constant variables)
  • Small track: subset of 230 variables related to classical domain knowledge
• Anonymization
  • Discard variable names, discard identifiers
  • Randomize order of variables
  • Rescale each numerical variable by a random factor
  • Recode each categorical variable using random category names
• Data samples
  • 50 000 train and 50 000 test instances sampled randomly
  • 5000 validation instances sampled randomly from the test set
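The anonymization steps listed on this slide can be illustrated with a small Python sketch. This is not Orange's actual procedure: the function name `anonymize`, the row representation (a list of dicts), the scaling range 0.5 to 2.0 and the `cat0`, `cat1`, ... naming scheme are all assumptions made for illustration.

```python
import random

def anonymize(rows, numeric_cols, categorical_cols, seed=0):
    """Hypothetical sketch of the four anonymization steps.

    rows: list of dicts mapping column name -> value (None = missing).
    Columns not listed (e.g. identifiers) are discarded.
    Returns a list of lists, i.e. the variable names are discarded too.
    """
    rng = random.Random(seed)

    # Randomize the order of variables.
    cols = list(numeric_cols) + list(categorical_cols)
    rng.shuffle(cols)

    # One random rescaling factor per numerical variable (range is an assumption).
    factors = {c: rng.uniform(0.5, 2.0) for c in numeric_cols}

    # One random renaming of the categories per categorical variable.
    recode = {}
    for c in categorical_cols:
        values = sorted({r[c] for r in rows if r[c] is not None})
        names = ["cat%d" % i for i in range(len(values))]
        rng.shuffle(names)
        recode[c] = dict(zip(values, names))

    out = []
    for r in rows:
        new_row = []
        for c in cols:
            v = r[c]
            if v is None:
                new_row.append(None)          # missing values stay missing
            elif c in factors:
                new_row.append(v * factors[c])
            else:
                new_row.append(recode[c][v])
        out.append(new_row)
    return out
```

Because each numerical variable is multiplied by a single factor, ratios between customers are preserved, and because the category recoding is one-to-one, two customers share a recoded category exactly when they shared the original one.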

Page 9: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Scientific and technical challenge
• Scientific objective
  • Fast data preparation and modeling: within five days
  • Large scale: 50 000 train and 50 000 test instances, 15 000 variables
  • Heterogeneous data
    • Numerical with missing values
    • Categorical with hundreds of values
    • Heavily unbalanced distribution
• KDD social meeting objective
  • Attract as many participants as possible
    • Additional small track and slow track
    • Online feedback on validation dataset
    • Toy problem (only one informative input variable)
  • Leverage challenge protocol overhead
    • One month to explore descriptive data and test submission protocol
  • Attractive conditions
    • No intellectual property conditions
    • Money prizes

Page 10: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Data
• Each customer is a data instance with three labels: churn, appetency and up-selling (-1 or 1).
• The feature vector for each customer has two versions:
  • small (230 variables)
  • large (15,000 variables, sparse!)
• For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.
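The column layout described above (numerical variables first, categorical variables last) can be used to split one data row, as in the following Python sketch. The file format is an assumption here: the sketch supposes tab-separated fields with an empty field meaning a missing value, and the name `parse_row` is hypothetical.

```python
# Small-dataset layout taken from the slide: 190 numerical, then 40 categorical.
N_NUMERIC, N_CATEGORICAL = 190, 40

def parse_row(line, n_numeric=N_NUMERIC, n_categorical=N_CATEGORICAL):
    """Split one row into (numeric values, categorical values).

    Assumes tab-separated fields and empty fields for missing values;
    missing values are returned as None.
    """
    # Strip only the newline: trailing tabs are meaningful (empty last fields).
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_numeric + n_categorical:
        raise ValueError("expected %d fields, got %d"
                         % (n_numeric + n_categorical, len(fields)))
    numeric = [float(f) if f != "" else None for f in fields[:n_numeric]]
    categorical = [f if f != "" else None for f in fields[n_numeric:]]
    return numeric, categorical
```

For the large dataset the same function would be called with `n_numeric=14740` and `n_categorical=260`.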

Page 11: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Training and Testing
• Training: 50,000 samples with labels for churn, appetency and up-selling
• Testing: 50,000 samples without labels
• TASK: predicting a score for each customer in each task
• Play with the data: DEMO in R

Page 12: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk


Binary classification. But predicting a Score???

http://www.kddcup-orange.com/evaluation.php

Page 13: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

How is AUC calculated?
• Sort the predicted scores: $s_1 \le s_2 \le \dots \le s_n$
  • $s_i$ -- score of the i-th sample, and
  • $y_i$ -- true label (-1 or 1) of the i-th sample
• For each $s_i$, use it as a threshold:
  • For samples 1 to i, classify them as negative (-1)
  • For samples i+1 to n, classify them as positive (+1)
  • Calculate Sensitivity = tp/pos and Specificity = tn/neg; this gives one point on the ROC curve.
• Calculate the area under the curve.
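The threshold sweep described on this slide can be sketched in plain Python as follows. The function name `auc_by_threshold_sweep` is hypothetical, and two details go beyond the slide and are assumptions: samples with tied scores are moved across the threshold together, and the area is computed with the trapezoidal rule over the points (1 - specificity, sensitivity).

```python
def auc_by_threshold_sweep(scores, labels):
    """Area under the ROC curve; labels are -1 or +1."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Sort sample indices by ascending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    tn = fn = 0                 # nothing classified as negative yet
    points = [(1.0, 1.0)]       # (1 - specificity, sensitivity): all positive
    k = 0
    while k < len(order):
        threshold = scores[order[k]]
        # Classify every sample with this score as negative (ties move together).
        while k < len(order) and scores[order[k]] == threshold:
            if labels[order[k]] == 1:
                fn += 1
            else:
                tn += 1
            k += 1
        sensitivity = (pos - fn) / pos   # tp / pos
        specificity = tn / neg           # tn / neg
        points.append((1.0 - specificity, sensitivity))

    # Trapezoidal rule; x decreases from 1 to 0 along the sweep.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x0 - x1) * (y0 + y1) / 2.0
    return area
```

A score list that ranks every positive above every negative gives an area of 1, and identical scores for all samples give 0.5, which is why only the ordering of the scores matters and not their scale.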

Page 14: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

How to deal with categorical values
• Binarization:
  • { A, B, C } -> Create 3 binary variables
• Ordinalization:
  • { A, B, C } -> {1, 2, 3}
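Both encodings can be sketched in a few lines of Python. The function names `binarize` and `ordinalize` are hypothetical, and ordering the categories alphabetically is an assumption; any fixed order would do.

```python
def binarize(values):
    """One binary variable per category: { A, B, C } -> 3 columns of 0/1."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def ordinalize(values):
    """One integer code per category: { A, B, C } -> {1, 2, 3}."""
    categories = sorted(set(values))
    code = {c: i + 1 for i, c in enumerate(categories)}
    return [code[v] for v in values], code

rows, categories = binarize(["A", "C", "B", "A"])
codes, mapping = ordinalize(["A", "C", "B", "A"])
```

Ordinalization keeps a single column but imposes an artificial order on the categories, while binarization avoids that order at the cost of one column per category, which matters for variables with hundreds of values.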

Page 15: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Project 1 Requirement
• Deadline: 15 March 2012
• Team: 1 or 2 students.
• Do the competition
  • Register at http://www.kddcup-orange.com/register.php
  • Download the data
  • Try classifiers and ensemble methods, and submit your result
• 50% of the score is for your ranking on the website
• 50% of the score is for the report: what you have tried and, more importantly, what you have found
  • Preprocessing steps
  • Classifiers
  • Ensemble methods

Page 16: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Assignment 1
• Deadline: 25 Feb 2012
• Data exploration and experiment plan
  • What you have found on the data, e.g. data imbalance, various statistics over the data, data preprocessing methods you want to apply or have applied, etc.
  • A plan on what classification methods (svm, knn, naivebayes, etc.) and ensemble methods you want to try. You should be familiar with the tools and their I/O formats.
• A report of at least three pages
• Basically, Assignment 1 is a mid-term/progress report for the project.
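Two of the statistics mentioned for the data exploration, class imbalance and the share of missing values, can be computed with a short Python sketch like the one below. The function names and the in-memory representation (a list of labels, and a dict of column name to list of values with None for missing) are assumptions for illustration.

```python
from collections import Counter

def class_balance(labels):
    """Fraction of samples per label, e.g. {-1: 0.75, 1: 0.25}."""
    counts = Counter(labels)
    n = len(labels)
    return {label: counts[label] / n for label in counts}

def missing_fraction(columns):
    """Fraction of missing (None) values per column.

    columns: dict mapping column name -> list of values.
    """
    return {name: sum(v is None for v in vals) / len(vals)
            for name, vals in columns.items()}

balance = class_balance([1, -1, -1, -1])
missing = missing_fraction({"var1": [1.0, None, None, 4.0]})
```

Columns whose missing fraction is close to 1, or labels whose positive fraction is only a few percent, are the kind of findings the report can discuss before choosing preprocessing methods.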

Page 17: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Winning methods

Fast track:
- IBM Research, USA +: Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by mean, extra features constructed, etc.)
- ID Analytics, Inc., USA +: Filter+wrapper FS. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
- David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter FS, ensemble of decision trees.

Slow track:
- University of Melbourne: CV-based FS targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: Grouping of modalities, filter FS using AIC, gradient tree-classifier boosting.
- National Taiwan University +: Average of 3 classifiers: (1) Solve joint multiclass problem with l1-regularized maximum entropy model. (2) AdaBoost with tree-based weak learner. (3) Selective Naïve Bayes.

(+: small dataset unscrambling)
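Three of the simpler ingredients named above (missing values replaced by the mean, most frequent values coded with binary features, and averaging the scores of several classifiers) can be sketched in Python as follows. This is only an illustration of the ideas, not the winning teams' code; all function names and the choice of `k` are hypothetical.

```python
from collections import Counter

def impute_mean(values):
    """Replace missing (None) numerical values by the mean of the present ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    return [mean if v is None else v for v in values]

def code_frequent(values, k):
    """Code the k most frequent categories as binary features.

    Rarer categories and missing values get an all-zero row.
    """
    counts = Counter(v for v in values if v is not None)
    top = [c for c, _ in counts.most_common(k)]
    return [[1 if v == c else 0 for c in top] for v in values], top

def average_scores(score_lists):
    """Ensemble by averaging the per-sample scores of several classifiers."""
    return [sum(s) / len(s) for s in zip(*score_lists)]
```

Coding only the most frequent values keeps the number of binary features bounded even when a categorical variable has hundreds of values. Since AUC depends only on the ranking of the scores, classifiers whose scores live on different scales may need to be normalized or rank-transformed before averaging.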

Page 18: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Fact Sheets: Preprocessing & Feature Selection

[Bar chart: PREPROCESSING (overall usage=95%), x-axis: percent of participants. Categories: Principal Component Analysis, Other prepro, Grouping modalities, Normalizations, Discretization, Replacement of the missing values.]

[Bar chart: FEATURE SELECTION (overall usage=85%), x-axis: percent of participants. Categories: Wrapper with search, Forward / backward wrapper, Embedded method, Other FS, Filter method, Feature ranking.]

Page 19: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Fact Sheets: Classifier

[Bar chart: CLASSIFIER (overall usage=93%), x-axis: percent of participants. Categories: Bayesian Neural Network, Bayesian Network, Nearest neighbors, Naïve Bayes, Neural Network, Other Classif, Non-linear kernel, Linear classifier, Decision tree...]

- About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss.
- Less than 50% regularization (20% 2-norm, 10% 1-norm).
- Only 13% unlabeled data.

Page 20: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Fact Sheets: Model Selection

[Bar chart: MODEL SELECTION (overall usage=90%), x-axis: percent of participants. Categories: Bayesian, Bi-level, Penalty-based, Virtual leave-one-out, Other cross-valid, Other-MS, Bootstrap est, Out-of-bag est, K-fold or leave-one-out, 10% test.]

- About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.

Page 21: COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Fact Sheets: Implementation

[Four pie charts:]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Operating System: Windows, Linux, Unix, Mac OS
• Parallelism: None, Multi-processor, Run in parallel
• Software Platform: C, C++, Java, Matlab, Other (R, SAS)