1
COMP 4332 Tutorial 3
Feb 2
Yin Zhu
Project 1:
KDD 2009 Orange Challenge
2
All information on this website
• http://www.kddcup-orange.com/
3
KDD Cup Participation By Year

(Bar chart of the number of participating teams per year, 1997–2009; the data appear in the table below.)
Year # Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
Record KDD Cup Participation
4
The story behind the challenge
• French Telecom company Orange.
• Task: predict the propensity of customers to
  • switch provider (churn),
  • buy new products or services (appetency), or
  • buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling)
• Estimate the churn, appetency and up-selling probability of customers.
5
Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

• Input data
  • Relational databases
  • Numerical or categorical
  • Noisy
  • Missing values
  • Heavily unbalanced distribution
• Train data
  • Hundreds of thousands of instances
  • Tens of thousands of variables
• Deployment
  • Tens of millions of instances

Data, constraints and requirements
6
In-house system: from raw data to scoring models

(This slide reproduces an entity-relationship diagram of Orange's customer database: "Modèle Conceptuel de Données", model MCD PAC_v4, diagram "Tiers Services", author claudebe, dated 14/06/2005. It relates entities such as Offre (offer), Produit & Service (product & service), Tiers (party/customer), Compte Facturation (billing account), Compte Rendu Usage (usage record), Facture (invoice), Ligne Facture (invoice line), Foyer (household) and Adresse (address), with their cardinality annotations.)
Customer
Services
Products
Call details
…
• Data warehouse
  • Relational database
• Data mart
  • Star schema
• Feature construction
  • PAC technology
  • Generates tens of thousands of variables
• Data preparation and modeling
  • Khiops technology

Example tabular representation: Id customer | zip code | Nb calls/month | Nb calls/hour | Nb calls/month,weekday,hours,service | …
scoring model
Data feeding
PAC
Khiops
7
Design of the challenge
• Orange business objective
  • Benchmark the in-house system against state-of-the-art techniques
• Data
  • Data store: not an option
  • Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
  • Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
• Tasks
  • Three representative marketing tasks
• Requirements
  • Fast data preparation and modeling (fully automatic)
  • Accurate
  • Fast deployment
  • Robust
  • Understandable
8
Data sets extraction and preparation
• Input data
  • 10 relational tables
  • A few hundred fields
  • One million customers
• Instance selection
  • Resampling given the three marketing tasks
  • Keep 100,000 instances, with less unbalanced target distributions
• Variable construction
  • Using PAC technology
  • 20,000 constructed variables to get a tabular representation
  • Keep 15,000 variables (discard constant variables)
  • Small track: subset of 230 variables related to classical domain knowledge
• Anonymization
  • Discard variable names, discard identifiers
  • Randomize the order of variables
  • Rescale each numerical variable by a random factor
  • Recode each categorical variable using random category names
• Data samples
  • 50,000 train and test instances sampled randomly
  • 5,000 validation instances sampled randomly from the test set
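The four anonymization steps listed above can be sketched in a few lines. This is a reconstruction for illustration only, not Orange's actual pipeline; the column-dict table representation and the `Var1`, `Var2`, … output names are assumptions.

```python
import random

def anonymize(table, numerical_cols, categorical_cols):
    """Sketch of the anonymization steps above (a reconstruction for
    illustration, not Orange's actual code). `table` maps a column name
    to a list of values; missing values are None."""
    rng = random.Random(0)                       # fixed seed, for a repeatable sketch
    anon = {}
    for col in numerical_cols:                   # rescale by a random factor
        factor = rng.uniform(0.5, 2.0)
        anon[col] = [v * factor if v is not None else None for v in table[col]]
    for col in categorical_cols:                 # recode with random category names
        cats = sorted({v for v in table[col] if v is not None})
        mapping = {c: "cat%06d" % rng.randrange(10**6) for c in cats}
        anon[col] = [mapping.get(v) for v in table[col]]
    cols = list(anon)                            # discard original names, randomize order
    rng.shuffle(cols)
    return {"Var%d" % (i + 1): anon[c] for i, c in enumerate(cols)}
```

Note that rescaling by a positive factor preserves the ordering of a numerical variable, and renaming categories preserves their grouping, so most classifiers are unaffected by this kind of anonymization.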
9
Scientific and technical challenge
• Scientific objective
  • Fast data preparation and modeling: within five days
  • Large scale: 50,000 train and test instances, 15,000 variables
  • Heterogeneous data
    • Numerical with missing values
    • Categorical with hundreds of values
    • Heavily unbalanced distribution
• KDD social meeting objective
  • Attract as many participants as possible
    • Additional small track and slow track
    • Online feedback on the validation dataset
    • Toy problem (only one informative input variable)
  • Leverage challenge protocol overhead
    • One month to explore descriptive data and test the submission protocol
  • Attractive conditions
    • No intellectual property conditions
    • Money prizes
10
Data
• Each customer is a data instance with three labels: churn, appetency and up-selling (-1 or 1).
• The feature vector for each customer has two versions:
  • small (230 variables)
  • large (15,000 variables, sparse!)
• For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.
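Given that fixed column layout, each row can be split into its numerical and categorical parts while parsing. A hedged sketch (treating empty fields as missing is an assumed convention; check the actual download's format):

```python
def parse_row(fields, n_numerical):
    """Split one data row (a list of string fields) into numerical and
    categorical parts, following the column layout above: the first
    n_numerical fields are numeric (190 for the small set, 14740 for the
    large set), the rest are categorical. Empty fields become None."""
    numeric = [float(x) if x != "" else None for x in fields[:n_numerical]]
    categorical = [x if x != "" else None for x in fields[n_numerical:]]
    return numeric, categorical
```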
11
Training and Testing
• Training: 50,000 samples with labels for churn, appetency and up-selling
• Testing: 50,000 samples without labels
• TASK: predict a score for each customer in each task
• Play with the data: DEMO in R
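The demo is in R, but the first exploration step, checking how unbalanced each label is, looks the same in any language. A Python stand-in:

```python
def label_summary(labels):
    """Count the -1/+1 labels of one task and report the positive rate.
    All three tasks (churn, appetency, up-selling) have a small positive
    rate, which matters when choosing classifiers and evaluation metrics."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == -1)
    return {"pos": pos, "neg": neg, "pos_rate": pos / (pos + neg)}
```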
12
Binary classification. But predicting a score?
http://www.kddcup-orange.com/evaluation.php
13
How is AUC calculated?
• Sort the samples by predicted score s_1 ≤ s_2 ≤ … ≤ s_n, where
  • s_i is the score of the i-th sample, and
  • y_i is the true label (-1 or 1) of the i-th sample
• For each i, use s_i as a threshold:
  • classify samples 1 to i as negative (-1)
  • classify samples i+1 to n as positive (+1)
  • calculate Sensitivity = tp/pos and Specificity = tn/neg; this gives one point on the ROC curve
• Calculate the area under the curve.
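The threshold sweep described on this slide can be sketched in plain Python. This is a minimal reconstruction from the slide's description, not the official scoring code, and it assumes both classes are present:

```python
def auc_threshold_sweep(scores, labels):
    """AUC via a threshold sweep: sort by score, move the threshold past
    one sample at a time, collect (1 - specificity, sensitivity) points,
    and take the trapezoidal area. Labels are -1/+1."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    y = [labels[i] for i in order]
    pos = sum(1 for v in y if v == 1)
    neg = n - pos
    tp, fp = pos, neg            # threshold below all scores: everything positive
    roc = [(1.0, 1.0)]           # points are (1 - specificity, sensitivity) = (fp/neg, tp/pos)
    for i in range(n):
        if y[i] == 1:            # sample i drops below the threshold:
            tp -= 1              # a positive is now classified negative
        else:
            fp -= 1              # a negative is now classified negative (a true negative)
        roc.append((fp / neg, tp / pos))
    # trapezoidal area; the curve runs from (1, 1) down to (0, 0)
    return sum((x1 - x2) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
```

Submissions are scored server-side for this challenge, so a function like this is only useful for local validation, e.g. on a held-out split of the training data.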
14
How to deal with categorical values
• Binarization:
  • { A, B, C } -> create 3 binary variables
• Ordinalization:
  • { A, B, C } -> {1, 2, 3}
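Both encodings fit in a few lines of Python (the helper names and the fixed { A, B, C } category order are illustrative assumptions):

```python
def binarize(values, categories=("A", "B", "C")):
    """Binarization: each category becomes its own 0/1 indicator variable."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinalize(values, categories=("A", "B", "C")):
    """Ordinalization: map each category to an integer code (A->1, B->2, C->3).
    Note this imposes an artificial ordering on the categories, which may or
    may not suit the classifier."""
    code = {c: i + 1 for i, c in enumerate(categories)}
    return [code[v] for v in values]
```

Binarization avoids the artificial ordering but multiplies the number of variables, which matters here since some categorical variables have hundreds of values.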
15
Project 1 Requirements
• Deadline: 15 March 2012
• Team: 1 or 2 students
• Do the competition:
  • Register at http://www.kddcup-orange.com/register.php
  • Download the data
  • Try classifiers and ensemble methods, and submit your result
• 50% of the score for your ranking on the website
• 50% of the score for the report: what you have tried and, more importantly, what you have found
  • Preprocessing steps
  • Classifiers
  • Ensemble methods
16
Assignment 1
• Deadline: 25 Feb 2012
• Data exploration and experiment plan:
  • What you have found in the data, e.g. data imbalance, various statistics over the data, data preprocessing methods you want to apply or have applied, etc.
  • A plan for which classification methods (SVM, kNN, naive Bayes, etc.) and ensemble methods you want to try. You should be familiar with the tools and their I/O formats.
• A report of at least three pages
• Basically, Assignment 1 is a mid-term/progress report for the project.
17
Winning methods

Fast track:
- IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
- ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision-tree technology; bagging also used.
- David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter feature selection, ensemble of decision trees.

Slow track:
- University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
- National Taiwan University (+): Average of 3 classifiers: (1) solve the joint multiclass problem with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective naïve Bayes.

(+: small-dataset unscrambling)
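Several winners combined classifiers by averaging their scores. Since AUC depends only on the ranking of the scores, a simple sketch (not any team's actual code) first converts each model's scores to ranks so the scales are comparable, then takes a weighted mean:

```python
def average_scores(score_lists, weights=None):
    """Combine per-customer scores from several classifiers by a (weighted)
    mean of their ranks. `score_lists` is a list of k score lists, one per
    model, each of length n. Uniform weights by default."""
    k = len(score_lists)
    n = len(score_lists[0])
    if weights is None:
        weights = [1.0 / k] * k
    ranked = []
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i])
        ranks = [0] * n
        for r, i in enumerate(order):   # rank 0 = lowest score
            ranks[i] = r
        ranked.append(ranks)
    return [sum(w * ranked[m][i] for m, w in enumerate(weights)) for i in range(n)]
```

Rank averaging is one of several reasonable choices here; averaging calibrated probabilities, as some teams did, is another.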
18
Fact Sheets: Preprocessing & Feature Selection

PREPROCESSING (overall usage = 95%)
(Bar chart of the percentage of participants using each method:)
• Replacement of the missing values
• Discretization
• Normalizations
• Grouping modalities
• Other preprocessing
• Principal Component Analysis

FEATURE SELECTION (overall usage = 85%)
(Bar chart of the percentage of participants using each method:)
• Feature ranking
• Filter method
• Other FS
• Embedded method
• Wrapper with search
• Forward / backward wrapper
19
Fact Sheets: Classifier

CLASSIFIER (overall usage = 93%)
(Bar chart of the percentage of participants using each classifier:)
• Decision tree
• Linear classifier
• Non-linear kernel
• Other classifiers
• Neural Network
• Naïve Bayes
• Nearest neighbors
• Bayesian Network
• Bayesian Neural Network

- About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
20
Fact Sheets: Model Selection

MODEL SELECTION (overall usage = 90%)
(Bar chart of the percentage of participants using each method:)
• 10% test
• K-fold or leave-one-out
• Out-of-bag estimation
• Bootstrap estimation
• Other model selection
• Other cross-validation
• Virtual leave-one-out
• Penalty-based
• Bi-level
• Bayesian

- About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.
21
Fact Sheets: Implementation

(Bar charts of the percentage of participants in each category:)
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Operating System: Windows, Linux, Unix, Mac OS
• Parallelism: none, multi-processor, run in parallel
• Software Platform: Java, Matlab, C/C++, Other (R, SAS)