LIACS Data Mining course
an introduction
Arno Knobbe, Arne Koopman

Course Textbook
  Data Mining: Practical Machine Learning Tools and Techniques
  second edition, Morgan Kaufmann, ISBN 0-12-088407-0
  by Ian Witten and Eibe Frank

Course Information
  New website: http://www.liacs.nl/~akoopman/DaMi/
  Old website discontinued: http://www.liacs.nl/~joost/DM/CollegeDataMining.htm
  New lecturers, new style
  Old book. This may change next year
  Updated slides + some new material
  Practical exercises
  New style of exam
    fewer definitions, more understanding and applying
    old exams should not be used
    exam preparation (Dec 8) important

Course Outline
  Date                  Lecturer          Notes
  08-Sep                Knobbe            today
  15-Sep                Knobbe
  22-Sep                Koopman
  29-Sep                Knobbe
  06-Oct                Knobbe            practical exercise
  13-Oct                Koopman
  20-Oct                Koopman
  27-Oct                Knobbe
  03-Nov                Knobbe            practical exercise
  10-Nov                Koopman           practical exercise
  17-Nov                                  no lecture!
  24-Nov                Koopman           start at 9:00!
  01-Dec                Koopman
  08-Dec                Koopman, Knobbe   exam preparation!
  14-Jan, 10:00-13:00                     exam
Introduction Data Mining
an overview and some examples

Data Mining definitions
  Data Mining: the concept of extracting previously unknown and potentially useful information from large sets of data.
  Secondary statistics: analyzing data that wasn't originally collected for analysis.

Data Mining, the big idea
  Organizations collect large amounts of data
  Often for administrative purposes
  Large body of experience
  Learning from experience

Goals
  Prediction
  Optimization
  Forecasting
  Diagnostics

2 Streams

Mining for insight
  Understanding a domain
  Finding regularities between variables
  Goal of Data Mining is mostly undefined
  Interpretable models
  Examples: Medicine, production, maintenance

Black-box Mining
  Don't care how you do it, just do it well
  Optimization
  Examples: Marketing, forecasting (financial, weather)
example: Direct Mail
  Optimize the response to a mailing, by targeting only those that are likely to respond:
    more response
    fewer letters
  [Diagram, two panels. Panel 1: customer information, response 3%. Panel 2: customer information, test mailing, final mailing, response 30%, remainder.]
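
The "more response, fewer letters" claim can be made concrete with a little arithmetic. The sketch below takes the two response rates from the slide (3% and 30%) and assumes that 3% is the response to an untargeted mailing and 30% the response to the targeted final mailing. The target of 300 responses and the helper name letters_needed are hypothetical, chosen only for illustration.

```python
def letters_needed(target_responses: int, response_pct: int) -> int:
    """Smallest number of letters expected to yield target_responses.

    Uses integer ceiling division so that percentages such as 3%
    do not introduce floating-point rounding errors.
    """
    return -((-target_responses * 100) // response_pct)


UNTARGETED_PCT = 3   # response rate on the slide (assumed: untargeted mailing)
TARGETED_PCT = 30    # response rate on the slide (assumed: targeted final mailing)

target = 300  # hypothetical number of responses we want

untargeted_letters = letters_needed(target, UNTARGETED_PCT)
targeted_letters = letters_needed(target, TARGETED_PCT)
lift = TARGETED_PCT / UNTARGETED_PCT  # how much better the targeted mailing responds

print(f"untargeted: {untargeted_letters} letters")
print(f"targeted:   {targeted_letters} letters")
print(f"lift:       {lift:.0f}x")
```

Under these assumptions the targeted mailing needs one tenth of the letters for the same number of responses.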
example: Bioinformatics
  Find genes involved in disease (Parkinson's, Celiac, Neuroblastoma)
  Measurements from patients (1) and controls (0)
  Gene expression: measurements of 20k genes
  dataset 20,001 x 100

Challenges
  many variables
  few examples (patients), testing is expensive
  interactions between genes

Data Mining paradigms
  Classification
    binary class variable
    predict class of future cases
    most popular paradigm
  Clustering
    divide dataset into groups of similar cases
  Regression
    numeric target variable
  Association
    find dependencies between variables
    basket analysis, ...

Classification
  Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given).
Building (inducing) a decision tree

  Age  Gender  House  Price  Mortgage?
  21   M       Rent   -      No
  30   F       Rent   -      Yes
  40   M       Rent   -      No
  32   F       Buy    300K   No
  30   F       Rent   -      Yes
  55   M       Buy    260K   No
  25   F       Buy    180K   Yes
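
One common way to induce a tree is to pick, at each node, the attribute whose split gives the highest information gain. The slide does not say which criterion the course uses, so the following is only a minimal sketch under that assumption. It scores the two categorical attributes (Gender, House) on the seven example rows; the numeric attributes Age and Price would need thresholds and are left out. All function and variable names are illustrative.

```python
from collections import Counter
from math import log2

# Rows copied from the slide: (Age, Gender, House, Price in K, Mortgage?)
ROWS = [
    (21, "M", "Rent", None, "No"),
    (30, "F", "Rent", None, "Yes"),
    (40, "M", "Rent", None, "No"),
    (32, "F", "Buy", 300, "No"),
    (30, "F", "Rent", None, "Yes"),
    (55, "M", "Buy", 260, "No"),
    (25, "F", "Buy", 180, "Yes"),
]
GENDER, HOUSE, LABEL = 1, 2, 4  # column indices


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on column."""
    base = entropy([row[LABEL] for row in rows])
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[LABEL] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {
    "Gender": information_gain(ROWS, GENDER),
    "House": information_gain(ROWS, HOUSE),
}
best = max(gains, key=gains.get)  # attribute a greedy learner would split on first

for name, gain in gains.items():
    print(f"{name}: {gain:.3f} bits")
print("best categorical split:", best)
```

On these seven rows Gender separates the classes far better than House, because all three male customers have no mortgage.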
Applying a classifier (decision tree)
  New customer: (House = Rent, Age = 32, ...)
  prediction = Yes
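
Applying a tree means following its tests from the root down to a leaf. The tree drawn on the slide is not recoverable from the text, so the sketch below uses a hypothetical hand-written tree whose thresholds (25, 35, 200K) were chosen only so that it agrees with the seven example rows. It is not the tree from the lecture.

```python
def predict_mortgage(house, age, price_k=None):
    """Hypothetical decision tree; thresholds are hand-picked, not from the slide."""
    if house == "Rent":
        # Renters in the example table: age 21 -> No, age 30 -> Yes, age 40 -> No.
        if age <= 25:
            return "No"
        if age <= 35:
            return "Yes"
        return "No"
    # Buyers in the example table: 180K -> Yes, 260K and 300K -> No.
    if price_k is not None and price_k <= 200:
        return "Yes"
    return "No"


# New customer from the slide: House = Rent, Age = 32 (other attributes elided).
prediction = predict_mortgage("Rent", 32)
print("prediction =", prediction)
```

For the new customer this hypothetical tree follows the Rent branch and lands in the 26 to 35 age leaf, which agrees with the prediction shown on the slide.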
Graphical interpretation
  dataset with two variables + 1 class (+/-)
  graphical interpretation of decision tree

Graphical interpretation
  dataset with two variables + 1 class (+/-)
  other classifiers
    Support Vector Machine
    Neural Network

Applications of DM
  Marketing
    outgoing
    incoming
  Bioinformatics & Medicine
  Fraud detection
  Risk management
  Insurance
  Enterprise resource planning
Rhinoplastic surgery
  [Figure: histogram over VAE improvement. x-axis: VAE improvement; y-axis: nr. patients; series: all patients, pre E1c > 3]

  VAE improvement   all patients   pre E1c > 3
  0                 1              0
  0.5               5              0
  1                 8              0
  1.5               22             2
  2                 22             2
  2.5               20             2
  3                 8              0
  3.5               9              4
  4                 0              0
  4.5               3              3
  5                 3              3

  [Figure: scatter plot pre E1c vs. VAE improvement. Axes: pre E1c, VAE improvement]
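
The histogram over VAE improvement on this slide compares all patients with the subgroup pre E1c > 3. As a rough illustration of what such a comparison shows, the sketch below estimates the mean improvement of both groups from the binned counts. The counts are read from the slide's histogram data. Treating each bin label as the value of every patient in that bin is an assumption, so the means are approximations only.

```python
# (bin label, count for all patients, count for pre E1c > 3), read from the slide
HISTOGRAM = [
    (0.0, 1, 0),
    (0.5, 5, 0),
    (1.0, 8, 0),
    (1.5, 22, 2),
    (2.0, 22, 2),
    (2.5, 20, 2),
    (3.0, 8, 0),
    (3.5, 9, 4),
    (4.0, 0, 0),
    (4.5, 3, 3),
    (5.0, 3, 3),
]
ALL_PATIENTS, SUBGROUP = 1, 2  # column indices


def group_size(histogram, column):
    """Number of patients counted in the given column."""
    return sum(row[column] for row in histogram)


def binned_mean(histogram, column):
    """Approximate mean, using the bin label as the value for each patient."""
    weighted = sum(row[0] * row[column] for row in histogram)
    return weighted / group_size(histogram, column)


n_all = group_size(HISTOGRAM, ALL_PATIENTS)
n_subgroup = group_size(HISTOGRAM, SUBGROUP)
mean_all = binned_mean(HISTOGRAM, ALL_PATIENTS)
mean_subgroup = binned_mean(HISTOGRAM, SUBGROUP)

print(f"all patients: n={n_all}, approx. mean improvement {mean_all:.2f}")
print(f"pre E1c > 3:  n={n_subgroup}, approx. mean improvement {mean_subgroup:.2f}")
```

Under this approximation the subgroup's distribution sits to the right of the overall one, which is the kind of regularity the "mining for insight" stream looks for.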
InfraWatch: monitoring of infrastructure
  Continuous monitoring of a large bridge
    Hollandse Brug
  145 sensors
  time-dependent, at frequencies up to 100 Hz
  multi-modal (sensor, video, different frequencies)
  managing large data quantities, >1 Gb per day

InfraWatch: monitoring of infrastructure
  34 'geo-phones' (vibration sensors)
  44 embedded strain-gauges, 47 gauges outside
  20 thermometers
  video camera
  weather station
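
A quick back-of-the-envelope check shows why data volume is a concern here. The sketch below adds up the sensor counts listed above and computes an upper bound on the number of readings per day. It assumes that the figure of 145 sensors counts exactly the four sensor types below and that every sensor samples at the maximum rate of 100 Hz. Both are assumptions, and the video stream is not included.

```python
# Sensor counts as listed on the slide
SENSORS = {
    "geo-phones (vibration)": 34,
    "embedded strain gauges": 44,
    "outside strain gauges": 47,
    "thermometers": 20,
}
MAX_SAMPLE_RATE_HZ = 100          # "frequencies up to 100 Hz"
SECONDS_PER_DAY = 24 * 60 * 60

total_sensors = sum(SENSORS.values())

# Upper bound: assumes every sensor runs at the maximum rate all day.
max_samples_per_day = total_sensors * MAX_SAMPLE_RATE_HZ * SECONDS_PER_DAY

print(f"sensors: {total_sensors}")
print(f"at most {max_samples_per_day:,} readings per day")
```

The four counts do add up to the 145 sensors mentioned earlier, and the bound comes to more than a billion readings per day before any video is stored.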
InfraWatch sensors
Real-world application: Maintenance planning at KLM
  Routine checks of aircraft
  Maintenance requires up to 10k different parts
  Ordering parts incurs delay (costs) but so does stocking
  In theory 10k individual predictions
  Input
    maintenance history
    flight history, Sahara/North Pole
  Only a few parts predictable

Cashflow Online
  Online personal finance overview
  All bank transactions are loaded into the application
  Transactions are classified into different categories
  Data Mining predicts category
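67 Categories
  Gas Water Licht (gas, water, electricity)
  Onderhoud huis en tuin (home and garden maintenance)
  Telefoon + Internet + TV (telephone, internet, TV)
  Contributie (sport-)verenigingen (membership fees for (sports) clubs)
  Levensverzekering / Lijfrente (life insurance / annuity)
  Rente ontvangen (interest received)
  Boodschappen (groceries)
  Hypotheekrente (mortgage interest)
  Naar spaarrekening (to savings account)
  Geldopname/chipknip (cash withdrawal / Chipknip)
  Verzekeringen overig (other insurance)
  Loterijen (lotteries)
  Cadeau's (gifts)
  Interne boeking (internal transfer)
  Vakantie & Recreatie (holidays and recreation)
  Uitgaan, hobby's en sport (going out, hobbies and sports)
  Creditcard (credit card)
  Ziektekostenverzekering (health insurance)
  Brandstof (fuel)
  Woonhuis / Opstalverzekering (home / buildings insurance)
  Huishouden overig (other household)
  School- en Studiekosten (school and study costs)
  Inkomsten overig (other income)
  Kleding & Schoenen (clothing and shoes)
  Lenen (borrowing)
  Openbaar vervoer/Taxi (public transport / taxi)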
Fragmented results:
  Boodschappen (groceries)
  Contributie (membership fees)

Decision Tree over all categories
  [Figure: decision tree; branch labels: true, false]

Data Mining at LIACS
  Applications
    bioinformatics (LUMC)
    law enforcement (KLPD, NFI)
    rhinoplastic surgery (NKI)
    Hollandse Brug (Strukton, RWS, Reef Infra)
  Complex data
    graphical data (molecules)
    relational data (criminal careers)
    stream data (sensor-data, click-streams)