Lessons from 2MM machine learning models

Kaggle

The home of data science

GE Flight Quest 2: Optimize flight routes based on weather & traffic

$250,000, 122 teams

Hewlett Foundation: Automated Essay Scoring: Develop an automated scoring algorithm for student-written essays

$100,000, 155 teams

Allstate Purchase Prediction Challenge: Develop an automated scoring algorithm for student-written essays

$50,000, 1,570 teams

Merck Molecular Activity Challenge: Help develop safe and effective medicines by predicting molecular activity

$40,000, 236 teams

Higgs Boson Machine Learning Challenge: Use the ATLAS experiment to identify the Higgs boson

$13,000, 1,302 teams

The Kaggle Approach

Training Data

Age   Income    Default
58    $95,824   True
73    $20,708   False
59    $82,152   False
66    $25,334   True

Test Data

Age   Income    Default
73    $53,445
61    $36,679
47    $90,422
44    $79,040
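The training/test split on this slide can be sketched in a few lines of Python: competitors see the Default column only for the training rows, submit predictions for the test rows, and the host scores them against labels it keeps private. The held-out labels, the `score_submission` helper, the `toy_model` rule, and the use of plain accuracy as the metric are all assumptions made for illustration; the slides do not specify any of them.

```python
# Training rows as shown on the slide: (age, income, defaulted).
# A real entry would fit a model to these; the toy rule below does not.
train = [
    (58, 95824, True),
    (73, 20708, False),
    (59, 82152, False),
    (66, 25334, True),
]

# Test rows as shown on the slide: the Default column is withheld.
test = [(73, 53445), (61, 36679), (47, 90422), (44, 79040)]

# HYPOTHETICAL held-out labels, known only to the competition host.
hidden_labels = [False, True, True, False]


def score_submission(predictions, truth):
    """Fraction of test rows predicted correctly (accuracy is an assumed metric)."""
    if len(predictions) != len(truth):
        raise ValueError("submission must contain one prediction per test row")
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def toy_model(age, income):
    """HYPOTHETICAL placeholder rule standing in for a competitor's model."""
    return income < 40000


submission = [toy_model(age, income) for age, income in test]
leaderboard_score = score_submission(submission, hidden_labels)
print(leaderboard_score)
```

Because the host alone holds `hidden_labels`, competitors can only improve their score by building models that generalise from the training rows.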

Mapping Dark Matter

[Chart: Competition Progress. Accuracy (lower is better), plotted from Week 1 to End; y-axis labels .0150 and .0170. Highlighted competitor: Martin O’Leary, PhD student in Glaciology, Cambridge U]

“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”

“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”

Mapping Dark Matter

[Chart: Competition Progress. Accuracy (lower is better), plotted from Week 1 to End; y-axis labels .0150 and .0170. Competitors shown:]

Martin O’Leary: PhD student in Glaciology, Cambridge U

Marius Cobzarenco: Grad student in computer vision, UC London

Ali Haissaine & Eu Jin Loc: Signature Verification, Qatar U & Grad Student @ Deloitte

deepZot (David Kirkby & Daniel Margala): Particle Physicist & Cosmologist

Other

We’ve worked with many of the world’s largest companies

Healthcare & Pharma, Consumer Internet, Finance, Industrial, Consumer Marketing, Oil & Gas

$50b+ Beverage Co., Global Bank, Top Credit Card Issuer, Top 5 E&P, Top 20 E&P

Kaggle competitors submit over 100K machine learning models per month

[Chart: Monthly Submissions to Kaggle Competitions, May-10 to May-15; y-axis from 0 to 160,000]

There’s a cookbook for winning competitions on structured data:

1. Explore the data

2. Create and select features

3. Tune parameters and ensemble
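The three cookbook steps can be sketched end to end with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not the method winning teams actually use: the loan records are synthetic, the derived `income_per_year` feature, the single cut-off rule, and bagging as the ensembling method are all hypothetical choices made to keep the example short.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def make_row():
    """One HYPOTHETICAL loan record; defaults are likelier at low income."""
    income = random.randint(20_000, 100_000)
    return {
        "age": random.randint(25, 75),
        "income": income,
        "default": random.random() < (0.7 if income < 45_000 else 0.2),
    }


rows = [make_row() for _ in range(400)]
train, valid = rows[:300], rows[300:]

# Step 1: explore the data by comparing a column across the two classes.
mean_income = {
    label: statistics.mean(r["income"] for r in train if r["default"] == label)
    for label in (True, False)
}

# Step 2a: create a derived feature (train and valid share these dicts).
for r in rows:
    r["income_per_year"] = r["income"] / r["age"]


def accuracy(predict, data):
    return sum(predict(r) == r["default"] for r in data) / len(data)


def tune(feature, data):
    """Step 3a: find the cut-off maximising accuracy of `value < cut-off`."""
    cuts = sorted({r[feature] for r in data})
    scored = [(accuracy(lambda r, c=c: r[feature] < c, data), c) for c in cuts]
    return max(scored)  # (best accuracy, best cut-off)


# Step 2b: select the feature whose tuned rule fits the training data best.
features = ["age", "income", "income_per_year"]
best_feature = max(features, key=lambda f: tune(f, train)[0])

# Step 3b: ensemble by bagging: tune on bootstrap resamples, majority-vote.
cuts = [tune(best_feature, random.choices(train, k=len(train)))[1]
        for _ in range(5)]


def ensemble(r):
    votes = sum(r[best_feature] < c for c in cuts)
    return votes * 2 > len(cuts)


print(best_feature, round(accuracy(ensemble, valid), 3))
```

Scoring on the `valid` rows, which play no part in tuning, mirrors how a competition leaderboard judges models on data the competitor has not fitted to.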

A second cookbook is emerging on computer vision and speech problems. It involves using convolutional neural networks.

The vast majority of time is spent training algorithms when CNNs are applied.
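For readers unfamiliar with the building block behind these networks, the following is a minimal sketch of the sliding-window operation a convolutional layer applies, written in plain Python. The `conv2d` helper, the tiny image, and the edge-detecting kernel are hypothetical illustrations, not anything taken from the slides; real networks learn their kernels and run this operation through optimized libraries.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding 2D cross-correlation, the core op of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# HYPOTHETICAL 3x4 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# HYPOTHETICAL hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

Training repeats this operation across many kernels, layers, and images on every pass over the data, which is one reason training time tends to dominate when CNNs are used.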

Then there are the problems that land in the middle…

Anthony Goldbloom

a@kaggle.com

650 283 9781
