Andrey Gulin, "Introduction to MatrixNet"

Seminar "Using modern information technologies to solve modern problems of particle physics" at Yandex's Moscow office, July 3, 2012. Andrey Gulin, developer, Yandex

Moscow, 03.07.2012

Andrey Gulin

MatrixNet

Why is this relevant? — CERN event classification problem

— CERN solution quality 94.4%

— Yandex MatrixNet solution quality 95.8%

Machine Learning — Deterministic processes -> programming, computer science etc

— Noisy data -> statistics, machine learning etc

— Supervised / Unsupervised / Semisupervised

— Offline/online learning

ML applications — Yandex: ranking, spam classification, user behavior modeling etc

— CERN: event classification etc

— Finance: fraud detection, credit scoring etc

— …

Binary classification problem — Offline supervised problem

— Given samples of 2 classes, predict the class of an unseen sample

— For each sample we know N real-valued features {xi}

Solution quality measures — ROC (receiver operating characteristic) curve

— AUC (area under curve)

Solution quality measures — Precision Recall curve

— BEP (Break Even Point), precision == recall

Solution quality measures — Log likelihood / cross entropy = sum {log P}

— Convex function with derivatives

— Used as a proxy for non-continuous functions like AUC/BEP etc
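The measures above can be sketched in plain Python. This is a toy illustration, not code from the talk; the function names are mine. AUC is computed via its rank interpretation, the probability that a random positive sample scores higher than a random negative one (ties count as half):

```python
import math

def auc(labels, scores):
    """ROC AUC as P(score(pos) > score(neg)); labels are 0/1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_entropy(labels, probs):
    """sum {-log P}: negative log likelihood of the true classes."""
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for y, p in zip(labels, probs))
```

Unlike AUC, cross entropy is a smooth convex function of the predicted probabilities, which is what makes it usable as a training proxy.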

Methods — Nearest Neighbors

— SVM

— Logistic regression (linear regression with logistic transform of the result)

— “Neural” networks = non-linear regression

— Decision Trees

— Boosted Decision Trees

Decision Tree — [diagram: example tree splitting on F1 > 3 at the root, then on F2 > 3 and F1 > 6 in the branches]
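A decision tree like the one on the slide is just nested threshold tests. A hypothetical reconstruction (the branch layout and leaf values are my guesses for illustration):

```python
# Toy decision tree over two features; leaf values are made up.
def predict(f1, f2):
    if f1 > 3:
        # right branch splits again on F1
        return 1.0 if f1 > 6 else 0.7
    else:
        # left branch splits on F2
        return 0.4 if f2 > 3 else 0.1
```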

Bootstrapping — Take N random samples with replacement from the original set

— Easy way to estimate all sorts of statistics over the set

— Including building model of the set
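A minimal sketch of the resampling idea, estimating the sampling distribution of the mean (function name and parameters are illustrative):

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample len(data) points with replacement, n_resamples times,
    and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choice(data) for _ in range(n)) / n
            for _ in range(n_resamples)]
```

The spread of the returned means is a cheap estimate of the uncertainty of the sample mean; the same trick works for any statistic, including a fitted model.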

Boosting — Building a strong model as a combination of “weak models”

— Iterative process; on each iteration we:

— Approximate current residual with the best “weak model”

— Scale new “weak model” by small number

— Add it to the solution

— Approximating loss function gradient instead of residual gives gradient boosting
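The loop above can be sketched with one-split "stumps" as the weak model and squared loss, where the negative gradient coincides with the residual. This is a toy illustration of gradient boosting in general, not MatrixNet itself (which uses deeper oblivious trees); all names are mine:

```python
def fit_stump(xs, residuals):
    """Best single threshold split of a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_iter=50, shrinkage=0.1):
    """Each iteration fits a stump to the residual, scales it by a
    small number, and adds it to the ensemble."""
    models = []
    for _ in range(n_iter):
        pred = [sum(m(x) for m in models) for x in xs]
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        models.append(lambda x, s=stump: shrinkage * s(x))
    return lambda x: sum(m(x) for m in models)
```

With squared loss the residual is exactly the negative gradient; swapping in another loss's gradient (e.g. of cross entropy) turns the same loop into general gradient boosting.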

Overfitting

Boosting — Greedy selection and scaling produce a regularization effect

— If “weak model” is a scaled feature, then Boosting produces L1 regularized solution

— If “weak model” is a greedily constructed decision tree, then Boosting gives a form of hierarchical sparsity constraint

MatrixNet

MatrixNet — MatrixNet is an implementation of the gradient boosted decision trees algorithm

— MatrixNet is a bit different from the standard one:

— Using Oblivious Trees

— Accounting for sample count in each leaf

Oblivious Trees — [diagram: oblivious tree splitting on F1 > 3 at the first level and on the same condition F2 > 3 at both nodes of the second level]
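In an oblivious tree every node at a given depth asks the same question, so a depth-d tree reduces to d (feature, threshold) pairs plus 2^d leaf values, and the leaf index is just a bit pattern. A minimal sketch (the splits and leaf values below are illustrative, loosely following the slide):

```python
def oblivious_predict(features, splits, leaf_values):
    """splits: one (feature_index, threshold) pair per level;
    leaf_values: 2**len(splits) values indexed by the split bits."""
    idx = 0
    for i, (f, t) in enumerate(splits):
        if features[f] > t:
            idx |= 1 << i  # level i contributes one bit of the leaf index
    return leaf_values[idx]

# Example: F1 > 3 at level 0, F2 > 3 at level 1, four leaves.
splits = [(0, 3), (1, 3)]
leaves = [0.1, 0.7, 0.4, 1.0]
```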

Accounting for leaf sample count — Prefer trees with large averages in leaves with many samples

— E.g. multiplying the leaf average by sqrt(N/(N+100)) (N – leaf sample count) produces a better model
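The sqrt(N/(N+100)) factor from the slide, as a one-liner (function name and the `reg` parameter are mine; the slide fixes reg at 100):

```python
import math

def shrunk_leaf_value(residuals_in_leaf, reg=100):
    """Leaf average scaled by sqrt(N / (N + reg)): leaves holding few
    samples are pulled toward zero, well-populated leaves are kept."""
    n = len(residuals_in_leaf)
    avg = sum(residuals_in_leaf) / n
    return avg * math.sqrt(n / (n + reg))
```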

Questions?