Machine learning for a busy developer
Leonid Igolnik
About me
5/31/13
• Based in California
• Java Developer for over 11 years
• More than 13 years of SaaS experience
• Love to Travel
AGENDA
• What is machine learning
• R for Machine Learning
• R demo
• Further reading
Machine learning
Defining machine learning
“Field of study that gives computers the ability to learn without being explicitly programmed” - Arthur Samuel, 1959
Defining machine learning
• Statistics is a set of tools that helps humans learn more about the world so they can make better decisions
• Machine learning is about teaching computers about the world so they can use the knowledge to perform tasks
• Intersection of mathematics, statistics, computer science and software engineering
Problems we are solving
• Job title clustering
• Predicting time to hire based on location, job type, source type, etc.
• Predicting the probability of hire for a candidate given job type, skills, past employer, source type, etc.
• Using the number of hires in Taleo to predict overall job market trends
• And many more
R for Machine Learning
A language and environment for statistical computing, and an open-source alternative to S
R for Machine learning
• Comes with a comprehensive ecosystem of extensions: CRAN
• Provides a variety of statistical and graphical data analysis tools
http://cran.us.r-project.org/
R for Machine learning
http://www.revolutionanalytics.com/what-is-open-source-r/companies-using-r.php
R for Machine learning
Does not always scale well with large data sets
“The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians.” - Bo Cowgill, Google
Statistics, you say?
Teapots and other kitchen tools
What can you use R for?
• Log analysis
• Getting insights into your data
• Creating pretty charts for management…
• Fun
HELLO world
R Basics
• Workspaces
• Variables
  • x = 1 or y <- 3
• Functions
  • c(1, 2, 3)
  • [1] 1 2 3
• Comments
  • 1 + 1 # this is a comment
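The basics above can be tried in an R console. This is a minimal sketch; the variable names (`x`, `y`, `v`, `total`) are illustrative and not taken from the talk.

```r
# Assignment: both `=` and `<-` work at the top level; `<-` is the more common style.
x = 1
y <- 3

# c() is a built-in function that combines its arguments into a vector.
# Note that R is case-sensitive: the function is c(), not C().
v <- c(1, 2, 3)
print(v)            # [1] 1 2 3

# Functions are called by name with their arguments in parentheses.
total <- sum(v)     # adds up the elements of v
print(total)        # [1] 6

# Everything after `#` on a line is a comment and is ignored.
1 + 1 # this is a comment

# Objects created so far live in the workspace; ls() lists their names.
print(ls())
```

In an interactive session, typing an expression such as `v` on its own prints it, so the explicit `print()` calls are only needed when running the code as a script.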
R Data types
• Numeric
• Integer
• Complex
• Logical: TRUE or FALSE
• Character
• Factor: categorical values, similar to an enum
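One way to see these types is to create a value of each and ask R for its class. This is a small sketch; the values and names below are made up for illustration.

```r
# One value of each basic type listed above.
n <- 3.14                                   # numeric (double precision)
i <- 42L                                    # integer: the L suffix forces an integer
z <- 1 + 2i                                 # complex
b <- TRUE                                   # logical
s <- "hello"                                # character
f <- factor(c("full-time", "contract", "full-time"))  # factor (categorical)

# class() reports the type of each value.
types <- sapply(list(n = n, i = i, z = z, b = b, s = s, f = f), class)
print(types)

# A factor stores a fixed set of levels plus an integer code per element,
# which is why it is often compared to an enum.
print(levels(f))       # [1] "contract"  "full-time"
print(as.integer(f))   # [1] 2 1 2
```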
R Data types
• Vectors
  • v1 = c(1, 2, 3, 4)
  • v2 = c(3, 4, 5)
  • v3 = c(v1, v2)
  • v4 = 5 * v3
• Matrix
  • m1 = matrix(c(1,2,3,4,5,6), nrow=2)
  • m2 = matrix(c(1,2,3,4,5,6), nrow=2, byrow=TRUE)
  • m3 = t(m2) # transpose
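The vector and matrix lines above can be run as a short script. The sketch below uses lowercase names and the `nrow` and `byrow = TRUE` arguments that `matrix()` expects; the comments show what each step produces.

```r
# Vectors: c() concatenates, and arithmetic is applied element by element.
v1 <- c(1, 2, 3, 4)
v2 <- c(3, 4, 5)
v3 <- c(v1, v2)     # 1 2 3 4 3 4 5
v4 <- 5 * v3        # 5 10 15 20 15 20 25

# Matrices: by default values fill column by column.
m1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
# m1 is:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE fills row by row instead.
m2 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
# m2 is:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# t() transposes, so the 2 x 3 matrix becomes 3 x 2.
m3 <- t(m2)
print(dim(m3))      # [1] 3 2
```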
R Data types
• Sequences
  • 1:5 [1] 1 2 3 4 5
  • 2^(1:5) [1] 2 4 8 16 32
• Missing values: NA
• Indexing
  • letters[1:3]
  • letters[c(7,9)]
  • letters[-c(1:15)]
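A short sketch of the sequence, missing-value, and indexing bullets above. It uses `letters`, the built-in vector of lowercase letters; the vector `x` is made up for illustration.

```r
# The colon operator builds integer sequences; arithmetic is vectorised.
s <- 1:5            # 1 2 3 4 5
p <- 2^(1:5)        # 2 4 8 16 32

# NA marks a missing value and propagates through most computations.
x <- c(10, NA, 30)
print(is.na(x))                 # [1] FALSE  TRUE FALSE
print(mean(x))                  # [1] NA
print(mean(x, na.rm = TRUE))    # [1] 20

# Indexing is 1-based. A vector of positions selects several elements,
# and negative positions drop elements instead of selecting them.
first_three <- letters[1:3]        # "a" "b" "c"
picked      <- letters[c(7, 9)]    # "g" "i"
rest        <- letters[-c(1:15)]   # "p" through "z"
print(rest)
```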
Sounds like MATLAB, no?
http://www.math.umaine.edu/~hiebeler/comp/matlabR.pdf MATLAB® / R Reference, by David Hiebeler
Your mission should you ….
Your second mission if you choose to accept it….
Kernel density estimation
“In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.” - Wikipedia
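In R, a kernel density estimate can be computed with the built-in `density()` function. The sketch below runs it on synthetic data (a mix of two normal samples generated here, not data from the talk) and checks that the estimated curve integrates to roughly 1, as a probability density should.

```r
set.seed(42)  # make the synthetic sample reproducible

# Synthetic sample: a mixture of two normal distributions (illustration only).
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 0.5))

# density() uses a Gaussian kernel by default and evaluates the estimate
# on an evenly spaced grid (512 points by default).
d <- density(x)

# d$x holds the grid points and d$y the estimated density at each point.
# A simple Riemann sum over the grid should be close to 1.
grid_step <- d$x[2] - d$x[1]
area <- sum(d$y) * grid_step
print(area)

# In an interactive session the estimate can be drawn directly.
if (interactive()) {
  plot(d, main = "Kernel density estimate of a synthetic sample")
}
```

The smoothness of the curve is controlled by the bandwidth, which `density()` picks automatically unless the `bw` argument is supplied.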
Generalized linear model
“In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.” - Wikipedia
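As an illustration of a GLM with a non-normal response, the sketch below fits a logistic regression with R's built-in `glm()`. The data is synthetic, and the column names (`years_experience`, `hired`) and the coefficients used to generate it are hypothetical, loosely echoing the "probability of hire" problem mentioned earlier rather than reproducing the talk's demo.

```r
set.seed(1)  # make the synthetic data reproducible

# Synthetic candidates: the chance of being hired rises with experience.
# The generating coefficients (-2 and 0.3) are arbitrary assumptions.
n <- 500
years_experience <- runif(n, min = 0, max = 15)
true_probability <- plogis(-2 + 0.3 * years_experience)  # inverse logit
hired <- rbinom(n, size = 1, prob = true_probability)    # 0/1 outcome
candidates <- data.frame(years_experience, hired)

# family = binomial with a logit link models a binary response:
# the linear predictor is mapped to a probability through the link function.
fit <- glm(hired ~ years_experience,
           data = candidates,
           family = binomial(link = "logit"))
print(coef(fit))

# type = "response" returns predicted probabilities rather than log-odds.
new_candidates <- data.frame(years_experience = c(1, 10))
pred <- predict(fit, newdata = new_candidates, type = "response")
print(pred)
```

Swapping the `family` argument (for example to `poisson` for counts) changes the response distribution and link while keeping the same modelling interface.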
Typical flow
Model
• Gather
• Cleanse
• Explore
• Theorize
• Create training set
• Train
• Validate
• Refine
• Port
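Several of the steps above (gather, cleanse, explore, create training set, train, validate) can be sketched in a few lines of R. This example uses the built-in `mtcars` data set and a plain linear model as stand-ins; the data, the 70/30 split, and the choice of model are assumptions made for illustration, not the workflow shown in the talk.

```r
set.seed(7)  # make the random split reproducible

# Gather: load a data set that ships with R.
data(mtcars)

# Cleanse: drop rows with missing values (mtcars has none, so this is a no-op).
cars <- na.omit(mtcars)

# Explore: summary statistics and a quick look at one relationship.
print(summary(cars$mpg))
print(cor(cars$mpg, cars$wt))   # heavier cars tend to have lower mpg

# Create training set: hold out 30% of the rows for validation.
train_idx <- sample(nrow(cars), size = floor(0.7 * nrow(cars)))
train   <- cars[train_idx, ]
holdout <- cars[-train_idx, ]

# Train: fit fuel economy as a linear function of weight.
model <- lm(mpg ~ wt, data = train)

# Validate: root mean squared error on rows the model has not seen.
predicted <- predict(model, newdata = holdout)
rmse <- sqrt(mean((holdout$mpg - predicted)^2))
print(rmse)
```

Refining would mean going back to the explore and train steps with different features or models and comparing the held-out error each time.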
Further reading
• http://www.amazon.com/Machine-Learning-Hackers-Drew-Conway/dp/1449303714
• http://cran.r-project.org/doc/manuals/R-intro.pdf
• http://www-stat.stanford.edu/~tibs/ElemStatLearn/
• https://class.coursera.org/datasci-001/class
Q&A
Open Middleware 2.0 community & concept
proj. art. Natalia Borowicz
Douglas Tait, Oracle
Marcin Nowak, Orange Labs
www.openmiddleware.pl
Thank you!