This session was delivered at a SQL Server user group meetup. It is essentially a 101 on the AzureML offering, which makes it easy to create trained models and deploy them for prediction. It does not go into the details of individual algorithms, the process of cleaning up data, or tuning (parameter sweeping, bagging, boosting, bootstrapping).
AzureML: Zero to Hero
Govind Kanshi, MTC Bangalore
2nd August 2014
What we will cover
- AzureML: what it enables
- Examples
- Upload data, understand and explore it
- Develop a model, evaluate it, deploy it
What this discussion is not about
- Data Science / Big Data definitions, uses, etc.
- Advanced ML topics
- Feature engineering: which features are useful, cleaning, dropping (for PCA-style work, use R today)
- Discussion or deep dive of individual algorithms
- Model tuning (parameter sweep) or other techniques such as boosting and bagging
- Overcoming data vagaries
What you should walk out with
- Excitement and confidence that ML with AzureML is doable by all of us, as long as we are curious and patient.
- AzureML is a democratized platform for learning from data, enabling better-informed decisions.
- It brings sophisticated algorithms and mechanisms, in an easy-to-use form, to both the masses and high-end researchers today.
What are we trying to do
- Learn from existing data to make predictions on new data
  - Classification: put labels
  - Regression: predict a value such as price
  - Recommendation: rank choices
  - Examples: classify different behavior, predict a price, recommend, find an anomaly
- Explore data: form natural groupings based on some distance formula
  - Clustering
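To make "natural groupings based on some distance formula" concrete, here is a minimal sketch of one common clustering approach (k-means style, on one-dimensional points). The slide does not name a specific algorithm, so the choice of k-means, the function name `kmeans_1d`, the starting centroids and the toy data are all assumptions made for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around the nearest centroid (k-means style sketch)."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            # Distance formula here is plain absolute difference.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups


# Toy data (illustrative only): two visibly separate clumps of values.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, groups = kmeans_1d(points, centroids=[0.0, 5.0])
print(groups)
```

No labels are supplied; the groups emerge purely from how close the points are to each other, which is what separates clustering from classification.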
Demo
- Deployed model for a public dataset to classify whether a person has diabetes
- Deployed model to predict decibels of noise
How old is this stuff?
- The term "regression" first appears in the biological works of Galton (1822-1911).
- Y = a_1 * X_1 + ... + a_n * X_n ... Solve for ...
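As a rough illustration of the regression formula on this slide, the sketch below fits the single-feature case by ordinary least squares. It assumes one feature plus an intercept term `b` (the slide's formula has no intercept), and the function name `fit_line` and the toy data are made up for the example; this is not what AzureML runs internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx            # slope that minimises squared error
    b = mean_y - a * mean_x  # intercept (an assumption beyond the slide's formula)
    return a, b


# Toy data generated from y = 2x + 1, so the fit should recover a = 2, b = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```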
What did we see
- Web service exposed in raw format to do prediction as request-response
Demo
- Walkthrough of model creation for classification
- Possibly choose another algorithm to compare/evaluate
What did we see
- AzureML Studio: Experiments, Datasets, Web services
- Web services: request-response (RR) or batch mode
- Algorithms: classification, regression, recommendation, ranking
- Data ingestion, cleansing, massaging, R integration
- Datasets/Experiments are immutable; new versions can be deployed
What did we do (typical AzureML path)
- Define the goal: regression, classification or recommendation
- Create a model and train it using a dataset
  - Get data
  - Clean up the data or replace missing data if required
  - Use the appropriate algorithm and train it
  - Score the model with test data
  - Look at the algorithm parameters
- Evaluate the model using metrics
- Add more algorithms to compare
- Deploy the model as a web service for the request-response mechanism
  - What about batch? Yes, you can.
- Data exploration: visualization of data/results
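The path above (get data, train, score with test data, evaluate) can be sketched outside AzureML in a few lines of plain Python. Everything here is an assumption for illustration: the toy rows, the fixed hold-out rule, and the nearest-mean classifier standing in for "the appropriate algorithm". In AzureML Studio these steps are done with modules on the canvas rather than code.

```python
# Toy labelled rows: (feature value, label). Purely illustrative data.
rows = [(1.0, "no"), (1.5, "no"), (2.0, "no"), (2.5, "no"),
        (6.0, "yes"), (6.5, "yes"), (7.0, "yes"), (7.5, "yes")]

# 1. Split: hold out every fourth row for testing.
train = [r for i, r in enumerate(rows) if i % 4 != 3]
test = [r for i, r in enumerate(rows) if i % 4 == 3]


# 2. Train: learn the mean feature value of each class.
def train_centroids(data):
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


# 3. Score: predict the class whose mean is closest to the input.
def predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))


model = train_centroids(train)
predictions = [predict(model, x) for x, _ in test]

# 4. Evaluate: fraction of held-out rows predicted correctly.
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Swapping in a second `train`/`predict` pair and comparing the two accuracies on the same held-out rows is the code equivalent of "add more algorithms to compare".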
Evaluate Models summary (classification)
- Confusion matrix
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-score
- ROC curve + AUC (area under the ROC curve)

                 Predicted yes          Predicted no
Actual yes       True positive (TP)     False negative (FN)
Actual no        False positive (FP)    True negative (TN)
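The precision and recall formulas above, together with the F1-score (the harmonic mean of the two), can be computed directly from confusion-matrix counts. This is a minimal sketch: the function name `classification_metrics` and the example counts are invented for illustration, and it assumes the denominators are non-zero.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / (tp + fp)   # of everything predicted "yes", how much was right
    recall = tp / (tp + fn)      # of everything actually "yes", how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20)
print(metrics)
```

True negatives do not appear in any of these three metrics, which is why they can look healthy or poor independently of how many "no" cases the model gets right.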
Issues to think about
- Cleaning and choosing the right data points
  - Missing data, transforming data, dropping data, relationships between features
- Evaluating the algorithm, comparing, tuning the parameters, relearning
- Which algorithm to choose (Boolean classification vs 10-class vs ranking)
- Data with many attributes (thousands up to five digits) vs very little data or very sparse/noisy data
- Which loss function and hyperparameters to aim for
- Explaining the output: black box vs decision trees
- Online/active learning
Machine Learning Resources
- Coursera Machine Learning class: https://www.coursera.org/course/ml
- Access to AzureML (it is in preview): http://www.youtube.com/watch?v=wjTJVhmu1JM
- Draft of the Alex Smola and Vishy book on ML: http://alex.smola.org/drafts/thebook.pdf
- Elements of Statistical Learning, Hastie, Tibshirani et al.: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
- Information Theory, Inference, and Learning Algorithms, David MacKay: http://www.inference.phy.cam.ac.uk/mackay/itila/
- Datasets: http://archive.ics.uci.edu/ml/datasets.html
- Official AzureML tutorials and video walkthroughs: https://azure.microsoft.com/en-us/documentation/services/machine-learning/
Advanced topics
Other topics
- How to use various input data cleanup procedures (dropping/adding/correlated features)
- How to publish a web service to Azure Marketplace ($): https://azure.microsoft.com/en-us/documentation/articles/machine-learning-publish-web-service-to-azure-marketplace/
- How do you version assets/DAG
Techniques to overcome vagaries of data
- Stratification: sampling for training and testing within classes, to overcome issues with how classes are represented in data samples.
- k-fold CV: data is split randomly into k subsets; each subset is used for testing and the remainder for training. This is repeated and the results are averaged. CV uses sampling without replacement.
- Bootstrapping: uses sampling with replacement to form the training set.
Increasing performance of the model
- Bagging: combining predictions by voting, or by averaging for numeric prediction.
- Boosting: uses voting/averaging, but models are weighted according to their performance.
- Parameter sweeping
- Regularization parameter handling: penalty for overfitting
- Understanding the algorithm performance; visualization of the algorithm path when possible
- Associated statistics (confidence/distributions)
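The difference between k-fold CV (sampling without replacement) and bootstrapping (sampling with replacement), and the voting step used in bagging, can be sketched as follows. The helper names `k_fold_indices`, `bootstrap_indices` and `bagged_vote` are hypothetical and written for this example only; they work on row indices and predicted labels rather than on any real dataset or AzureML module.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each index is used for testing exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random split, no replacement
    folds = [indices[i::k] for i in range(k)]     # k disjoint subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def bootstrap_indices(n, seed=0):
    """Draw n indices with replacement, so some rows repeat and some are left out."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def bagged_vote(predictions):
    """Combine class predictions from several models by majority vote."""
    return Counter(predictions).most_common(1)[0][0]


splits = list(k_fold_indices(n=10, k=5))
boot = bootstrap_indices(n=10)
vote = bagged_vote(["yes", "no", "yes"])
print(len(splits), boot, vote)
```

In a bagging setup, each model would be trained on its own `bootstrap_indices` sample and `bagged_vote` would merge their predictions; boosting differs in that each model's vote is weighted by its performance, which this sketch does not cover.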