Activ Steps

DESCRIPTION

Practical Data Science - online course

4/14/2014 ActivSteps

http://activsteps.com/PraticalDataScience.html 1/11

PRACTICAL DATA SCIENCE

Module 1: Data Science Essentials

Unit 1 - Introduction to Data Science

What is Data Science?

Disciplines that make up Data Science

What does a data scientist do with the data?

Prerequisites and Resources (Statistics, Mathematics, Computer Science)

Business Modeling

Why should we build models or use data to run a business?

Tribal knowledge/Intuition vs. Evidence

What kind of models do data scientists build? Limitations of Models

What problems need a prediction?

How do you evaluate the accuracy of predictions?

Understanding Data

Understanding Data Types

Data Preprocessing and Transformation

Approximations and estimations

Analyzing networks and graphs

Representing data Graphically and analytically

Data Science Applications

Churn Analysis

Data Preprocessing and Transformation

Recommendations

Pattern recognition and learning algorithms

Unit 2 - Reviving with R (Tutorials)

Basic Data Types

Vector

Matrix

List

Data Frame

Data Import and export

Control Structures

Some important R Packages and Functions (plyr, apply)

Unit 3 - Python for Data Analysis

IPython IDE

NumPy Basics

Pandas Basics

Data handling (Loading, Storage and file formatting)

Data Wrangling (Clean, Transform, Merging)

Handling missing values

Binning, Classing and Standardization

Outlier/Noise

Type Conversion

Unit 4 - Data Analysis

Basic Exploratory Data Analysis

Data exploration (histograms, bar chart, box plot, line graph, scatter plot)

Qualitative and Quantitative Data

Central Tendencies: Mean, Median, Mode

Dispersion: Range, Variance, Standard Deviation

Anscombe's quartet

Other Measures: Quartile and Percentile, Interquartile Range, Skew and Kurtosis

Relationship between attributes: Covariance, Correlation Coefficient, Chi-Square

Moment Generating Functions (Random Data)

Principal Component Analysis
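
As a small illustration of the central tendency and dispersion measures listed in this unit, the sketch below computes them with Python's standard `statistics` module. The function name `summarize` and the sample values are assumptions made for illustration, not course material.

```python
import statistics


def summarize(values):
    """Return basic measures of central tendency and dispersion."""
    ordered = sorted(values)
    # Quartile cut points (default "exclusive" method of statistics.quantiles).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "variance": statistics.pvariance(ordered),  # population variance
        "std_dev": statistics.pstdev(ordered),      # population standard deviation
        "iqr": q3 - q1,                             # interquartile range
    }


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration
summary = summarize(sample)
print(summary)
```

The population forms (`pvariance`, `pstdev`) are used here; `statistics.variance` and `statistics.stdev` would give the sample estimates instead.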

Unit 5 - Data Visualization

The science and the art

Science of Visualization

Visualization Periodic Table

Aesthetics and Storytelling

Data Visualization using D3.js, Google App Engine & Charts, and ggplot2

Bubble charts

Gauge charts

Tree map

Heat map

Motion charts

Force Directed Charts etc.

Unit 6 - Processing Unstructured Data

Text Pre-processing

Regular Expressions

Sentence Splitting and Tokenization

Find Unique words and count

Punctuation and Stopwords

Incorrect spellings

Basic Natural Language Processing

Properties of words

Lemmatization and Term-Document (TxD) Matrix Computation

Bag-of-words

Similarity measures (Cosine Similarity, Chi-Square, N Grams)

Part-of-Speech Tagging

Stemming

Chunking
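
The tokenization, bag-of-words and cosine similarity topics in this unit can be sketched together in a few lines. This is a minimal illustration using only the standard library; the regular expression, the function names and the example sentences are assumptions, and a real pipeline would typically also remove stopwords and apply stemming or lemmatization.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors."""
    shared = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[term] * bag_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_1 = bag_of_words("The cat sat on the mat.")
doc_2 = bag_of_words("The cat lay on the rug.")
print(cosine_similarity(doc_1, doc_2))
```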

Module 2: Computing at Scale

Unit 7 - Processing Big Data

Essential Hadoop

Distributed Computing for Scale and Price/Performance

HDFS Overview and Architecture

Functional Programming model

Evolution and overview of MapReduce

MapReduce Data flow

Working with Hadoop

Different Types of Installations

Demo VM Image setup

Linux and HDFS

Unit 8 - MapReduce (MR) Programming

Core MR Programming

Hadoop Data Types

Basic MapReduce API Concepts

Input Splits, Shuffling, Sorting, Combining

Custom Writable & WritableComparables

Combiners & Partitioners

Streaming (in Python)

Word Co-occurrence and N-grams

Inverted Index

TF-IDF

Page Rank
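
To make the map, shuffle and reduce stages named in this unit concrete, here is a single-process word count written in plain Python. It only imitates the data flow; it does not use the Hadoop API or Hadoop Streaming, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["to be or not to be", "to do is to be"]))
```

In a real cluster the mapper and reducer run as separate tasks on different nodes, and the framework performs the shuffle between them.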

Unit 9 - MR Algorithms for Data Scientists

Map Reduce Applications

Graph Processing

Sample ML Algorithm

Pandas Basics

Hadoop eco-system

Sqoop

Flume

Mahout

Unit 10 - Pig : Dataflow Language

Introduction to Pig

Pig Data Model

Input and Output

Relational Operations

User Defined Functions

Pig (Tutorial)

Unit 11 - Hive : Data Warehouse Framework

Introduction to Hive

Hive Architecture

Data Definition and Manipulation

Data Model

Data Handling and Modeling

Pig and Hive Comparison

Hive (Tutorial)

Unit 12 - NoSQL Concepts

NoSQL Introduction

ACID vs. BASE

CAP Theorem

NoSQL DBs (Key-value, Columnar, Document, Graph)

NoSQL Modeling

Relational Schema to Key Value and Document Stores

Relational Schema to Graph Stores

Designing a NoSQL Database for Twitter

Building an Application with NoSQL

Designing a Social Media application

Exploring the design of Twitter.com

Unit 13 - HBase and Neo4j

HBase Overview

HBase Concepts, Architecture

Data Model

HBase Commands

Neo4j Overview

Concepts and Data Modeling

Cypher Query Language

Graph Search and applications

Unit 14 - Cassandra and MongoDB

Cassandra Overview

Cassandra Concepts, Architecture

The Cassandra Data Model

Introduction to Clusters

Gossip and Failure Detection

Compaction, Bloom Filters, Tombstones

Reading and Writing Data

Cassandra Commands

MongoDB Overview

Concepts and Architecture

Schema Design

Data Manipulation - CRUD Operations

Aggregations

Module 3: Predictive Analytics

Unit 15 - Statistical Thinking

Probability Concepts

Statistical Distributions

Normal Distribution (when the data is a continuous numeric variable)

Binomial Distribution (when the response data is binary)

Poisson Distribution (when the data is count-based)

Exponential Distribution (useful for survival-analysis data)

Central Limit Theorem

Analysis of Variance (ANOVA)

Bayesian Statistics

Bayesian analysis

Prior probability (Naive Density Estimator)

Conditional probability (Joint Density Estimator)

Comparing two proportions

Posterior probability

Bayes Theorem
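
The prior, conditional and posterior probabilities listed above are tied together by Bayes' theorem, P(H|E) = P(E|H) P(H) / P(E). The sketch below applies it to a two-hypothesis case; the function name, its parameters and the numbers in the example are assumptions made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Posterior P(H | E) from Bayes' theorem for a binary hypothesis H.

    prior               -- P(H), belief in H before seeing the evidence
    likelihood          -- P(E | H), chance of the evidence when H is true
    false_positive_rate -- P(E | not H), chance of the evidence when H is false
    """
    # Total probability of the evidence, summed over both hypotheses.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical figures: 1% of customers churn, the alert fires for 90% of
# churners and for 5% of non-churners.
print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05))
```

With a low prior, the posterior stays modest even when the likelihood is high, which is the usual motivation for comparing prior and posterior side by side.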

Useful Statistical Inferences about Business Outcomes

Concepts of Hypothesis Testing

Testing for equality of variances of two samples

Comparing the equality of means of two samples

Comparing two proportions

Correlation between two samples

Tests on a two-variable contingency table

Unit 16 - Statistical Analysis

Introduction to Regression

Regression (Linear, Multivariate Regression) in forecasting

Analyzing and interpreting regression results

Multi-collinearity

Logistic Regression

Forecasting

Trend analysis and Time Series

Cyclical and Seasonal analysis

Box-Jenkins method

Smoothing and Moving averages

Auto-correlation

ARIMA and Holt-Winters methods

Sales Prediction

Time Series Decomposition of Cement Sales by Quarter

Predicting the sales for the next four quarters
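
As a pointer to what the smoothing and moving averages topic involves, the sketch below computes a simple trailing moving average. The function name and the quarterly figures are invented for illustration; they are not the cement sales data used in the course.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[start:start + window]) / window
        for start in range(len(series) - window + 1)
    ]


# Hypothetical quarterly sales with a repeating seasonal dip in the fourth quarter.
quarterly_sales = [120, 135, 150, 110, 130, 145, 160, 118]

# A window of 4 covers one full year, so the seasonal pattern is averaged out
# and the remaining values approximate the trend.
print(moving_average(quarterly_sales, window=4))
```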

Module 4: Text Mining and Machine Learning

Unit 17 - Applications of Text Analysis

Fundamentals of Information Retrieval

Data Collection and Structuring

Tools and Techniques for Data Collection from Facebook, Twitter, etc.

Data Storage Options, Standardization and Preparation for Analysis

Text classification and feature selection:

How to use Naive Bayes classifier for text classification

Evaluation systems on the accuracy of text mining

Locality-Sensitive Hashing

Applications

An introduction to text mining for sentiment analysis.

Text Analysis (Tutorial)

Twitter and Email Analysis

Social Graphs and Segmentation

Spam Filtering or Text Classification

Shakespeare Text Analysis

Unit 18 - Social Media Analysis

Data Collection and Structuring

Tools and Techniques for Data Collection from Facebook, Twitter, etc.

Data Storage Options, Standardization and Preparation for Analysis

Mining Social Media

Topic Mining and Trending

Social Media Analytics: Clustering, Regression, etc.

Twitter Analysis with Mahout

Social Graphs (Neighbor analysis and Community Detection)

Unit 19 - Introduction to Machine Learning

Data Mining and Machine Learning

Decision Trees

Recommender Systems

User based

Item Based

Singular value decomposition-based recommenders

Text classification and feature selection:

How to use Naive Bayes classifier for text classification

Evaluation systems on the accuracy of text mining

Locality-Sensitive Hashing

Similarity Measures:

Pearson correlation

Spearman correlation

Euclidean distance

Cosine measure

Tanimoto coefficient

Log-likelihood test

Cases

Decision Tree Classifier (Iris Data Set)

Movie Ratings and recommendations

Unit 20 - Predictive Modeling

Pre-processing for Analytics

Creating standard data sets - Training, Testing and Validation

Data Sampling and methods (Data reduction, Modeling, Balancing, Over/Under-sampling)

Feature selection (Feature creation, bundling, ranking)

Analyzing the goodness of Models

Structure and anatomy of models

Characteristics of good models

Concepts of under-fitting and over-fitting of models

Evaluating Performance of a Model

Likelihood ratio

Scoring and Bagging

Evaluation Measures:

Confusion matrix

ROC curve

Lift

KS test
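
The confusion matrix listed above is the starting point for most of the other performance measures. The sketch below builds one for a binary classifier and derives accuracy, precision and recall from it; the function names and the label vectors are assumptions for illustration only.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def scores(counts):
    """Derive accuracy, precision and recall from the four counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


actual = [1, 1, 0, 0, 1, 0]     # made-up ground-truth labels
predicted = [1, 0, 0, 1, 1, 0]  # made-up model output
matrix = confusion_matrix(actual, predicted)
print(matrix, scores(matrix))
```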

Unit 21 - Supervised Learning

Classification

Naive Bayes classifier

Bayesian belief networks

Unit 22 - Unsupervised Learning

Clustering

Similarity measures for grouping objects

Connectivity models (hierarchical clustering)

Partition Clustering

Analyzing clustering results

Using Clustering for Prediction

Clustering Techniques

K-Nearest Neighbor method

Wilson editing and triangulations

K-nearest neighbors in collaborative filtering, digit recognition
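
A minimal sketch of the K-nearest neighbor method named in this unit: each query point takes the majority label of its k closest training points under Euclidean distance. Note that in this form the method needs labeled examples. The function name, the value of k and the toy points are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train -- list of (point, label) pairs, where point is a tuple of numbers
    query -- tuple of numbers with the same length as the training points
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two made-up, well-separated groups of 2-D points.
training_data = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
print(knn_predict(training_data, (2, 2)))
print(knn_predict(training_data, (8.5, 8.5)))
```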

Mentors

ReddyRaja Annareddy

Founder and CEO, Akrantha Software; Faculty IIIT-Hyd

Surya Putchala

Founder and CEO Zettamine

Pavan Kumar Penjandra

Unit 23 - Other Techniques

Linear learning machines

Prediction using Linear Regression

Parameters Estimation in Regression

Analysis of Regression Results

Use of Regression in Analysis of Variance

Support Vector Machines (SVM)

Ensemble Techniques

Ensemble and Hybrid models

AdaBoost, Random Forests and Gradient boosting machines

Neural Networks and its applications

Perceptron and Single Layer Neural Network, and hand calculations

Back propagation and conjugate gradient techniques

Applications: Face and Digit Recognition

Face Recognition with SVD, Eigenvectors

Unit 24 - Tackling a Data Science Project

Data Science Development Framework

Cases: Community Detection, Recommender Engine for a Job Portal, Ad Serving Platform

Analyzing the case study

Developing a Solution Architecture

Developing a Technical Architecture

Developing a Plan for the Tools and Algorithms

Big Data Engineer

Karthik Reddy

Data Scientist

© 2013 ACTIVSTEPS INC. ALL RIGHTS RESERVED.