Data Science Methodology

  • 8/19/2019 Data Science Methodology

    1/12

    © 2015 IBM Corporation

    Foundational Data Science Methodology

    John B. Rollins, Ph.D.

    IBM Analytics | IBM Corporation

    Introduction

    ! Why we are interested in data science

    -  Solve problems and answer questions

    -  Gain useful insights through modeling to predict outcomes or discover underlying patterns

    ! Rapidly evolving technologies

    -  Platform growth

    -  In-database analytics

    -  Text analysis

    -  Automation

    Data science methodology

    ! Why?

    -  To provide a guiding strategy

    ! What?

    -  General strategy that guides the processes and activities within a given domain

    -  Does not depend on particular technologies or tools

    -  Not a set of techniques or recipes

    -  Provides the data scientist with a framework for how to proceed to obtain answers

    Methodology diagram

    [Diagram of the methodology stages: Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]

    Business understanding

    ! Every project begins with business understanding.

    -  Clearly define project objectives and requirements from the business perspective; this is key to a successful solution

    -  Business sponsors most critical in this stage

    •  Define problem and solution requirements

    -  Business sponsors involved throughout the project

    •  Provide domain expertise

    •  Review intermediate findings

    •  Ensure that the work generates the intended solution

    Analytic approach

    ! With a clear definition of the business problem, we define the analytic approach to solving the problem.

    -  Express problem in context of statistical and machine learning techniques

    -  Identify suitable technique(s)

    -  Examples

    •  Classification to predict response to a promotion ("yes" or "no")

    •  Clustering and associations for customer segmentation and market basket analysis
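
The market basket example can be made concrete with two standard association measures, support and confidence. The following is a minimal sketch in Python (the slides do not prescribe any language or tool); the baskets and the `rule_stats` helper are hypothetical and exist only for illustration.

```python
# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(baskets)
    with_antecedent = sum(1 for b in baskets if antecedent in b)
    with_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    # Support: share of all baskets that contain both items.
    support = with_both / total
    # Confidence: share of antecedent baskets that also contain the consequent.
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(baskets, "bread", "milk")
print(f"bread -> milk: support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high support and high confidence is a candidate pattern to review with the business sponsors; the thresholds to use depend on the problem.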

    Data compilation

    ! The chosen analytic approach determines the data requirements.

    -  Content, formats, representations

    ! Initial data collection is performed.

    -  Available data resources (structured, unstructured, semi-structured) relevant to the problem domain

    -  Decide whether to obtain less-accessible data elements

    -  Revise data requirements or collect more data, if needed

    ! Then data understanding is gained.

    -  Descriptive statistics and visualization

    -  Content, quality, initial insights about data

    -  Additional data collection to fill gaps, if needed
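
As an illustration of the descriptive statistics step, the sketch below summarizes one column and counts its missing values using only the Python standard library. The `ages` data and the `describe` helper are assumptions made for this example, not part of the methodology itself.

```python
import statistics

# Hypothetical column of customer ages; None marks a missing value.
ages = [34, 45, None, 23, 45, 61, None, 38]


def describe(values):
    """Summarize content and quality of one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                   # usable observations
        "missing": len(values) - len(present),   # gaps that may need more collection
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),      # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A large `missing` count or an implausible `min` or `max` is the kind of initial insight that can trigger additional data collection.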

    Data preparation

    ! Data preparation encompasses all activities to construct the data set.

    -  Data cleaning

    •  Missing or invalid values

    •  Eliminating duplicate rows

    •  Formatting properly

    -  Combining multiple data sources

    -  Transforming data

    -  Feature engineering

    -  Text analysis

    ! Accelerate data preparation by automating common steps
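
The cleaning steps listed above (missing or invalid values, duplicate rows, formatting) can be automated with a small reusable function. This is a sketch under stated assumptions: the records, the field names, and the `parse_age` and `prepare` helpers are hypothetical, and invalid or missing ages are flagged as `None` rather than imputed.

```python
# Hypothetical raw records; all values and field names are made up.
raw = [
    {"id": "1", "name": " Alice ", "age": "34"},
    {"id": "2", "name": "BOB", "age": ""},         # missing age
    {"id": "1", "name": " Alice ", "age": "34"},   # exact duplicate row
    {"id": "3", "name": "carol", "age": "-5"},     # invalid age
]


def parse_age(text):
    """Return a non-negative int, or None when the value is missing or invalid."""
    try:
        age = int(text.strip())
    except ValueError:
        return None
    return age if age >= 0 else None


def prepare(rows):
    """Drop exact duplicate rows, normalize formatting, and flag bad values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # identical rows produce identical keys
        if key in seen:
            continue  # eliminate duplicate rows
        seen.add(key)
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip().title(),  # consistent name formatting
            "age": parse_age(row["age"]),
        })
    return cleaned


for record in prepare(raw):
    print(record)
```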

    Modeling

    ! Modeling focuses on developing models.

    -  Predictive or descriptive models

    -  According to the previously defined analytic approach

    -  Training set for predictive modeling

    ! Highly iterative process

    -  Intermediate insights → refinements in data preparation & model specification

    -  Multiple algorithms & parameters to find best model for a given technique
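
One way to picture the search over parameters is to fit the same technique with several settings and keep the one that scores best on held-out data. The sketch below does this for a tiny k-nearest-neighbors classifier. Everything in it is an assumption for illustration: the points, the deliberately mislabeled training point, the separate validation set, and the candidate values of `k`.

```python
from collections import Counter

# Hypothetical labeled points: ((feature_1, feature_2), label).
train = [
    ((1.0, 1.0), "no"), ((1.5, 2.0), "no"), ((2.0, 1.5), "no"),
    ((6.0, 6.5), "yes"), ((7.0, 6.0), "yes"), ((6.5, 7.5), "yes"),
    ((2.1, 2.0), "yes"),  # deliberately mislabeled point (noise)
]
validation = [
    ((1.2, 1.4), "no"), ((6.8, 6.9), "yes"),
    ((2.2, 2.1), "no"), ((5.9, 6.1), "yes"),
]


def predict(train_set, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    def squared_distance(item):
        (x1, x2), _ = item
        return (x1 - point[0]) ** 2 + (x2 - point[1]) ** 2

    nearest = sorted(train_set, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_set, eval_set, k):
    """Fraction of points in `eval_set` that are classified correctly."""
    hits = sum(predict(train_set, x, k) == y for x, y in eval_set)
    return hits / len(eval_set)


# Try several parameter values and keep the first one with the best score.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Here `k = 1` is misled by the noisy point while larger values are not, which is the kind of intermediate insight that feeds back into data preparation and model specification.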

    Model evaluation

    ! Model evaluation is performed during model development and before model deployment.

    -  Understand the model’s quality

    -  Ensure that it properly addresses the business problem

    ! Diagnostic measures

    -  Suitable to the modeling technique used

    -  Testing set

    -  Refine model as needed

    ! Statistical significance tests
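
For a "yes"/"no" classifier, common diagnostic measures on a testing set are accuracy, precision, and recall, all derived from the confusion matrix. The sketch below computes them; the labels and the `binary_metrics` helper are hypothetical, and the measures suitable for other modeling techniques will differ.

```python
def binary_metrics(actual, predicted, positive="yes"):
    """Confusion-matrix counts and derived measures for a two-class model."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        # Of the predicted positives, how many were right?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the actual positives, how many were found?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical testing-set labels and model predictions.
actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

metrics = binary_metrics(actual, predicted)
print(metrics)
```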

    Deployment and feedback

    ! Once finalized, the model is deployed into a production environment.

    -  May be in a limited / test environment until model is proven

    -  Involves additional groups, skills, and technologies

    •  Solution owner

    •  Marketing

    •  Application developers

    •  IT administration

    ! Feedback to assess model performance

    -  Gathering and analysis of feedback for assessment of the model’s performance and impact

    -  Iterative process for model refinement and redeployment

    -  Accelerate through automated processes
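
An automated feedback check could be as simple as comparing recent performance with the level measured at deployment. The sketch below is one possible rule, not a prescribed one: the `needs_refinement` helper, the accuracy history, the baseline, the tolerance, and the window size are all hypothetical.

```python
def needs_refinement(batch_accuracies, baseline, tolerance=0.05, window=3):
    """Flag the model when the mean accuracy of the last `window` feedback
    batches falls more than `tolerance` below the deployment baseline."""
    recent = batch_accuracies[-window:]
    if len(recent) < window:
        return False  # not enough feedback collected yet
    return sum(recent) / len(recent) < baseline - tolerance


# Hypothetical accuracy per feedback batch since deployment.
history = [0.81, 0.80, 0.79, 0.74, 0.72, 0.70]

if needs_refinement(history, baseline=0.80):
    print("Performance has degraded: schedule refinement and redeployment.")
```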

    Ongoing value through good methodology

    ! Methodology diagram illustrates the iterative nature of problem-solving in a data science project.

    ! Through feedback, refinement, and redeployment, models are continually improved and adapted to evolving conditions.

    ! The model continues to provide value to the organization for as long as the solution is needed.