
White Paper

Creating an Agile Closed Loop Data Science Workflow with Datameer and IBM Cloud Private for Data


Introduction

There’s a lot of hype and hoopla around AI and machine learning today. Whether you’re looking at technology trade publications and websites, analyst reports, or even the mainstream press, AI is talked about everywhere. And its potential is indeed impressive.

There are some significant factors limiting the success of AI in the enterprise, though, because it can be hard to integrate into operational processes. Part of the reason is that AI tools are, to varying degrees, primitive and segregated from mainstream analytics tools. In reality, AI should be connected to all the other work you do with data; it can’t sit in an isolated zone unto itself. Data science is not an island.

If you’re going to do machine learning work right, you’re going to need well-honed data sets on which to build your models. It is not just about cleaning the data. It is about finding more data to increase accuracy, and discovering data that may be more relevant to the problem at hand.

Getting there isn’t just about “preparing” the data, either. It’s about exploring it and understanding it – achieving intimacy with the data, grasping its content and meaning in a fundamental way. It takes far more than simple preparation to do that. It takes deep analysis and exploration. And that requires advanced analytics and machine learning work to be done in tandem and in harmony.

The industry and ecosystem have largely segregated these two subdisciplines. Given that the specializations in each are legitimately different, that is understandable. But – understandable or not – it’s not something we should settle for. Machine learning without analytics is machine learning that is less efficient, more expensive and less enlightened.

And given the way society is starting to delegate certain business-critical, and even socially impactful, decisions to machine learning models, we have a deep obligation to make those models as accurate as possible. Pairing strong analytics with ML isn’t just good form; it’s essential, from a societal point of view.

The good news is that this integration of analytics and machine learning can be done relatively easily and effectively, and that can be shown in a practical, matter-of-fact fashion. In this report, we’ll illustrate how to create a more efficient and effective ML workflow, carried out using a rich data preparation and exploration platform (in this case Datameer) combined with a machine learning platform and programming environment (IBM Cloud Private for Data, utilizing Apache Spark and Jupyter notebooks).

IBM Cloud Private for Data (ICP4D) is an open, cloud-native environment for AI. With this integrated, fully governed team platform, you can keep your data secure at its source and flexibly add your preferred data and analytics microservices. Datameer – which is available from ICP4D as an add-on – is an advanced data preparation and exploration platform that works especially well with big data. We’ll highlight the specific features of the Datameer platform that complement the machine learning process, and we’ll see how a strong symbiosis between Datameer and ICP4D can be fashioned to create a faster workflow that produces better ML models and results.

We’ll also show how the cooperative relationship can work both ways – with data being sent from Datameer to ICP4D to build and test models, and the results of that testing, with predictive output, fed back into Datameer for further analysis. We’ll show how this loop can repeat iteratively – setting up a virtuous analytics-ML workflow cycle.


Our Example

Before we cover the rigors of the workflow for ML, it’s probably a good idea to explain the high-level premise of machine learning in the context of a tabular data set. It’s actually pretty simple: out of all the columns in the data set, there’s going to be one whose value you’d eventually like to predict, and there are going to be several others whose values become germane inputs and influencers on what the predicted values will be. The column whose value you will predict is often called the label; the columns relevant to the predicted values are often called features.

To flesh this out with an example, a data set of customer information, including buying and demographic information, might become the basis for a model. One column that indicated a customer’s spending level, using some discrete set of categories, might be the label. Other columns with demographic information – including age, gender, marital status, income, and whether the customer rents or owns her home – might be determined to be features. Later, when demographic data arrives for new or potential customers, you can use the model to predict the spending level of each one. This can be of immense help in running efficient marketing campaigns.
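As a toy illustration of this split (the column names and values below are hypothetical, not drawn from any real data set), a minimal Python sketch of separating features from the label might look like:

```python
# A toy customer record: demographic columns are candidate features,
# and "spending_level" is the label we would like to predict.
customers = [
    {"age": 34, "gender": "F", "marital_status": "married",
     "income": 72000, "owns_home": True, "spending_level": "high"},
    {"age": 22, "gender": "M", "marital_status": "single",
     "income": 31000, "owns_home": False, "spending_level": "low"},
]

LABEL = "spending_level"

def split_features_label(row):
    """Separate one record into its feature dict and its label value."""
    features = {k: v for k, v in row.items() if k != LABEL}
    return features, row[LABEL]

features, label = split_features_label(customers[0])
print(sorted(features))  # the feature column names
print(label)
```

Once trained, a model maps the feature values of a new row to a predicted label value.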


Today’s ML Workflow

In this section we’ll explain the typical machine learning workflow and how data scientists tend to execute it with their customary AI platform tools. From there we’ll explain how this workflow can be carried out more effectively using an AI platform like ICP4D in tandem with Datameer. This approach will be no less convenient than working just with the AI tools, and yet will deliver far more in the way of insightful data exploration, precision data engineering and straightforward data preparation.

Understanding this structure and this little bit of vocabulary can actually get you a long way toward understanding how machine learning works1. But understanding that much of the workflow in machine learning involves building a data set with data that’s clean, accurate, and focused to include just the feature columns (as well as the label column, of course) gives you the essential insight you need to understand how to do the best ML work possible.

What should now be clear with respect to the workflow is that it involves equal parts data exploration (profiling, querying and visualizing the data), preparation (cleansing, de-duplicating, and properly summarizing or aggregating the data), and feature engineering (determining which columns seem to be impactful to the value in the label column). And at a higher level, this work will need to be carried out repeatedly and iteratively, potentially across numerous data sets, with data that may overlap, but which are nonetheless distinct.
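To make two of the preparation steps above concrete, here is a minimal, hypothetical Python sketch of de-duplication and aggregation over a few toy rows (illustrative only; this is not how any particular platform implements them):

```python
from collections import defaultdict

# Toy (customer, purchase amount) rows, with one exact duplicate.
rows = [
    ("alice", 120.0),
    ("bob",    80.0),
    ("alice", 120.0),   # exact duplicate to be removed
    ("alice",  45.0),
]

# De-duplicate while preserving first-seen order.
seen, deduped = set(), []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

# Aggregate: total spend per customer.
totals = defaultdict(float)
for name, amount in deduped:
    totals[name] += amount

print(dict(totals))
```

Feature engineering then builds on cleaned, aggregated data like this to derive the columns the model will actually consume.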

Tooling

In the world of data science today, much of the work tends to happen inside a programming and data manipulation environment called a notebook. Notebooks are Web browser-based documents that contain a combination of text (in markdown format, specifically), code, and output generated by the code. One way to think of notebooks is as a combination of a wiki and a code editor. The code can be executed in place and the output – quite often the result set of a query – can be expressed as text, in tabular format.

There are a few different notebook standards out there, but Jupyter notebooks are the most prevalent, and ICP4D hosts them natively. Notebooks allow the output of various commands to be viewed inline, within the notebook.

1 There’s lots more, of course, including the numerous machine learning frameworks out there, the algorithms available in each, the groupings those algorithms fall into, the various parameters each framework’s algorithm supports, and the range of values those parameters can be set to.


Figure 1: A notebook with DataFrame schema and table list output

Given what we have discussed so far, it should be clear that notebooks in ICP4D can be used to explore, query, profile and transform data. But it’s also clear that in order to do this, data scientists must write the code, and understand the intricacies of connecting to data.

The latter point has two important consequences:

1. Only data scientists or data engineers – those with the right coding skills, an understanding of file and database connection information, and sufficient file and database permissions – can perform these tasks.

2. Even those professionals with sufficient skills and access must manually write code to perform these tasks, making the work error-prone and, in the best case, far less efficient than it could be.

With this in mind, a better, more efficient workflow would use a platform specifically designed to accelerate the difficult preparation and exploration tasks – finding the right data, exploring it to understand its relevance to the problem, shaping it in the right way, and engineering the features needed to make the ML model hum.

The data scientist can then use the notebook in ICP4D for creating the ML model, and the ML platform for training and testing it. This would minimize the amount of coding required. And, as we’ve discussed, once done with the ML modeling, one could also use the data exploration capabilities of the preparation platform to validate the model across any number of dimensions to ensure its accuracy.


Our Architecture

We can easily design an architecture that conforms to the more efficient workflow by combining Datameer and IBM Cloud Private for Data, with Apache Spark and Jupyter notebooks. This accommodates data science without being rocket science; it’s pretty straightforward. We’ll use:

• Datameer as our platform for data exploration, transformation and preparation

• ICP4D and Apache Spark as our ML platform, and notebooks as the coding environment

• Datameer as our tool for post-training model performance evaluation, by analyzing test output from the model built on Spark in ICP4D

Figure 2: An integrated machine learning architecture


A More Ideal ML Workflow

Our architecture enables a far more efficient – and repeatable – machine learning workflow. Data starts in Datameer, where it is explored and shaped, allowing for upfront, code-less preparation. Datameer simplifies exploration of the data for relevancy, and makes for easier feature engineering, to create the most appropriately shaped data set the first time around.

Datameer’s output is exported to ICP4D, where Spark is used to build, train and test an ML model. ML model test results are round-tripped back into the exploration platform for detailed model validation. And if the model’s accuracy is not satisfactory, the data can be further refined and the entire process repeated, creating a virtuous cycle.

Our workflow, then, has three phases: the preparatory workflow; model design, training and testing; and model performance validation. The steps involved in each phase are as follows:

Preparatory Workflow

• Profiling

• Exploration

• Enrichment

• Preparation

• Coarse analytics

Model Design, Training and Testing

• Creating a model

• Training and testing it

Model Performance Validation

• Round-tripping the results from the model testing in the ML environment back to Datameer

• Performing data exploration across all aspects and dimensions of the model to ensure accuracy on every front

Datameer’s capabilities are designed to help you understand your data better, shape it in the right manner, and explore it on a self-service basis. Many of these capabilities make Datameer a great platform for the machine learning preparatory workflow. We review some of those features here. In the next section we’ll apply them to a particular data set that we’ll then use to build a machine learning model.


Data Profiling

While the Python and R programming languages (and various machine learning frameworks) provide functions to help data scientists profile their data, Datameer provides that functionality in a visual format, requiring almost no effort at all.

Specifically, by using Datameer’s Inspector while examining workbook data in spreadsheet view, users can see vital profiling statistics (like the number of rows, number of distinct values, and the maximum, minimum and mean) for any column, just by selecting it. Here’s a closeup view of the Inspector in data profiling mode:

Figure 3: Profiling a single column in the Datameer Inspector


Users can also view data profiling information for the entire data set by going into Flip Sheet view.

Figure 4: Profiling an entire data set in Datameer‘s Flip Sheet view

And while it’s nice to have a histogram per column laid out in a single view as it is above, any one of those histograms can be zoomed to full-screen mode, simply by clicking on it:

Figure 5: Flip Sheet view histogram in full screen

Data profiling, especially using histograms that visualize the distribution of values in a column, is important functionality in the data science preparatory workflow. Knowing that a given column has a small number of values may provide a clue that its value can be predicted by a machine learning model. If a column has a large number of distinct values, or even a small number if they’re evenly distributed, then it may be important to predicting the value of another column. Datameer provides this data distribution information by default. No special effort is required.
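The statistics surfaced here – row count, distinct count, minimum, maximum and mean – are exactly what a data scientist would otherwise compute by hand in code. A minimal Python sketch over a hypothetical numeric column:

```python
from statistics import mean

# Hypothetical numeric column, e.g. customer ages.
column = [34, 22, 45, 34, 61, 29]

# The same per-column profile a visual inspector would display.
profile = {
    "rows":     len(column),
    "distinct": len(set(column)),
    "min":      min(column),
    "max":      max(column),
    "mean":     mean(column),
}
print(profile)
```

Having this computed implicitly, per column, is what saves the repeated write-run-debug cycle described above.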

Visual Exploration

Visual Explorer (VE) is Datameer’s patent-pending technology for performing visual analysis on large volumes of data, with amazingly fast interactive performance. This allows customers to see trends and patterns in the data, and to do so in an iterative fashion.

Figure 6: Visual Explorer showing the impact of values in one column on the value of another

While the above visualization is easily accessible and helpful, VE is about much more than quick and simple column charts. Users can easily create more sophisticated visualizations, get started very quickly and configure things iteratively.

For example, users need only select the columns they’re interested in and VE will automatically suggest particular visualizations, allowing the user to select one from a gallery-style user interface:

Figure 7: Visual Explorer‘s suggested visualizations based on the selection of two columns from a data set.


Once a user selects a suggested visualization, it is displayed in large format:

Figure 8: A Visual Explorer visualization in large-format view

From here, the user can easily tune and refine the visualization, changing the particular columns selected, and even the particular column values included. This can continue until the visualization is precisely configured according to the user’s needs. For example, the above visualization could be changed to display a multi-line chart, showing one plotted line for each of five selected values from the legend column:

Figure 9: Adding a legend/color column, as shown at upper-right, and filtering for specific values, at far left

The ability to do this on large volumes of data is very helpful in the creation of accurate data models, for which mere sampling of data must be avoided. The ability to do this quickly encourages users to investigate their data tenaciously, since they can ask question upon question with little if any waiting time.

VE enables users to achieve true, deep understanding of their data. And once satisfied with the visualization, users can generate a worksheet version of the same data, just by clicking “Create Summary Sheet”:


Figure 10: A summary sheet, created from the modified Visual Explorer output

The above data could then be further enriched with other columns, and the resulting data set used to train a machine learning model. Alternately, users can click “Create Details Sheet” to generate a sheet with all columns, rather than just those in the visualization.

Smart Analytics

Based on some of the same algorithms used in machine learning work, Smart Analytics is a Datameer feature that can help users understand the relationships between columns and/or the impact one column has on others. Users can easily build decision tree, column dependency or clustering models on their data to see how columns relate, and which values most strongly lead to specific outcomes – insights that are extremely important in the feature engineering process.

Figure 11: A Smart Analytics decision tree diagram


Easy to Follow Workflow

Beyond these exploration features, working with data in a spreadsheet and formula environment is highly congruent with the machine learning preparatory workflow. In Datameer, data is loaded into one sheet, then gradually transformed and summarized in successive sheets in the workbook.

This leaves the lineage of the data fully visible and discoverable. In effect, each tab of the workbook is a chapter in a story of the data’s evolution, essentially a presentation of the progressions in the analyst’s thought process in working with the data.

In some ways, the cells of a Datameer workbook are like the cells in a data scientist’s notebook. Each one allows a set of transformations on, and/or analyses of, the data. The difference is that notebooks have code, while a Datameer workbook has data rows and columns, with easily readable spreadsheet formulas as well as simple filters, sorts, joins and unions.

Rich Suite of Functions

Another major part of feature engineering is creating new columns from the source data to feed the right columns and values to the AI/ML model. Doing so requires a large, comprehensive suite of easy-to-apply functions that transform and aggregate data, as well as calculate new values based on statistical or calculation functions.

Datameer offers over 270 powerful yet simple-to-use functions to transform and massage the data into the right shape. These elements provide much of the same expressiveness and power as code, but as declarative functions instead of imperative, stepwise lines of code. They’re accessible to more authors, and more readable for a broader set of consumers.

These powerful spreadsheet-style formulas will come in handy when performing feature engineering. ML models typically require numerical, statistical and encoded values. We can apply transformation, calculation and statistical functions to the core data to produce new columns that provide highly tuned values to the ML model, in the right format.


Doing AI in Datameer and Spark

Notebooks are a great place to build and test models. Focusing the notebooks on this task greatly reduces coding effort. You’ll get the fastest workflow with the greatest reusability by combining the preparation and exploration platform with the machine learning platform. In this section we’d like to show you how that can be done. We’ll go through a real-world example with a well-known public data set of anonymized U.S. census data on personal income. We’ll describe how we might examine the data in Datameer, then build a machine learning model on Apache Spark, with Python code in a notebook.

We’ll also run some data through the model to test it, and we’ll bring a data set back into Datameer that includes that data, the model’s predictions, and the actual values for the column it tried to predict. Let’s start on the Datameer side.
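The validation idea at the heart of that round trip – comparing the model’s predictions against the actual label values in the returned data set – can be sketched in a few lines of Python (the rows and column names here are hypothetical):

```python
# Hypothetical scored rows round-tripped from the ML platform:
# each carries the actual income bracket and the model's prediction.
scored = [
    {"occupation": "Sales",    "actual": "<=50K", "predicted": "<=50K"},
    {"occupation": "Exec",     "actual": ">50K",  "predicted": ">50K"},
    {"occupation": "Clerical", "actual": ">50K",  "predicted": "<=50K"},
    {"occupation": "Sales",    "actual": "<=50K", "predicted": "<=50K"},
]

# Overall accuracy: fraction of rows the model predicted correctly.
correct = sum(r["actual"] == r["predicted"] for r in scored)
accuracy = correct / len(scored)
print(f"accuracy = {accuracy:.2f}")
```

In the workflow described here, the same comparison can be sliced by any other column (occupation, age band, and so on) to see where the model underperforms.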

Preparatory Workflow in Datameer

Earlier we mentioned that performing preparatory work in notebooks (or other code-based environments) can be less efficient than doing so in tools and platforms that are more geared to self-service users, and more based on direct work with the data rather than on code meant to manipulate the data.

In order to demonstrate the validity of that claim, let’s examine how many of these preparatory tasks could be carried out in Datameer. As we go through this explanation, you’ll see that the Datameer solution is more integrated, code-free and superior in visuals. You’ll see also that data transformation operations are at once richer and yet more declarative in nature. That may sound a bit abstract now, but the example will make it clearer.

First, let’s start with mere data acquisition, or ingest. Unlike the notebook case, where explicit code must be written to connect to a database or file, Datameer provides a file browser metaphor, similar in concept to Windows’ File Explorer or the Mac’s Finder. Data stewards, database administrators or personnel in a central analytics group can configure the population of the folder structure in the browser.

This means access to data in Datameer can be handled with a sophisticated division of labor. A specialist can define the connection to a database or file, and less technical business users can consume the connections on a self-service basis. That means the self-service users don’t need to know about server and database names, storage volumes, folder names or login credentials. On the other hand, self-service users who do have mastery of such information can build connections themselves, in addition to consuming connections built by others.

Connections can be passive links to remote databases, import jobs that bring data into the Datameer workbook, or file uploads that transfer data to Datameer from files contributed by users from their own storage media. Data can live on the Datameer server, in the database or in cloud storage.

A specialist can set up a file upload job from a URL pointing to the file:

Figure 12: Creating a URL-based file upload job pointing to our CSV file

A business user can then import the data from the file at that URL simply by selecting the file upload, and the data comes right in:

Figure 13: Viewing the census income data in a Datameer workbook

Note that the currently selected column (workclass, in the above case) is profiled in the Datameer Inspector, on the right-hand side of the screen.

Note that, unlike what has to happen in a data science notebook, not a single line of code has been written here. Data profiling is carried out by Datameer implicitly, and is available as a dedicated view, or contextually while viewing the data. This kind of profiling is critical to good data science work. It has to be done for virtually every experiment. Having it a click away makes far more sense than needing a data scientist to write, execute and debug code to do it each and every time it’s needed.

Looking at the workclass column, I notice there are some values of “?”. I also notice those same rows contain a value of “?” in the occupation column as well. Based on a hunch that the occupation column will be impactful on the prediction of income, I’ll remove those rows from the data set:

Figure 14: Filtering out rows with “?” in the workclass and occupation columns
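In plain Python terms, that filter amounts to the following (hypothetical rows for illustration; in Datameer the same thing happens through a code-free filter):

```python
# Hypothetical census rows; "?" is the placeholder for missing values.
rows = [
    {"workclass": "Private",   "occupation": "Sales"},
    {"workclass": "?",         "occupation": "?"},        # to be dropped
    {"workclass": "State-gov", "occupation": "Clerical"},
]

# Keep only rows where neither column holds the "?" placeholder.
cleaned = [r for r in rows
           if r["workclass"] != "?" and r["occupation"] != "?"]
print(len(cleaned))
```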

Let’s also delete the native-country column from the new sheet, since we don’t want to use it as a criterion in predicting income:

Figure 15: Deleting the native-country column from the data set

Next, I am going to use one of Datameer’s built-in algorithmic enrichment functions to explore relationships in the data. This can help with the feature engineering process. I’ll use the column dependencies algorithm to examine which fields might be indicators of income.


Figure 16: Exploring Column relationships using algorithms in Datameer

These tell me that education, relationship and marital status seem to be indicative of income.

Now I am going to perform some “one-hot encoding,” in which I take some categorical variables and convert them into a form that can be provided to our ML model to do a better job of prediction. I use a Datameer formula to convert the columns with categorical variables into individual columns, each representing one of the values, which can make this feature engineering part of the preparation faster and easier.

Figure 17: One Hot Encoding in Datameer
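For readers unfamiliar with the technique, one-hot encoding can be sketched in a few lines of Python (the column values here are hypothetical, and the actual Datameer formula is not reproduced):

```python
# A hypothetical categorical column.
marital_status = ["Married", "Single", "Married", "Divorced"]

# One new column per distinct category; 1 where the row matches, else 0.
categories = sorted(set(marital_status))
encoded = [
    {f"marital_status_{c}": int(v == c) for c in categories}
    for v in marital_status
]
print(encoded[0])
```

Each original value becomes a row of 0s with a single 1, which is the numeric form many ML algorithms expect for categorical inputs.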

Finally, let’s download the data from the shaped_data sheet. We’ll download the filtered data from the new sheet in the workbook as a CSV file to our local machine.


Figure 18: Downloading data from the filtered sheet in the workbook

Next, we can upload this file to our IBM Cloud Private for Data project:

Figure 19: Creating a new ICP4D data set from the downloaded worksheet data

Then, from a new Jupyter notebook in ICP4D, we can browse to our data set and generate the code necessary to open it.


Figure 20: Browsing to our data set and generating a Spark DataFrame from it

In subsequent code in the same notebook, we can use the data from the CSV file as training, test and validation data.

Model Design, Training and Testing in an ML platform

Now that our shaped data has been brought into the ICP4D environment, we can start building a model on it. To show you how easy this is, we’ve written additional PySpark code (Python code, running on Apache Spark) in the notebook that does just this.

We’ll now show you highlights of what we did there, including defining, training and testing the model, then pushing the “scored” data (the test data along with the predictions generated from it) back into the workbook as a new sheet.

Our first task is to read the data from the export job CSV file into a Spark DataFrame, called income:

## READ IN CENSUS INCOME DATA FROM CSV
income = spark.read.csv(path=census_data_file_loc, header=True, inferSchema=True)

Note that the variable census_data_file_loc, which would have been initialized previously, points to the exported CSV file. Note also that we set header=True so that Spark recognizes the same column names we set up back in the Datameer workbook; we also set inferSchema to True. Between these two settings, we avoid having to specify anything about our columns’ names and data types.

Once the data is loaded, we can use Spark’s randomSplit function to divide the income

DataFrame into two new DataFrames, trainPartition and testPartition, for training and

test data, respectively. We’ll set it up so that 75% of the data is used for training, with the

remaining 25% of the data set aside for testing the model once it’s been trained.


trainingFraction = 0.75; testingFraction = (1 - trainingFraction)
seed = 1234

# SPLIT SAMPLED DATAFRAME INTO TRAIN/TEST
trainPartition, testPartition = income.randomSplit([trainingFraction, testingFraction], seed=seed)

We’re getting closer, but Spark’s LogisticRegression machine learning algorithm (which

we’ll use to create the model) requires all our feature and label data to be numeric. That

poses a challenge since much of our data, for columns like Occupation and Marital_

Status, is text-based. Spark’s machine learning APIs help us out here, by giving us an

object called a StringIndexer, which can scan all the string data in a column, assign a

numeric ID to each unique value, then add a column to the DataFrame containing the

corresponding ID for each string value in the original column.
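To make the idea concrete, here is a minimal pure-Python sketch of what string indexing does. (Note that Spark's StringIndexer assigns IDs by value frequency by default; this sketch simply indexes values in order of first appearance, and the occupation values shown are illustrative.)

```python
# A minimal sketch of string indexing: assign a numeric ID to each unique
# string value, then map every value in the column to its ID.
def string_index(values):
    ids = {}
    for v in values:
        ids.setdefault(v, float(len(ids)))  # first-appearance order
    return [ids[v] for v in values]

occupation = ["Sales", "Tech-support", "Sales", "Exec-managerial"]
print(string_index(occupation))  # → [0.0, 1.0, 0.0, 2.0]
```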

We’ll need to create StringIndexer objects for the columns Workclass, Marital_Status, Occupation, Relationship, Race, Sex and Income_Bracket. We’ll then use a Spark Pipeline object to apply each of them to the partitioned DataFrames from the previous snippet, generating two new DataFrames (finalTrain and finalTest, for the training and test data respectively).

The new DataFrames include all the indexed columns, which we can then use as features in our ML model. We’ll name these indexed feature columns after their string-based source columns, with “_ind” appended as a suffix. We’ll name the indexed version of the Income_Bracket column “label” so that Spark’s LogisticRegression algorithm knows to treat it as our model’s label.

# Create the StringIndexer objects
sI0 = StringIndexer(inputCol="Workclass", outputCol="Workclass_ind")
sI1 = StringIndexer(inputCol="Marital_Status", outputCol="Marital_Status_ind")
sI2 = StringIndexer(inputCol="Occupation", outputCol="Occupation_ind")
sI3 = StringIndexer(inputCol="Relationship", outputCol="Relationship_ind")
sI4 = StringIndexer(inputCol="Race", outputCol="Race_ind")
sI5 = StringIndexer(inputCol="Sex", outputCol="Sex_ind")
sI6 = StringIndexer(inputCol="Income_Bracket", outputCol="label")

# Construct the pipeline from the StringIndexer objects
transformPipeline = Pipeline(stages=[sI0, sI1, sI2, sI3, sI4, sI5, sI6])

# Create the final training and test DataFrames
finalTrain = transformPipeline.fit(trainPartition).transform(trainPartition)
finalTest = transformPipeline.fit(testPartition).transform(testPartition)


With that work complete, it’s almost time to build our model. One more preparatory step that

remains is to use a Spark VectorAssembler object to add a new column (called “features”) to

the DataFrame. This new column collects the values for all the feature columns into a list of

sorts, called a Vector. This step is necessary as the Spark LogisticRegression object, which

we’ll use to build our model, requires features to be vectorized.
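Conceptually, the assembly step just gathers each row's feature values into a single list. Here is a pure-Python sketch of that idea (the rows and column names are hypothetical, not the actual census columns):

```python
# Sketch of feature assembly: collect the values of the listed feature
# columns into one "features" list per row, mimicking what VectorAssembler
# does when it builds a Vector column.
def assemble(rows, feature_cols, output_col="features"):
    return [dict(row, **{output_col: [row[c] for c in feature_cols]}) for row in rows]

rows = [{"Age": 39, "Education_Num": 13, "Sex_ind": 0.0}]
out = assemble(rows, ["Age", "Education_Num", "Sex_ind"])
print(out[0]["features"])  # → [39, 13, 0.0]
```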

Our features consist of six of the seven indexed columns we created with the StringIndexer objects, as well as the Age, Fnlwgt, Education_Num, Capital_Gain, Capital_Loss and Hours_Per_Week columns. We’ll list those using the VectorAssembler’s setInputCols method, and we’ll pass the value “features” to setOutputCol so that the new vectorized column takes that name.

# Vectorize all numeric feature columns into a new column called "features"
featuresAssembler = (VectorAssembler()
    .setInputCols(['Age', 'Workclass_ind', 'Fnlwgt', 'Education_Num',
                   'Marital_Status_ind', 'Occupation_ind', 'Relationship_ind',
                   'Race_ind', 'Sex_ind', 'Capital_Gain', 'Capital_Loss',
                   'Hours_Per_Week'])
    .setOutputCol("features"))

With our vectorized feature values in a column called “features” and our label in a column called “label”, we are ready to define and train the model! We’ll create a Pipeline object, passing it the VectorAssembler object and a LogisticRegression object. Then we’ll generate a model by calling the Pipeline’s fit method, passing it the finalTrain DataFrame we created previously.

# Initialize the LogisticRegression object
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Create a pipeline from the VectorAssembler and LogisticRegression objects
pipeline = Pipeline(stages=[featuresAssembler, lr])

# Create the model by fitting the pipeline to the finalTrain DataFrame
model = pipeline.fit(finalTrain)

Note the settings for maxIter and regParam are somewhat arbitrary and beyond the

scope of this paper.

With the model built, we can now run our test data through it. This will generate a new

DataFrame that includes columns called “prediction” and “probability”. We can save that

data to a Spark SQL temp table, then select a subset of its columns and overwrite the

DataFrame with the result:


# Evaluate the model with the finalTest DataFrame
predictions = model.transform(finalTest)
predictions.createOrReplaceTempView("tmp_predictions")

# Remove indexed columns from the predictions DataFrame and add a "correct" Boolean column
sqlStatement = "SELECT Age, Workclass, Fnlwgt, Education, Marital_Status, \
                       Occupation, Relationship, Race, Sex, \
                       Capital_Gain, Capital_Loss, Hours_Per_Week, Income_Bracket, \
                       label, prediction, \
                       label = prediction AS correct, probability \
                FROM tmp_predictions"
predictions = spark.sql(sqlStatement)

Finally, we can write out the contents of the temp table to a CSV file, suitable for pushing back into our Datameer workbook. The file will be saved in a folder called PredictionData, which is itself located in the folder specified by the dataDir variable, initialized previously. This is shown in the code below:

predictionsFilename = dataDir + "PredictionData"
predictions.write.mode("overwrite").option("header", "true").csv(predictionsFilename)

Round-Tripping: Bringing Test Results Back into Datameer for Evaluation of Accuracy and Analysis

With our model now trained and tested, we can examine the prediction data back in Datameer. To get the data back in, we download the generated CSV file and then create a File Upload job in Datameer to push it back into the workbook. The end result is shown below:

Figure 21: Viewing the prediction data back in the workbook


Note the presence of “label,” the indexed version of the Income_Bracket column, as well

as the presence of the “prediction” column, both at the far right. If you look carefully,

you’ll see that the prediction value is often, but not always, the same as the label. The

percentage of the time that the label and prediction value are the same will give us some

insight into the accuracy of the model we’ve built.

Determining that accuracy is easy enough. To start with, we could duplicate the sheet and add a column to the new sheet, called “correct,” that returns the value 1 when the label and prediction are the same and 0 when they’re not:

Figure 22: Defining the “correct” column

We could then create a new sheet, grouped on the boolean value True (so that the sheet

contains a single group that includes all rows), then use the GROUPSUM function on the

Correct column to get the number of correct predictions, the GROUPCOUNT function to

get the total number of rows, and add a calculated column to determine the percentage

of correct predictions.

Figure 23: Determining ML model accuracy

As we can see from our analysis of the prediction data back in Datameer, the model we

built in Spark appears to be about 82% accurate. Not bad!
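The arithmetic behind those sheet functions can be sketched in plain Python; the (label, prediction) pairs below are hypothetical illustration data, not rows from the actual scored census set:

```python
# Mirror the workbook's logic: a "correct" flag per row, then
# GROUPSUM(correct) / GROUPCOUNT() to get the accuracy percentage.
rows = [
    (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0),
    (0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0),
]
correct = [1 if label == prediction else 0 for label, prediction in rows]
accuracy = sum(correct) / len(correct)  # GROUPSUM / GROUPCOUNT
print(f"{accuracy:.0%}")  # → 75%
```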

Returning to the Visual Explorer allows us, for example, to examine the accuracy of the model across genders. Doing so reveals that our model seems to be more accurate for female subjects than male subjects:


Figure 24: Breakdown of correct and incorrect predictions by gender

Specifically, what we find is that while 91% of predictions for female subjects are correct, only 78% of predictions for male subjects are correct. This insight illustrates how effective using an analytics tool like Datameer in tandem with a machine learning platform can be: after training and testing the model on the Spark platform in Python, we’re able to send the scored data to Datameer and analyze patterns in the model’s accuracy.
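The same kind of per-group accuracy breakdown can be sketched in plain Python (again with hypothetical rows rather than the real scored data):

```python
from collections import defaultdict

# Tally correct and total predictions per group, mimicking a grouped
# accuracy breakdown by the Sex column.
rows = [
    ("Female", 0.0, 0.0), ("Female", 1.0, 1.0), ("Female", 0.0, 0.0),
    ("Male", 0.0, 0.0), ("Male", 1.0, 0.0), ("Male", 0.0, 1.0), ("Male", 1.0, 1.0),
]
totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
for sex, label, prediction in rows:
    totals[sex][0] += int(label == prediction)
    totals[sex][1] += 1

for sex, (ok, n) in sorted(totals.items()):
    print(f"{sex}: {ok / n:.0%} correct")  # Female: 100%, Male: 50%
```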


Summing up

There’s more we could do here, of course, but the above validation example, as well as the entire workflow described in this report, demonstrates the complementary nature of Datameer and an AI platform like IBM Cloud Private for Data, with Apache Spark and Jupyter notebooks. We’ve shown the full round-trip flow, and examined how each platform can be used for its strong suits, avoiding inefficiencies on either side.

Data analysis and ML experimentation are overlapping, synergistic disciplines. While data science tools can be used for both, analytics tools are better suited to the phases preceding and following model design, training and testing, and using them there will lead to better results. Tools focused on preparation and analysis are key assets in successful ML work, both in support of building ML models and in evaluating them.

Data work exists along a phased continuum. Use the right tool for the right job, leveraging in each phase the components that excel at the tasks involved. Doing so will benefit the outcome of that phase and the process overall. The efficient workflow we’ve described in this report will ensure high-quality machine learning work and enhance the success of that work’s outcomes.


About Datameer

Datameer is an analytics lifecycle platform that helps enterprises unlock all their raw data. The cloud-native platform was built for the complexity of large enterprises, yet it’s so easy to use that everyone from business analysts to data scientists to data architects can collaborate on a centralized view of all their data. Without writing any code, teams can rapidly integrate, transform, discover, and operationalize datasets for their projects. Datameer breaks down data silos, gets companies ahead of their data demands, and empowers everyone to discover insights. Datameer works with customers in every industry, including Dell, Vodaphone, Citibank, UPS, and more. Learn more at www.datameer.com.