
Choosing a Data Visualization Tool for Data Scientists Report




Choosing Data Visualization Tools for Data Scientists
By: Heather R. Gilley

Introduction
Part of becoming an operational business intelligence (BI) office is being able to communicate the key insights derived from data acquisition and analysis. To effectively communicate these insights, the right data scientist, product, and tool need to be paired together for the task. Currently, the BI office is staffed with data scientists who regularly receive requests for data visualizations such as reports, dashboards, and analytical updates. The challenge they face now is to choose the right tool or tools. Using the approach outlined below, a decision analysis was conducted to determine how each data visualization tool alternative scored against the objectives.

Method
1. Identify strategic objectives for choosing a data visualization tool by eliciting the decision maker and referencing key documents
2. Develop data scientist profiles to identify necessary tool features that support the various skill sets associated with data scientists
3. Identify the product features that need to be supported by the data visualization tool
4. Align product features and data scientist skill sets to the functional objectives
5. Construct performance measures that accurately gauge the functional objectives
6. Determine the importance of each objective and measure
7. Analyze results for alternatives and conduct sensitivity analysis of the measures

To choose the right data visualization tool, data scientist skills and product features need to be identified and incorporated into the model. Data scientists have varying skill sets, and different products share key features that the data visualization tool must be able to support. These skills and features are built into the decision analysis model as part of the requirements. To develop the functional objectives and measures, key strategic documents, job position requirements, market research, and discussions with the decision maker were used to identify data visualization tool features that align to the data scientist profiles and product features.i



Key Product Features
The business intelligence office is often tasked with creating data visualizations to communicate analytical results. These data visualization products vary depending on the customer, function, and requirements, but each product falls somewhere on a spectrum from interactive to static and from explanatory to exploratory. An interactive product offers the audience the chance to view individual pieces of data on the chart and filter or sort their view to find more insights, while a static product conveys a single message in one image. An explanatory product provides the audience with a story that leads the user to the final results, while an exploratory data visualization provides the audience with a product that is meant to be analyzed for multiple storylines. The following graph illustrates how some of the most commonly requested products fall on the spectrum.


Figure 1: Product Scope (interactive charts, dashboards, reports, and infographics positioned along the static-to-interactive and explanatory-to-exploratory axes)


Data Scientist Profiles
While the products are one part of the decision model, the decision centers on choosing the right tool for the data scientist. The purpose of the profiles is to identify the tool features that will best enable the data scientists' skill sets. Many articles recognize that there are a variety of data scientist types and skill sets1. The following data scientist profiles were developed based on market research, current employees, and organizational requirements.

Alternatives
There are many options for data visualization tools, and each one seems to serve a separate purpose. The business intelligence office has identified six alternatives for data visualization tools. Currently, the team has temporary licenses for all of the alternatives in order to test the tools' capabilities against their datasets. The client is not opposed to choosing more than one alternative, depending on the analytical results of the decision model. For detailed information on each alternative, refer to the Alternatives tab of the Data Visualization Tool Decision Model Excel workbook. A brief code sketch after the list illustrates how the code-driven alternatives are used.

Data Visualization Tool Alternatives:
D3.js: A JavaScript library that enables developers to create complex, custom data visualizations on the web
RShiny: An R library and server that enable R data visualizations to be interactive and available via an HTML framework
Bokeh: A data visualization library for Python that creates D3-style interactive charts from Python data
Plot.ly: A web application that automatically creates visualizations from a variety of file types and programming languages
Tableau: A data visualization tool that offers an easy-to-use interface for creating complex graphics and charts
Kibana: An open source data visualization and dashboarding tool that connects to the NoSQL database Elasticsearch
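Several of these alternatives (D3.js, RShiny, Bokeh, Plot.ly) are driven from code rather than from a point-and-click interface. As a minimal, illustrative sketch of what that looks like in practice, the following uses Plot.ly's open-source Python library with its bundled iris sample dataset; the dataset and chart are assumptions for illustration only, not one of the BI office's products.

```python
# Minimal sketch: building an interactive, web-ready chart from code with
# Plotly's Python library, using the small iris sample dataset it ships with.
import plotly.express as px

df = px.data.iris()  # bundled sample data, used here purely for illustration
fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    title="Iris measurements (illustrative only)",
)
fig.write_html("iris_scatter.html")  # writes a self-contained interactive HTML page
```

GUI-driven tools such as Tableau and Kibana produce comparable interactive output through their interfaces instead of code, which is why the GUI and programming capability measures defined later treat these approaches separately.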

Strategic Objective
The strategic objective was formed using the documentation for the BI program, VANDL. The goal of VANDL is to develop data science people, skills, and tools for the intelligence community. Part of this goal includes the development of tools for their data storage, analytics, and visualization suite. These tools have their own strategic objective to be considered: to design for usability, extensibility, scalability, and affordability. The current focus is the visualization tool suite. Using the organization's documentation and conversations with the decision maker, the following strategic objective was identified for choosing data visualization tools:

Choose a tool or tools that enable data scientists to manipulate, analyze, interpret, and visualize data.

1 Top skill sets for data scientists and Analyzing the Analyzers


Developer Data Scientist
Features:
1. Knowledgeable in programming, computer science, and databases
2. Creates connections between the data and the tools
3. Transforms data to enable the other two profiles to perform analysis and communicate results
4. Creates highly customized interactive solutions

Mathematical & Statistician Data Scientist
Features:
1. Knowledgeable about complex statistical modeling and analysis (e.g., customer opinion modeling, classification, text analysis, natural language processing)
2. Builds, tests, and analyzes models using statistical programming languages such as Python and R
3. Uses built-in tools and statistical programming language libraries to build visualizations

Domain Data Scientist
Features:
1. Knowledgeable about the subject matter and able to add context to the analysis for insightful findings
2. Conducts general analysis (regression, correlation, frequency distributions)
3. Uses built-in tools for analysis

Figure 2: Data Scientist Profiles



Functional Objectives
Functional objectives are specific and measurable parts of the strategic objective. Since the products and data scientists are 'the who' and 'the what' that determine which tool is chosen, those components are incorporated into the functional objectives. The following table outlines and defines the functional objectives.

Table 1: Functional Objective Definitions

Be flexible enough to accommodate different product types: The data visualizations created fall into one of four categories (dashboards, reports, charts, or infographics). Each product has different requirements that will be captured in the measures.

Enables statistical analysis and discovery: It is easier to recognize patterns and identify important insights when data scientists are able to visually analyze the data. In addition, being able to visually represent analysis plays a key role in identifying and communicating analytical insight.

Enables highly customized solutions: Some solutions need more advanced data visualizations; a tool that goes beyond basic bar, line, and pie charts lets the data scientists create visualizations that meet those needs.

High usability: Not everyone has the skill set to code solutions. Tools with advanced, intuitive GUIs enable data scientists to quickly create data visualizations.

Scales with big data projects: The customer experience business intelligence office has a large, rapidly growing data set; the selected tool must be able to scale with the incoming data.

Measures
Measures were created to gauge how well an alternative scores against a functional objective and ultimately the strategic objective. These measures were created by reviewing existing documentation and creating an affinity diagram to visually map objectives and measures. The scale defines how the measure is gauged, and the range determines the scope for the scores. Measures gauged using a Likert scale are qualitative; their scores were determined by interviewing the data scientists who have tested the alternative data visualization tools and by eliciting the decision maker whenever possible. These measures were determined to be independent of each other. The following table defines the measures, their units, and their scale.

Table 2: Measure Definition and Scale

Measure | Description | Scale | Range
Analytical Capability | Level of analysis built into the user interface | Levels defined by the Likert scale | Analytical Capability
Charting Capability | Charting capability allows the user to create complex charts | Levels defined by the Likert scale | Charting Capability
Programming Capability | Programming capability allows the user to customize the product's appearance and functionality | Levels defined by the Likert scale | Programming Capability
Design Capability | Capability to change the appearance of the product | Levels defined by the Likert scale | Design Capability
Number of Supported Programming Languages | The number of programming languages the tool is able to process | Count of the programming languages the tool is able to support | Number of Supported Programming Languages
GUI | Tools with user interfaces vs. tools with an interactive development environment | Levels defined by the Likert scale | GUI
Interactive Product Capability | How well the tool enables products to be interactive | Levels defined by the Likert scale | Interactive Product Capability
Number of Supported File Types | The number of file types that the tool allows to be imported and exported | Levels defined by the Likert scale | Number of Supported File Types
Data Connectors | The number of data sources the tool can use | Count of features that allow the tool to connect to different data sources | Data Connectors
Access Control | The layers of user access control that can be applied to the products and the data behind the products | Levels defined by the Likert scale | Access Control
Cost | The yearly total cost per user to keep a tool | Total cost per user per year | Cost
Data Size | The quantity of data the tool is able to ingest and chart; the exact amount varies across datasets, but different tools scale to different levels | Levels defined by the Likert scale | 1 to 5
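For scoring purposes, each measure can be thought of as a small record that captures whether it is gauged qualitatively (Likert levels) or quantitatively (a count or dollar amount) and what range it spans. The sketch below shows one such representation; the field names are assumptions for illustration, not the structure of the actual decision model workbook.

```python
# Illustrative representation of a measure: its scale type and its worst/best levels.
# Field names are assumptions, not taken from the decision model workbook.
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    scale: str    # "likert", "count", or "dollars"
    worst: float  # least preferred level
    best: float   # most preferred level

measures = [
    Measure("Charting Capability", "likert", 1, 5),
    Measure("Data Connectors", "count", 2, 40),
    Measure("Cost", "dollars", 1999, 0),  # lower cost is better, so the best level is 0
]
```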

Analytical Approach
In order to evaluate the alternatives, measures were applied to the functional objectives. These measures were identified as indicators of the functional objectives because they overlap with the features necessary to accommodate the different data scientist skill sets and the different data visualization requirements. A card sort activity was conducted to ensure the data scientist profiles and product requirements aligned with the functional objectives and measures.

Mapping Objectives and Measures to Data Scientist Profiles
Functional objectives and measures were created using the features of the data scientist profiles. Some measures, such as the number of supported programming languages, were identified as cross-profile requirements that help the tool be flexible enough to accommodate different product types. The ability to accommodate different product types takes into consideration that data scientists use different skills to support the same products. The following diagram indicates how the profiles align to the functional objectives and measures.




Figure 3: How Data Scientist Profiles Align to Functional Objectives


Decision Model Structure
After identifying the strategic objective, the functional objectives, and the measures, the decision model for choosing a data visualization tool or tools can be depicted in the following diagram:


Figure 4: Decision Model Hierarchy


Scoring the Alternatives
Once the model was defined, the alternatives were evaluated and scored against the independent measures. Information on the alternatives was gathered through independent research and through feedback from the BI data scientists testing the alternatives. Some of the measures were identified as more subjective; to create consistency between scores, these measures were scored on a Likert scale, with 1 being 'does not have capability' and 5 being 'capability highly exceeds expectations'. The remaining measures could be quantified by either a count or a dollar amount.

Late in the development of the decision model, it was identified that more in-depth information on the alternatives was available through commercial research conducted by In-Q-Tel. This company identifies, adapts, and delivers innovative technological solutions to the intelligence community and is currently conducting research on data visualization tools for data scientists. After this discovery, the decision maker determined that this information will be incorporated into the second phase of the decision model, along with any other identified improvements.

Determining Weights
To determine the weights for the measures, the swing weight method was applied. The first step in determining swing weights is to identify the best and worst levels that could exist for each measure. The next step was to elicit from the decision maker how the measures should be ranked. During this time the decision maker was unavailable, so additional team members were consulted to determine how to rank each measure. Finally, the weights were calculated from the identified ranks. The following table shows the worst and best levels for each measure and the corresponding weights; a short computation sketch follows the table.

Table 3: Swing Weights

Measure | Worst | Best | Rank Weight | Weight
Interactive Product Capability (IP) | 1 | 5 | 1.00 | 0.132
Analytical Capability (AN) | 0 | 5 | 0.95 | 0.126
Charting Capability (CH) | 1 | 5 | 0.85 | 0.113
Data Size (DS) | 1 | 5 | 0.80 | 0.106
Number of Supported Programming Languages (PL) | 0 | 4 | 0.75 | 0.099
Data Connectors (DC) | 2 | 40 | 0.70 | 0.093
Access Control (AC) | 0 | 5 | 0.60 | 0.079
Programming Capability (PC) | 1 | 5 | 0.55 | 0.073
GUI (G) | 1 | 5 | 0.45 | 0.060
Design Capability (DE) | 1 | 5 | 0.40 | 0.053
Number of Supported File Types (FT) | 1 | 5 | 0.30 | 0.040
Cost (C) | 1999 | 0 | 0.20 | 0.026

Total rank weight = 7.55. Each weight is the measure's rank weight divided by the total (for example, Weight_IP = 1.00 / 7.55 = 0.1325, shown as 0.132).
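As a sanity check on Table 3, the normalized weights follow directly from the rank weights. The short sketch below reproduces the calculation in Python using only the values from the table:

```python
# Swing weight normalization: weight = rank weight / total rank weight.
# Rank weights are taken directly from Table 3.
rank_weights = {
    "IP": 1.00, "AN": 0.95, "CH": 0.85, "DS": 0.80, "PL": 0.75, "DC": 0.70,
    "AC": 0.60, "PC": 0.55, "G": 0.45, "DE": 0.40, "FT": 0.30, "C": 0.20,
}

total = sum(rank_weights.values())                 # 7.55
weights = {m: rw / total for m, rw in rank_weights.items()}

print(f"total rank weight = {total:.2f}")          # 7.55
print(f"Weight_IP = {weights['IP']:.4f}")          # 0.1325, shown as 0.132 in Table 3
```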



Analysis and Computation
Once the building blocks of the decision model were established, the model was built in Excel and Logical Decisions for Windows. Logical Decisions for Windows was used to build the model for calculating the subjective goal of choosing a data visualization tool that enables data scientists. Excel was used to calculate the results for each data scientist type. The following chart shows the ranked results for each alternative and how the alternatives score against the functional objectives.

Figure 5: Alternatives Ranked by Goal: Choose Data Visualization Tool

From the results we can see that some alternatives have very similar scores: Tableau and Plot.ly, and Bokeh and RShiny. The following sections highlight the differences between these pairs and the tradeoffs of choosing one tool over another.
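Each alternative's overall score is an additive weighted value: the raw score on each measure is rescaled between that measure's worst and best levels and multiplied by its swing weight. The sketch below illustrates the calculation under an assumed linear rescaling; Logical Decisions for Windows supports other value functions, and the raw scores here are illustrative placeholders rather than the model's actual data.

```python
# Additive value model sketch: overall score = sum of weight * rescaled measure score.
# Linear rescaling between worst and best levels is an assumption for illustration.

def rescale(raw, worst, best):
    """Map a raw measure score onto 0-1, with the worst level at 0 and the best at 1."""
    return (raw - worst) / (best - worst)

def overall_score(raw_scores, weights, ranges):
    return sum(w * rescale(raw_scores[m], *ranges[m]) for m, w in weights.items())

# Illustrative placeholder inputs: two of the twelve measures, made-up raw scores.
weights = {"IP": 0.132, "C": 0.026}
ranges = {"IP": (1, 5), "C": (1999, 0)}   # cost improves as it falls, so worst=1999, best=0
raw = {"IP": 4, "C": 500}

print(round(overall_score(raw, weights, ranges), 3))  # 0.118 with these placeholder numbers
```

With all twelve measures and the scores recorded in the model, the same pattern of calculation underlies the rankings shown in Figure 5.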

Comparing Alternatives
Plot.ly vs. Tableau

Figure 6: Plot.ly vs. Tableau Tornado Diagram

Plot.ly and Tableau scored very similarly. Both tools are capable of creating products within the Business Intelligence Office's scope and provide a platform for data scientists to explore various datasets, but with different tradeoffs. Plot.ly allows data scientists to use multiple data manipulation tools such as Python, R, and Excel to create advanced visualizations and conduct advanced analytics in a collaborative setting. Tableau requires each data scientist to learn its own spreadsheet-style language rather than using the skill sets they already possess. This is an advantage for Plot.ly, as it allows data scientists with disparate skill sets to collaboratively use the same tool. However, Tableau is able to connect to a larger number of data sources and is able to process datasets that qualify as "big data". Since the government is one of the largest producers of data, this is an important requirement to consider.

RShiny vs. Bokeh

Figure 7: RShiny vs. Bokeh Tornado Diagram

Bokeh and RShiny had the same score, but the diagram above shows the tradeoffs of choosing one tool over the other. As part of the R ecosystem, RShiny is supported by a multitude of statistical programming libraries. The RShiny package also includes RShiny Server, which is able to connect to many different data sources. However, RShiny requires the user to implement a CSS file to change the styles. Bokeh allows the data scientist to use design options to enhance products and improve communication, and it is also supported by multiple statistical programming libraries, though not as many as R.



Alternative Results for Data Scientist Profiles
Each of the data scientist profiles has corresponding functional objectives, as outlined in the Data Scientist Profiles section, used to choose the best tool for each skill set. The following sections outline the results for each alternative as it relates to the data scientist profiles.

Domain Data Scientist

Figure 8: Alternatives Ranked by Domain Data Scientist Profile

The domain data scientist is focused on creating different customized product types with a usable tool. Tableau and Plot.ly both scored highly with the domain data scientist. These alternatives offer intuitive user interfaces that allow a data scientist to quickly create highly interactive charts that can be used for communications or analysis. While Tableau offers more design capabilities, Plot.ly’s ability to support multiple programming languages enables domain data scientists to collaborate with other data scientists more easily.



Mathematical & Statistician Data Scientist

Figure 9: Alternatives Ranked by Mathematical & Statistician Data Scientist Profile

The mathematical & statistician data scientist is concerned with being able to conduct more complex statistical analysis on large datasets and being able to communicate those results. Plot.ly is a fairly new technology that is still developing its capabilities and is currently unable to handle datasets that qualify as "big data". Plot.ly intends to expand its ability to ingest and process large data sets; however, Tableau currently has that capability built into its software.

Developer Data Scientist

Figure 10: Alternatives Ranked by Developer Data Scientist Profile

The developer data scientist is responsible for acquiring and transforming the data into a dataset that is usable for other data scientists; therefore, they are more concerned with scalability and customizability. As noted in the mathematical & statistician data scientist profile, Tableau is the best tool available for scaling with the data quantity.



Sensitivity Analysis
The results of the decision model are more sensitive to some measures than others. The following chart shows the results of the sensitivity analysis for the different measures in the decision model; a short recomputation sketch follows the chart.

[Figure: One-way sensitivity analysis for the goal "Choose a Data Visualization Tool". For each measure (Interactive Product Capability, Analytical Capability, Charting Capability, Programming Capability, Number of Supported Programming Languages, Data Connectors, Access Control, Cost, GUI, Design Capability, Number of Supported File Types, Data Size), the chart plots the overall score when that measure swings from its worst level to its best level, with scores ranging from roughly 0.38 to 0.62.]
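The chart above was produced in Logical Decisions for Windows. As an illustration of the underlying idea, the sketch below recomputes the overall score while swinging one measure at a time from its worst to its best level; this is an assumed recomputation with placeholder data, not the tool's exact routine or the model's actual scores.

```python
# One-way sensitivity sketch: hold all measures at their scored levels except one,
# swing that measure from its worst to its best level, and record the score range.

def rescale(raw, worst, best):
    return (raw - worst) / (best - worst)

def overall_score(raw_scores, weights, ranges):
    return sum(w * rescale(raw_scores[m], *ranges[m]) for m, w in weights.items())

def one_way_sensitivity(raw_scores, weights, ranges):
    swings = {}
    for m in weights:
        at_worst = overall_score({**raw_scores, m: ranges[m][0]}, weights, ranges)
        at_best = overall_score({**raw_scores, m: ranges[m][1]}, weights, ranges)
        swings[m] = (round(at_worst, 3), round(at_best, 3))
    # Widest swings first: these are the measures the result is most sensitive to.
    return dict(sorted(swings.items(), key=lambda kv: kv[1][0] - kv[1][1]))

# Illustrative placeholder data for three of the twelve measures.
weights = {"IP": 0.132, "AN": 0.126, "C": 0.026}
ranges = {"IP": (1, 5), "AN": (0, 5), "C": (1999, 0)}
raw = {"IP": 3, "AN": 4, "C": 999}
print(one_way_sensitivity(raw, weights, ranges))
```

Measures whose swing moves the overall score the most dominate the chart and therefore deserve the most scrutiny when the weights are revisited with the decision maker.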



Conclusion
Consistently, Tableau and Plot.ly emerge as highly ranked alternatives. Tableau was chosen as the best option for the overall objective, the mathematical & statistician data scientist profile, and the developer data scientist profile, while Plot.ly was chosen as the best option for the domain data scientist profile. These options have different tradeoffs depending on the data scientist's needs and the product requirements. The data scientist profiles lean toward Tableau; to gain more granular insight into the best tool for specific product requirements, the model needs to be refined further. The model is still fairly high-level and is currently under review by the decision maker to gain that level of granularity.

This decision model was formed by eliciting the project team members and referring to the project's key strategic documents. Ideally, the decision maker would have been elicited consistently throughout the process; however, he was absent due to a family emergency for the majority of the project. The decision maker recently returned to the project and is currently reviewing the results of the analysis. His feedback will be incorporated into the future model along with any other identified changes.

During the review process, additional resources were identified for refining the decision model. In-Q-Tel conducted an in-depth study of over 50 data visualization tools with numerous attributes identified, and the decision maker provided a document with quantifiable measures for dashboard product requirements. The study and the measures will be reviewed to determine whether they need to be incorporated into the advanced decision model or whether the results of the current study are enough to drive a decision.



i Note: During the analysis process, the decision maker suddenly needed to be absent for an extended period of time due to a family emergency. The decision maker returned towards the end of the initiative and has identified areas for further analysis.