Data Preparation and Refining - For IBM Watson Solutions and Services

Rob Burton

Cognitive Solution Design & Intelligent Process Automation

Natural Resources Solution Center, IBM Canada

Email: [email protected]

Twitter: @robburton13

LinkedIn: https://www.linkedin.com/in/rgburton

DAMA - Data Management Association, Calgary Chapter

June 21, 2018

IBM Copyright ©2018

About Me!

Rob Burton

Cognitive Solutions Specialist, IBM Canada

Education: Industrial Robotics, Kwantlen Polytechnic University, and Computer Science, Douglas College

Former: 10 years of information and technology architecture with IBM partner organizations; 15 years of business architecture and process automation with Pemberton Energy, Syncrude, WestJet, CP Rail, and TransCanada Pipelines.

Current: 3 years with IBM, business process automation and cognitive solution design for the Canadian natural resources industry.

Speaking Engagements: over 10 years, including SAPinsight, IBM WoW, University of Calgary, Mount Royal University, CIPS, COSIA, IIA, OpEx, CGDMS, DAMA, and over 20 client events.

Disclaimer:

The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM's current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

Objective of this Presentation

• Overview of Challenges Faced by the Data Scientist

• Provide Solutions to the Challenges

• Approach to Successful Data Preparation and Refining

Top 5 Challenges Faced by Data Scientists*

1. Identifying the Problem

2. Access to the Right Data

3. Data Cleansing

4. Lack of Domain Expertise

5. Data Security Issues

“In today’s business arena, data scientists are deemed as someone having superhuman powers.” – Gary Cokins, Opinion: Imagine data scientists with superpowers, Information Management, Source Media

*https://www.proschoolonline.com/blog/challenges-faced-by-data-scientists/

Watson Studio

Build and train AI models, and prepare and analyze data, all in one integrated environment.

Challenge #1

IDENTIFYING the PROBLEM:

• One of the major steps in analyzing a problem and designing a solution is to first understand the problem properly and define each aspect of it.

• Data scientists often opt for a mechanical approach and start working on data sets and tools without a clear definition of the business problem or the client requirement.

SOLUTION: There must be a well-defined workflow before beginning the actual data analysis work. The first step in this process is to identify the problem well, then design a solution, build a checklist to tick off important steps, and finally analyze the results.

Challenge #2

ACCESS to the RIGHT DATA:

• For the right analysis, it is very important to lay your hands on the right kind of data.

• Gaining access to a variety of data sources in the most appropriate format is both difficult and time-consuming.

• Issues can range from hidden data to an insufficient volume of data or too little variety in the kind of data.

• Data could be spread unevenly across various lines of business, so getting permission to access that data can also pose a challenge.

SOLUTION: Data Scientists need to master data management systems and other information integration tools, such as stream analytics software, which is useful for filtering and aggregating data. Many data integration applications allow connections to external data sources and their seamless inclusion in the workflow.

Watson Discovery

Unlock hidden value in data to find answers, monitor trends and surface patterns.

Challenge #3

DATA CLEANSING:

• Working with datasets full of inconsistencies and anomalies is every data scientist’s nightmare.

• Dirty data leads to dirty results.

• Data scientists work with terabytes of data; imagine their plight when they have to spend a huge amount of time just sanitizing the data before even beginning the analysis.

SOLUTION: Data Scientists must make use of Data Governance tools for overall accuracy, consistency and formatting of data. Additionally, maintaining data quality should be everybody’s goal. Business functions across the enterprise benefit from good quality data; bad data quality is an enterprise issue. There must be people employed in various departments as data quality managers.

Watson Knowledge Catalog

Intelligent data and analytic asset discovery, cataloging and governance to fuel AI apps.

Challenge #4

LACK of DOMAIN EXPERTISE:

• One of the biggest misconceptions doing the rounds is that data scientists only need to be skilled at high-end tools and mechanisms.

• Data Scientists also need to have sound domain knowledge and gain subject matter expertise.

• One of the biggest challenges faced by data scientists is to apply domain knowledge to business solutions.

• Data scientists are a bridge between the IT department and the top management.

• Domain expertise is required to convey the needs of management to the IT department and vice versa.

SOLUTION: Data scientists need to work on gaining insights into the business, understanding the problem at hand, and analyzing and modelling the solutions. Along with mastering statistical and technical tools, Data scientists also need to focus on the business requirements.

Watson Knowledge Studio

Teach Watson to discover meaningful insights in unstructured text.

Challenge #5

DATA SECURITY ISSUES:

• In today’s world, data security is a big issue.

• Since data is extracted through a lot of interconnected channels, social media as well as other nodes, there is increased vulnerability to hacker attacks.

• Due to the confidentiality element of data, Data scientists are facing obstacles in extracting and using data and in building models or algorithms.

• The process of obtaining consent from users is causing a major delay in turnaround time and cost overruns.

SOLUTION: For this aspect, there are no shortcuts. One has to follow the established global data protection norms. There need to be additional security checks and use of cloud platforms for data storage. Organizations also actively need to use advanced solutions that involve Machine Learning to safeguard against cybercrimes and fraudulent practices.

Watson Machine Learning

Use your data to create, train, and deploy self-learning models. Leverage an automated, collaborative workflow to build intelligent applications.

Real-world Challenges are the Greatest Teachers

A popular proverb says, “rough seas make good sailors”!

• Rather than focusing on the theoretical aspects, data modelers need to approach their jobs with pragmatism.

• Data Science is not all about building models and algorithms.

• Analyzing data sets and predicting the outcome is as much an art as a science.

• Without the human element, the whole process of Data Science would be rendered meaningless.

• By facing real-world challenges, Data Scientists will eventually learn to be proactive, creative and innovative in their approach.

Watson Natural Language Understanding

Natural language processing for advanced text analysis.

Thank you!

Rob Burton

Cognitive Solutions Specialist

Natural Resources Solution Center, IBM Canada

This article was the basis for this presentation: https://www.ibm.com/blogs/bluemix/2017/12/data-preparation-refining-now-integrated-watson-data-platform/

Data preparation and refining – Now integrated in Watson Data Platform

In the world of Data Science, the time required to transform data to good quality is a recurring barrier to gaining insights. Data scientists and analysts spend the bulk of their effort cleaning data with a variety of handwritten scripts. IBM Watson Data Platform’s data refining tools aim to reduce the pain associated with creating good quality data. The tool has an intuitive user interface and templates enabled with powerful operations to shape and clean data. It also provides metrics and data visualization, which aid in every step of the process. Incremental snapshots of the results are provided, allowing the user to gauge success with each iterative change. Saving, editing, and running the steps within projects make it possible to refine data of almost any size within the Watson Data Platform.

IBM’s Data Refining tools are now available in Watson Data Platform’s open beta. We invite you to experience these data refining capabilities offered as an integrated experience in the Watson Data Platform.

1. Quick transformation to refine many types of data sources

You can connect to data sources in the public cloud or on-premises and refine the data within Watson Data Platform. Most common data formats are supported, such as CSV, delimiter-separated files, JSON, Parquet, and Avro, as well as relational and non-relational databases.

Here is a sample process (Fig 1): the tool selects a subset of the dataset, on which transformations can be applied iteratively. If these transformations provide the required result, they can be applied to the full dataset. Complex transformations can be applied quickly to perform complex data manipulation. For example, you can split a single column into multiple columns by auto-detecting separators, by providing regular expressions, or by specifying positions in the data. You can merge multiple columns into one and calculate columns from existing ones using custom formulas or conditions.
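The column operations described above can be sketched in plain Python. This is only an illustration of the idea, not the tool's implementation: the function names, the candidate separator list, and the sample well data are all assumptions made for the example.

```python
from collections import Counter

# Assumed candidate set; the real tool's detection rules are not documented here.
CANDIDATE_SEPARATORS = [",", ";", "|", "\t", ":"]


def detect_separator(values):
    """Return the candidate that occurs the same non-zero number of times in most values."""
    if not values:
        return None
    best, best_support = None, 0
    for sep in CANDIDATE_SEPARATORS:
        per_value_counts = Counter(value.count(sep) for value in values)
        count, support = per_value_counts.most_common(1)[0]
        if count > 0 and support > best_support:
            best, best_support = sep, support
    return best


def split_column(rows, column, new_names):
    """Split one column into several, using the auto-detected separator."""
    sep = detect_separator([row[column] for row in rows])
    if sep is None:
        raise ValueError(f"no separator detected in column {column!r}")
    result = []
    for row in rows:
        parts = row[column].split(sep, len(new_names) - 1)
        parts += [""] * (len(new_names) - len(parts))  # pad short values
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_names, (p.strip() for p in parts)))
        result.append(new_row)
    return result


# Hypothetical sample data.
rows = [
    {"well": "W-1", "location": "Calgary|AB|Canada", "oil_bbl": 120, "water_bbl": 30},
    {"well": "W-2", "location": "Estevan|SK|Canada", "oil_bbl": 80, "water_bbl": 20},
]

split_rows = split_column(rows, "location", ["city", "province", "country"])

for row in split_rows:
    # Merge two columns into one.
    row["city_province"] = f'{row["city"]}, {row["province"]}'
    # Calculate a column from existing ones with a custom formula.
    row["water_cut"] = row["water_bbl"] / (row["oil_bbl"] + row["water_bbl"])
```

The separator heuristic favours the character that appears a consistent number of times per value, which is one simple way to approximate "auto-detection"; a regular expression or fixed positions could be substituted in `split_column` for data where no single separator is consistent.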

A variety of frequently used data cleaning functions are provided, such as data de-duplication, empty row removal, and missing value replacement. Text operations (sub-string replacement using regular expressions, string concatenation, character padding, case conversion) and math operations (absolute value, ceiling, floor, square root) are supported. The tool also provides core data refining operations, including filtering, sorting, and column removal. Operations such as join, merge, and transposition of multiple data sets are being added.

Fig 1: Auto-detecting delimiters to split columns
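As a rough illustration of the cleaning functions listed above (de-duplication, empty row removal, missing value replacement, and a few text and math operations), here is a minimal standard-library Python sketch. The helper names, the column names, and the default values are assumptions for the example and do not reflect how the tool itself is implemented.

```python
import math
import re


def is_empty(row):
    """A row counts as empty when every value is None or blank."""
    return all(v is None or str(v).strip() == "" for v in row.values())


def clean(rows, defaults):
    """Drop empty rows, replace missing values with defaults, then de-duplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        if is_empty(row):
            continue
        filled = {k: (defaults.get(k) if v in (None, "") else v) for k, v in row.items()}
        key = tuple(filled[k] for k in sorted(filled))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(filled)
    return cleaned


def tidy(row):
    """Apply text and math operations to one row."""
    out = dict(row)
    out["site"] = re.sub(r"\s+", " ", row["site"]).strip().upper()  # regex replace + case conversion
    out["code"] = str(row["code"]).zfill(5)                         # character padding
    out["reading"] = math.floor(abs(row["reading"]))                # absolute value + floor
    return out


# Hypothetical sample data.
rows = [
    {"site": "fort   mcmurray", "code": "42", "reading": -3.7},
    {"site": "fort   mcmurray", "code": "42", "reading": -3.7},  # duplicate
    {"site": "", "code": "", "reading": None},                    # empty row
    {"site": "calgary", "code": "7", "reading": None},            # missing reading
]

cleaned = [tidy(r) for r in clean(rows, defaults={"reading": 0.0})]
```

The order of the steps matters: filling missing values before de-duplicating means two rows that differ only by a blank versus a default value are treated as duplicates.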

2. Advanced data shaping operations

Watson Data Platform provides advanced data shaping operations that are easy to use. The coding editor provides templates to help you build the structure of the command. The tool provides content assist to convert the command into executable code. Click on the templates and build advanced transformations to refine data.

You can select or reorder columns using name pattern matches or ranges. Templates provide a rich set of options for each transformation.

You can also apply advanced operations conditionally on multiple columns using a rich set of built-in functions and expression syntax. The tool has a rich library of built-in aggregation (sum, count, average, etc.), summary, and sorting functions (Fig 2).

Fig 2: Template guided and content assisted coding
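To make the shaping operations in this section concrete (selecting columns by name pattern, applying an operation conditionally across several columns, and aggregating), the following is a small Python sketch. It does not use the tool's own expression syntax, which is not shown in this presentation; every function name and the sample data are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def select_columns(rows, pattern):
    """Keep only the columns whose name matches a shell-style pattern."""
    return [{k: v for k, v in row.items() if fnmatchcase(k, pattern)} for row in rows]


def apply_if(rows, columns, condition, transform):
    """Apply transform to the named columns, but only where condition holds."""
    return [
        {k: (transform(v) if k in columns and condition(v) else v) for k, v in row.items()}
        for row in rows
    ]


def aggregate(rows, group_by, value):
    """Compute sum, count and average of one column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {
        key: {"sum": sum(vals), "count": len(vals), "average": sum(vals) / len(vals)}
        for key, vals in sorted(groups.items())
    }


# Hypothetical sample data.
rows = [
    {"region": "north", "q1": 10, "q2": -5},
    {"region": "north", "q1": 30, "q2": 20},
    {"region": "south", "q1": -1, "q2": 8},
]

# Conditionally replace negative values with zero in both quarter columns.
clipped = apply_if(rows, {"q1", "q2"}, lambda v: v < 0, lambda v: 0)
quarters_only = select_columns(clipped, "q*")
summary = aggregate(clipped, "region", "q1")
```

Passing the condition and the transformation as functions keeps the conditional step reusable across columns, which is roughly the role a template plays in a guided editor.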

3. Profile view

A quick way to get an understanding of your data is to look at the metrics of data distribution (Fig 3). Visibility into how the data distribution changes after each step in the “refining” flow helps you build the right steps in the iterative cycle. Data distributions show the frequency of occurrence of the values in the data, along with counts for missing values.

Fig 3: Data distributions for integer and string columns
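A column profile of the kind described here (value frequencies plus a count of missing values) can be approximated in a few lines of Python. This is a sketch under assumptions: the `profile_column` helper, the treatment of `None` and empty strings as missing, and the sample data are all made up for illustration.

```python
from collections import Counter


def profile_column(rows, column):
    """Return value frequencies and a missing-value count for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "frequencies": dict(Counter(present)),
    }


# Hypothetical sample data.
rows = [
    {"status": "active"},
    {"status": "active"},
    {"status": "idle"},
    {"status": None},
    {"status": "active"},
    {"status": ""},
]

before = profile_column(rows, "status")

# One refining step: replace missing values, then profile again to see the change.
refined = [{"status": r["status"] if r["status"] else "unknown"} for r in rows]
after = profile_column(refined, "status")
```

Comparing `before` and `after` shows the two missing values moving into an explicit `unknown` category, which is the kind of step-by-step feedback the profile view is meant to give.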

4. Visualization

Another way to get an understanding of the data is to look at the distribution visually (Fig 4). Watson Data Platform has a large selection of built-in visualization tools. It suggests an appropriate chart based on the data in the selected columns. You can take the suggested chart or choose one manually and customize the visualization to your liking.

Fig 4: Visualizations recommended by the tool for the selected columns
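The presentation does not say how the tool decides which chart to suggest. Purely as an illustration of the general idea of recommending a chart from column data, here is a toy Python heuristic; the rules, the chart names, and the `suggest_chart` function are assumptions and should not be read as the tool's actual logic.

```python
def column_kind(values):
    """Classify a column as numeric or categorical from its non-missing values."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numeric" if is_numeric else "categorical"


def suggest_chart(columns):
    """Suggest a chart type for one or two selected columns (made-up rules)."""
    kinds = sorted(column_kind(values) for values in columns.values())
    if kinds == ["numeric"]:
        return "histogram"
    if kinds == ["categorical"]:
        return "bar chart"
    if kinds == ["numeric", "numeric"]:
        return "scatter plot"
    if kinds == ["categorical", "numeric"]:
        return "box plot"
    return "table"


# Hypothetical column selections.
one_numeric = suggest_chart({"depth_m": [1200, 1350.5, None, 980]})
one_text = suggest_chart({"status": ["active", "idle", "active"]})
mixed = suggest_chart({"status": ["active", "idle"], "depth_m": [1200, 980]})
```

A real recommender would likely also consider cardinality and value ranges; this sketch only looks at whether each selected column is numeric or not.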

5. Iterative flow development and management

Cleaning data is almost always a multi-step iterative process. You can choose from a variety of connections to access the data. The data can be from over 25 diverse types of sources, from flat files to relational and non-relational databases. Once connected, the refining process involves building a flow with multiple steps for data cleaning and manipulation applied on a sample dataset. You can save the steps in your flow into a project, and modify it later. Once the flow produces clean data on the sample dataset, it is ready to be applied to the full data set using Spark. Once the job is completed, the transformed data is saved into the target location.
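The sample-then-full pattern described above can be sketched as a flow that is simply an ordered list of steps. Note the assumptions: this is plain Python rather than Spark, and the step functions, the sample size, and the data are hypothetical stand-ins for whatever a real flow would contain.

```python
from itertools import islice


def drop_blank_names(rows):
    """Step 1: remove rows whose name is blank."""
    return (row for row in rows if row["name"].strip())


def normalise_names(rows):
    """Step 2: trim whitespace and convert names to title case."""
    return ({**row, "name": row["name"].strip().title()} for row in rows)


def run_flow(rows, steps):
    """Apply each step in order and materialise the result."""
    for step in steps:
        rows = step(rows)
    return list(rows)


# The flow is saved as data (a list of steps), so it can be edited and re-run.
flow = [drop_blank_names, normalise_names]

# Hypothetical full dataset.
full_dataset = [
    {"name": " alice "},
    {"name": ""},
    {"name": "BOB"},
    {"name": "carol"},
    {"name": "  "},
]

# Develop and check the flow on a small sample first...
sample = list(islice(full_dataset, 3))
preview = run_flow(sample, flow)

# ...then apply the same, unchanged flow to the full dataset.
result = run_flow(full_dataset, flow)
```

Because the same `flow` object is used for both runs, whatever was validated on the sample is exactly what is applied to the full data; only the input changes.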

From the data flow details view in the project, you can monitor the flow execution status of current or previous runs, the sources and targets used, and the amount of data processed. You can re-use or enhance frequently used flows by opening them in the refiner, enhancing them, optionally changing target locations, and re-running them.

Summary

Data refining tools are now seamlessly integrated in Watson Data Platform. They allow both data scientists, who like to code, and data analysts, who prefer visual tools, to build repeatable data refining flows iteratively through rapid visual feedback and a rich set of transformations. The shaped data can be analyzed in the same project using Data Science Experience to produce valuable insights.