Three Critical Steps to Improving Product Data Quality

A DataFlux White Paper
Prepared by Jim Harris


Introduction

Convincing your organization to view data as a strategic corporate asset and, by extension, data quality as a strategic corporate discipline can be challenging. The relationship between business processes and the data used and created by those processes is not always obvious and tangible. In other words, how does the organization’s data affect its business decisions and its ability to succeed?

Since the strategic importance of one corporate asset has never been in question, namely the products your organization sells, the data describing those products must be of sufficient quality to support optimal business performance, right?

Let’s imagine you work for Acme Foods and are making a presentation to executive management about the need for improvements in product data quality. You tell the eight executives in the room that each has on the table in front of him a different product from a current list of Acme Foods’ Top 100 Best Selling Products.

The executives are confused because they all have the same kind of candy bar in front of them. Each has a card attached with a number and some text written on it. You explain that the number is the sales rank and the text is the product description, which was copied directly from the Acme Foods master product catalog.

They pass their candy bars around the room, pausing to read the attached cards. After a few minutes, you display the following chart as your only presentation slide:

You briefly point out a few of the obvious product data quality issues:

Numerous variations in the official brand name (E<3MC2), which stands for “Everybody Loves Milk Chocolate Squared”

Six duplicate records describe one product (excluding #15 and #55), meaning (at least) six of the top 100 best sellers are actually the same product

#15 is not a duplicate because of a different unit count based on packaging; #55 is not a duplicate because of a different unit size (i.e., it is a bag of ten smaller chocolate squares instead of one larger chocolate square candy bar)


Business Impacts of Poor Product Data Quality

Confronted with a tangible demonstration of the product data quality issues plaguing Acme Foods, the executives begin discussing some of the business impacts:

Sales Forecasting – Incorrect sales numbers negatively impact the ability to predict sales trends as well as plan future product marketing and promotions

Spend Analysis – Incorrect sales also negatively impact the procurement planning for purchasing the raw materials to make the products being sold

Supply Chain Optimization – Incorrect procurement levels trigger manufacturing disruptions and inefficiencies throughout the supply chain

Inventory Management – Incorrect inventory levels cause order fulfillment delays in distribution channels, leading to delayed revenues or lost sales

In far more general terms, the bottom line is that poor product data quality:

Increases costs

Decreases revenue

Increases risks

Disrupts daily operations

Causes bad tactical business decisions

Undermines strategic corporate planning

Even though Acme Foods prides itself on excellent business process management as well as hiring and then investing in great people and implementing the latest technology, none of these best practices can save it from the havoc wreaked by poor data quality.

Data must be viewed as a strategic corporate asset and, by extension, data quality a strategic corporate discipline, because high-quality data serves as a solid foundation for success, enabling better business decisions and optimal business performance.

Congratulations! The Acme Foods executives just approved a product data quality improvement project. Now what? How will you approach this daunting challenge?

Using Acme Foods as a fictional case study, this white paper will describe a general approach for planning your organization’s efforts to improve product data quality. It will provide a data-example-driven perspective of some of the unique challenges of product data quality, as well as discuss and demonstrate the three critical steps to improving product data quality.


Unique Challenges of Product Data Quality

Product data presents some challenges that are different from other data domains. The first unique challenge of product data quality is that “product” is a generic term that can mean many different things. For example, a product could refer to:

Raw materials used to manufacture products, e.g., the cocoa beans that Acme Foods purchases as a raw material for manufacturing chocolate

Semi-finished goods from an intermediate stage of product development, e.g., the couverture chocolate that Acme Foods uses to make candy bars

Finished goods, i.e., stock-keeping units (SKUs), which may be a single product, a package containing several products, or multiple products within the same brand based on packaging variations in the unit size and unit type

Example of Packaging Variations


Other data domains, such as Customer Name and Postal Address, have a relatively small set of easily defined and recognized data attributes and data quality standards. (However, these definitions and standards are not always consistently enforced.)

But the complex product supply chain includes manufacturers, distributors, suppliers, wholesalers, retailers and other vendors. All of these organizations typically maintain their own product catalogs, often with inconsistent data quality standards.

There are some standards for product data quality, but they are not yet as widely adopted as standards for other data domains. Examples of these standards include:

United Nations Standard Products and Services Code (UNSPSC) – defines over twenty thousand categories of common commodities and services

Uniform Code Council (UCC) – specializes in data standards for bar codes and electronic data interchange (EDI), primarily for North America

European Article Numbering (EAN) – European standards similar to UCC

EPCglobal – collectively established by UCC and EAN to develop standards for electronic product codes (EPC) and radio-frequency identification (RFID)

Universal Product Code (UPC) – worldwide bar code standard for the electronic identification of containers, pallets, cases, products and SKUs

These standards can assist with establishing consistent product descriptions and assigning unique product identifiers. However, these identifiers can suffer from the same data entry errors and data formatting variations as identifying attributes for other data domains. Also, these identifiers may not always be available and could be replaced with proprietary product identifiers, or even database surrogate keys.
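As one illustration of how data entry errors in product identifiers can be caught, the 12-digit UPC-A format carries a modulo-10 check digit as its last digit. The sketch below validates that digit; the function names are illustrative and are not taken from any particular data quality tool. A passing check digit only shows that a code is well formed, not that it identifies the right product.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # 1st, 3rd, ..., 11th digits
    even_sum = sum(digits[1::2])  # 2nd, 4th, ..., 10th digits
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """Return True when a 12-digit string has a consistent check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])
```

A check like this catches every single-digit typo and many transposed digits at entry time, before the identifier is used for matching.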

Therefore, effectively implementing these or other product data standards often requires matching based on product description, which is usually unstructured, meaning that most product data attributes are buried within a free-form text field. And when you are creating your own product data standards, or receiving third-party product data that follows a different standard (or none at all), recognizing and extracting product data attributes from a free-form text field will be your primary task.

Categorizing, standardizing and matching product descriptions are thus three fundamental challenges to overcome when improving product data quality. Data quality tools provide considerable assistance with these challenges. However, compared to other data domains, a product data quality project will typically require more customization of what the data quality tool provides “out of the box.”

Most of the customization effort is teaching the tool how to understand what are essentially the vocabulary, spelling and grammar of the product data “language.”


Improving Product Data Quality

The three critical steps to improving product data quality are:

1. Categorization

2. Standardization

3. Matching

The remainder of this white paper will discuss and demonstrate these concepts from a data-example-driven perspective using the fictional products of Acme Foods.

Categorization

Determining the product category is an important first step because the category provides context for the product description, where the same words, abbreviations and symbols can mean something different within different product categories.

For example, consider the following product descriptions:


Many large organizations have diverse product catalogs using a complex taxonomy or hierarchy of product categories, which are often managed by different groups of subject matter experts (SMEs). Categories are sometimes keywords that are found within the product description, but most often the category must be extrapolated from a semantic understanding of the product description.

By determining the category for these product descriptions, we can begin to divide and conquer the challenge of improving product data quality by using category as a filter to route records to category-specific standardization processes. Data quality tools provide assistance by parsing the free-form product description to search for the keywords, phrases and other logic necessary for categorization.

For simplicity, the data examples we are working with only represent two categories, Candy and Beverage. But simply categorizing all product descriptions containing the word “Chocolate” as Candy and “Sugar” as Beverage would improperly categorize both the Chocolate Energy Drink and the Sugar Chewing Gum.

Therefore, the automated categorization process provided by the data quality tool has to use natural language processing and encode the knowledge of data SMEs. The Acme Foods SMEs have helped us properly categorize the product descriptions:

Please Note: It is a recommended best practice to design your categorization process as a separate function so that the technical processes are aligned naturally with the category-specific business rules provided by the product data SMEs.
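To make the Chocolate Energy Drink and Sugar Chewing Gum problem concrete, here is a minimal sketch of a categorization function kept separate from standardization. It uses plain keyword rules rather than real natural language processing, and both rule tables are hypothetical stand-ins for SME knowledge, not content from the Acme Foods catalog. The idea is that words naming the kind of product (drink, gum) outrank words naming an ingredient (chocolate, sugar).

```python
import re

# Hypothetical rule tables standing in for SME knowledge (assumptions, not
# taken from the white paper). Product-type words are checked first.
PRODUCT_TYPE_WORDS = {
    "drink": "Beverage",
    "water": "Beverage",
    "gum": "Candy",
    "bar": "Candy",
}
# Ingredient words are only a fallback when no product-type word is present.
INGREDIENT_WORDS = {
    "chocolate": "Candy",
    "sugar": "Beverage",
}


def categorize(description: str) -> str:
    """Assign a category, preferring product-type words over ingredient words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    for rules in (PRODUCT_TYPE_WORDS, INGREDIENT_WORDS):
        # Scan right to left: in English descriptions the head noun tends to come last.
        for token in reversed(tokens):
            if token in rules:
                return rules[token]
    return "Uncategorized"
```

Records that fall through to "Uncategorized" would be candidates for SME review and new rules.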


Standardization

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value and a general lack of data quality standards. Additional variations are introduced by multiple data sources, each with its own unique data characteristics and data quality challenges.

Standardization parses free-form fields to break them down into smaller individual fields to gain improved visibility of the available input data, create a more consistent representation, apply standard values and, when possible, populate missing values. However, it is important to note that sometimes what appear to be semantic inconsistencies in product data are intentional variations to accommodate such aspects as regional and linguistic differences, as well as special promotions.

Therefore, the standardization process should be designed to be as modular as possible to support a plug-and-play approach for its various components, just as categorization and standardization were recommended to be separate processes.

Data quality is determined by evaluating the data’s fitness for the purpose of business use. However, in the vast majority of cases, data has multiple business uses, and data of sufficient quality for one use may not be of sufficient quality for other valid business uses. When the standardization process has a flexible architecture, it is easier to convert among various product data standards and support a wider range of business purposes.
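One way to picture a modular, plug-and-play design is a registry that maps each category to an ordered list of small, independent steps. The sketch below is only an illustration under that assumption: the step functions, the abbreviation table and the `PIPELINES` registry are hypothetical names, not features of a specific data quality tool.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[Record], Record]

# Hypothetical category-specific vocabulary supplied by Candy SMEs.
CANDY_ABBREVIATIONS = {"CHOC": "CHOCOLATE", "MLK": "MILK"}


def collapse_whitespace(record: Record) -> Record:
    return {**record, "description": " ".join(record["description"].split())}


def uppercase_description(record: Record) -> Record:
    return {**record, "description": record["description"].upper()}


def expand_candy_abbreviations(record: Record) -> Record:
    words = [CANDY_ABBREVIATIONS.get(w, w) for w in record["description"].split()]
    return {**record, "description": " ".join(words)}


# Each category plugs in its own ordered list of steps.
PIPELINES: Dict[str, List[Step]] = {
    "Candy": [collapse_whitespace, uppercase_description, expand_candy_abbreviations],
    "Beverage": [collapse_whitespace, uppercase_description],
}


def standardize(record: Record, pipelines: Dict[str, List[Step]] = PIPELINES) -> Record:
    """Route a categorized record through its category-specific steps."""
    for step in pipelines.get(record["category"], []):
        record = step(record)
    return record
```

Because each step is an ordinary function, a regional or promotional variant can be supported by swapping one step in one category's list without touching the others.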

Most of the product attributes in our data examples are stored within the overloaded description field, such as unit count, unit size, unit measure and unit type. Even when the data source contains these attributes as separate fields, they can be sparsely populated or contain defaults or other values conflicting with the content of the product description field.

Our product data standardization process is going to create the following fields:

Brand – the brand name of the Acme Foods product

Unit Count – the number of units in the packaged product

Unit Size – the number associated with the unit of measurement

Unit Measure – the unit of measurement for the product

Unit Type – the packaging type of the product

Product Description – remaining description not covered by the above fields

Please Note: Many additional fields are commonly created when standardizing product data, especially to facilitate improved matching, but this white paper focuses on the above fields for the purposes of demonstrating standardization concepts.
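The target fields listed above could be represented as a small record type. This is a sketch only: the class name, the Python types and the `display` helper are assumptions made for illustration, and the default of 1 for the unit count mirrors the default this paper applies to missing unit counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardizedProduct:
    """Target fields of the standardization process (types are assumptions)."""
    brand: Optional[str] = None
    unit_count: int = 1                  # number of units in the packaged product
    unit_size: Optional[float] = None    # number associated with the unit of measurement
    unit_measure: Optional[str] = None   # unit of measurement, e.g. "OZ"
    unit_type: Optional[str] = None      # packaging type, e.g. "CAN"
    product_description: str = ""        # whatever the fields above do not cover

    def display(self) -> str:
        """Render the fields back into one consistent description string."""
        parts = [self.brand, self.product_description]
        if self.unit_size is not None and self.unit_measure:
            parts.append(f"{self.unit_size:g} {self.unit_measure}")
        parts.append(self.unit_type)
        if self.unit_count != 1:
            parts.append(f"x{self.unit_count}")
        return " ".join(part for part in parts if part)
```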


Candy Brands

Let’s begin by focusing on only the products in the Candy category:

Our Candy SMEs have highlighted in bold the contents of the product description that are appropriate for the new Brand field we are creating in this two-step process. The first step is to separate the brand name content from the product description:

The second step is to standardize the representation of the brand names:

Please Note: Implement these steps separately to make it easier to apply different standards when appropriate (e.g., using regional brand names in a local language).
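The two steps can be sketched as two separate functions. The variant pattern and the lookup table below are hypothetical: the original data examples are not reproduced here, so the spellings handled (extra spaces, mixed case, the spelled-out name) are assumed variations of the E<3MC2 brand, not the actual catalog values.

```python
import re
from typing import Dict, Optional, Tuple

# Step 1 vocabulary: assumed ways the brand might be written in free-form text.
BRAND_VARIANT = re.compile(
    r"E\s*<\s*3\s*MC\s*2|Everybody\s+Loves\s+Milk\s+Chocolate\s+Squared",
    re.IGNORECASE,
)

# Step 2 vocabulary: maps a whitespace-free, uppercased variant to the standard.
# A regional deployment could pass a different table to standardize_brand.
STANDARD_BRANDS: Dict[str, str] = {
    "E<3MC2": "E<3MC2",
    "EVERYBODYLOVESMILKCHOCOLATESQUARED": "E<3MC2",
}


def separate_brand(description: str) -> Tuple[Optional[str], str]:
    """Step 1: pull the raw brand text out of the description, unchanged."""
    match = BRAND_VARIANT.search(description)
    if match is None:
        return None, " ".join(description.split())
    rest = description[:match.start()] + " " + description[match.end():]
    return match.group(0), " ".join(rest.split())


def standardize_brand(raw_brand: Optional[str],
                      standards: Dict[str, str] = STANDARD_BRANDS) -> Optional[str]:
    """Step 2: replace the raw brand text with its standard representation."""
    if raw_brand is None:
        return None
    key = re.sub(r"\s+", "", raw_brand).upper()
    return standards.get(key, raw_brand)
```

Keeping the raw brand text from step 1 also makes it possible to audit what the standardization in step 2 changed.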


Beverage Units

Now let’s focus on only the products in the Beverage category, which have already been branded following the same process described in the previous section:

Our Beverage SMEs have highlighted in bold the contents of the product description that are appropriate for the new Unit fields we are creating in this two-step process. The first step is to separate the unit information from the product description:

The second step is to standardize the representation of the unit information:


Please Note: Missing Unit Counts were populated with “1” as their default value, and the remaining content of the original product description has also been standardized.
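The unit steps can be sketched the same way. The abbreviation tables and regular expressions below are hypothetical Beverage vocabulary chosen for illustration, and uppercasing stands in for the fuller standardization of the remaining description. The first function only separates the raw unit text; the second applies standard values and the default unit count of 1.

```python
import re

# Hypothetical Beverage vocabulary (assumptions, not the Acme Foods rules).
UNIT_MEASURES = {"ounces": "OZ", "ounce": "OZ", "oz": "OZ",
                 "liter": "L", "ltr": "L", "l": "L", "ml": "ML"}
UNIT_TYPES = {"bottle": "BOTTLE", "btl": "BOTTLE", "can": "CAN", "cn": "CAN"}


def _alternation(words):
    # Longest alternatives first so "ltr" is tried before "l".
    return "|".join(sorted(map(re.escape, words), key=len, reverse=True))


COUNT_RE = re.compile(r"\b(\d+)\s*-?\s*(?:pk|pack|ct|count)\b", re.IGNORECASE)
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(" + _alternation(UNIT_MEASURES) + r")\b",
                     re.IGNORECASE)
TYPE_RE = re.compile(r"\b(" + _alternation(UNIT_TYPES) + r")s?\b", re.IGNORECASE)


def _take(pattern, text):
    """Return the first match and the text with that match cut out."""
    match = pattern.search(text)
    if match is None:
        return None, text
    return match, text[:match.start()] + " " + text[match.end():]


def separate_units(description):
    """Step 1: move raw unit text out of the description without changing it."""
    count, rest = _take(COUNT_RE, description)
    size, rest = _take(SIZE_RE, rest)
    unit_type, rest = _take(TYPE_RE, rest)
    return {
        "unit_count": count.group(1) if count else None,
        "unit_size": size.group(1) if size else None,
        "unit_measure": size.group(2) if size else None,
        "unit_type": unit_type.group(1) if unit_type else None,
        "product_description": " ".join(rest.split()),
    }


def standardize_units(raw):
    """Step 2: apply standard values; a missing unit count defaults to 1."""
    return {
        "unit_count": int(raw["unit_count"]) if raw["unit_count"] else 1,
        "unit_size": float(raw["unit_size"]) if raw["unit_size"] else None,
        "unit_measure": UNIT_MEASURES[raw["unit_measure"].lower()] if raw["unit_measure"] else None,
        "unit_type": UNIT_TYPES[raw["unit_type"].lower()] if raw["unit_type"] else None,
        "product_description": raw["product_description"].upper(),
    }
```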

Before and After Standardization


After applying all of the standardization logic described above, we can easily see the dramatic improvement in the data quality of our product data examples:

Matching

Matching for product data is usually performed for one of two purposes: comparing records within and across data sources to evaluate whether they correspond to the same product (i.e., are duplicates), or matching records against a standard product reference (e.g., UNSPSC, to obtain the product commodity classification code).

Matching often uses standardization to prepare its input. This facilitates a direct evaluation of comparable fields (e.g., brand name to brand name) and more reliable comparisons based on standardized values. It also reduces failures to match records because of data variations, and increases the probability of effective match results.

The standardization of our data examples has normalized the product descriptions to the point that the six duplicate records in the Candy category, which were highlighted in the introduction, can now be easily identified as exact matches:
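Once records are standardized, exact matching reduces to grouping records that share the same values in every comparable field. The sketch below assumes each record is a dictionary with an `id` and the standardized fields described earlier; the field names and the sample records are hypothetical, not the catalog entries from the paper.

```python
from collections import defaultdict

# Fields compared for an exact match (names are assumptions for illustration).
KEY_FIELDS = ("brand", "product_description", "unit_count",
              "unit_size", "unit_measure", "unit_type")


def match_key(record):
    """Build a comparable key from the standardized fields of one record."""
    return tuple(record.get(field) for field in KEY_FIELDS)


def find_exact_duplicates(records):
    """Return lists of record ids that share an identical match key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

A record that differs only in unit count or unit size produces a different key, so legitimate packaging variations are not reported as duplicates.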


If the six duplicates were consolidated into a single record, then the E<3MC2 brand could be properly represented as the following three unique Acme Foods products:

Data quality tools support the advanced duplicate consolidation logic often necessary for selecting or constructing the consolidated record (aka “survivor” or “golden copy”).

Obviously, exact matching on rigorously standardized data is neither a recommended best practice nor a limitation imposed by data quality tools, which provide advanced matching techniques for overcoming data variations and other data quality issues. Although those techniques are beyond the scope of this white paper, standardization will still play an important supporting role, especially for improving candidate selection for automated and interactive matching, as well as for searching the product catalog.

Data quality tools also provide some way to rank their match and search results (e.g., numeric probabilities, weighted percentages, odds ratios or confidence levels) as a primary method for differentiating automatic matches, automatic non-matches and potential matches requiring manual review and verification by a SME.
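The three-way split of ranked results can be illustrated with two thresholds on a similarity score. In this sketch, Python's standard `difflib.SequenceMatcher` is a simple stand-in for the advanced matching techniques of a data quality tool, and both threshold values are arbitrary assumptions that would need tuning against SME-reviewed samples.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real values depend on the data and the tool.
AUTO_MATCH = 0.90
AUTO_NON_MATCH = 0.60


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two descriptions, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def classify(score: float,
             auto_match: float = AUTO_MATCH,
             auto_non_match: float = AUTO_NON_MATCH) -> str:
    """Sort a score into automatic match, automatic non-match or manual review."""
    if score >= auto_match:
        return "match"
    if score < auto_non_match:
        return "non-match"
    return "review"
```

Widening the gap between the two thresholds sends more pairs to manual review; narrowing it automates more decisions at the cost of more false matches or missed duplicates.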

After Matching

After matching has performed duplicate identification and consolidation, the updated Acme Foods product catalog now has dramatically improved product data quality:


Searching and matching against this new internal standard product reference can prevent future duplicates from being added to the Acme Foods product catalog.
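A minimal sketch of that preventive check: before a new record is added, it is looked up in the reference by its standardized fields, and the existing entry is returned instead of creating a duplicate. The class, its method names and the field list are hypothetical, and the lookup is exact only; a production catalog would add the ranked matching described earlier.

```python
class ProductReference:
    """In-memory stand-in for an internal standard product reference."""

    KEY_FIELDS = ("brand", "product_description", "unit_count",
                  "unit_size", "unit_measure", "unit_type")

    def __init__(self):
        self._records = {}

    def _key(self, record):
        return tuple(record.get(field) for field in self.KEY_FIELDS)

    def find(self, record):
        """Return the existing entry for this product, or None."""
        return self._records.get(self._key(record))

    def add(self, record):
        """Add a standardized record unless an identical product already exists.

        Returns the existing entry when the record is a duplicate, else None.
        """
        existing = self.find(record)
        if existing is not None:
            return existing
        self._records[self._key(record)] = record
        return None

    def __len__(self):
        return len(self._records)
```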

Summary

Product data presents some challenges that are different from other data domains. The root cause is often the product description, which is usually unstructured, meaning that most product data attributes are buried within a free-form text field.

This white paper provided a data-example-driven perspective of some of the unique challenges of product data quality, as well as discussed and demonstrated the three critical steps to improving product data quality:

1. Categorization – Organizes product descriptions by category, aligning technical processes and business rules with subject matter experts (SMEs), and routes product descriptions to category-specific standardization rules

2. Standardization – In a two-step process, first separates the content of the product description into new fields, and second applies standard values. Implementing these steps separately makes it easier to apply different standards when appropriate (e.g., regional standards in a local language)

3. Matching – Identifies and consolidates duplicate products within a source, facilitates improved search capability and supports matching against an internal or external standard product reference


The fictional data examples from the Acme Foods product catalog demonstrated that I love Sugar Water and Everybody Loves Milk Chocolate Squared (E<3MC2).

But if there is only one fact that you take away from this white paper, let it be this one:

Everybody Loves High Quality Product Data.

To learn more about data quality, visit: dataflux.com/knowledgecenter/dq

www.dataflux.com


DataFlux and all other DataFlux Corporation LLC product or service names are registered trademarks or trademarks of, or licensed to, DataFlux Corporation LLC in the USA and other countries. Copyright © 2010 DataFlux Corporation LLC, Cary NC, USA. All Rights Reserved. Other brand and product names are trademarks of their respective companies.
