PERSONALLY IDENTIFIABLE INFORMATION (PII) SCANNER TECHNOLOGY WHITE PAPER

TABLE OF CONTENTS

INTRODUCTION

TECHNICAL INFORMATION

PATTERN CHECKING

NOISE / DELIMITER CHECKING

PII Scanner Technology | White Paper

Introduction

RSI Security’s Personally Identifiable Information (PII) Scanner is purpose-built for professionals who handle client data & documents that may contain PII such as Social Security, driver’s license, & credit card numbers. Identifying PII that resides electronically on computer systems is crucial to responsibly managing this most sensitive of client identity data.

PII such as Social Security numbers and dates of birth is the foundation on which individuals create and verify an identity, and it can determine our ability to obtain automobile or home loans, or even to get a job. Given this significance, it is critical to protect the confidentiality of PII, and professionals who collect PII in the course of handling their clients’ finances bear an enormous responsibility to ensure that it does not fall into the wrong hands.

The downstream consequences of inadvertent PII disclosure can prove catastrophic both to a company’s clients (identity theft can take years to resolve) and to the company itself. As of this paper’s writing in October 2017, Equifax is embroiled in what may prove to be the most consequential data breach to date, with the PII of over 146M US citizens stolen and possibly released to the dark web for untold numbers of hackers and criminals to access.

Consumers whose PII was stolen are likely to experience credit denials and identity fraud for decades to come, unless laws are changed to institute a new confidential identifier. Equifax itself faces potentially unlimited financial liability and reputational damage.

Understanding the grave consequences of unauthorized PII disclosure, FINRA has for the past few years stressed the importance of managing & protecting PII:

1. Minimize initial collection and storage of client PII

2. Understand where PII is stored on computer systems or hardcopy

3. Assess whether the storage or transmission of PII is absolutely necessary

4. Encrypt the transmission of PII when communicating with authorized parties

5. Limit the exposure of PII to third parties that have access to your systems

6. Purge PII whenever and wherever possible

These guidelines are intended to limit the chances that customer PII is accessed and ultimately exploited by bad actors.

To support these efforts, RSI has created specialized software that scans computer systems for PII and notifies the user of the type and location (the folder / directory in which it resides) of any PII found. Armed with this knowledge, professionals can then reduce their security and liability scope by eliminating or encrypting client PII.

Technical Information

The RSI scanner employs our own flexible scanning algorithm to improve detectability and minimize false positives. It examines files bit-by-bit to identify PII in a variety of file types located on a user’s hard drive: txt, docx, xlsx, pptx, pdf, rtf, doc, xls, csv, ppt, odt, ods, odp, ibd, and myd.

The scanner’s capabilities are distinguished by its specific algorithms, templates, & accessed databases, which ultimately dictate its speed, sensitivity, detectability, and false-positive reporting levels. It employs flexible detection algorithms to maximize sensitivity and uses pattern matching to minimize the detection and reporting of false positives.

RSI’s PII scanner detects a broad spectrum of credit card types, from 13-digit to 19-digit US-based credit cards as well as a few European & Asian card brands.

Specific credit cards supported:

Visa: 13, 16, 19 digits
Mastercard: 16 digits
American Express: 15 digits
Discover: 16 digits
Maestro: 16, 18, 19 digits
Dankort: 16 digits
Solo/Switch: 16 digits
JCB: 15, 16 digits
InstaPay: 16 digits
Enroute: 15 digits
Diners Club: 14 digits

A credit card number’s length and its starting digits will vary by issuer, but all numbers must pass a checksum algorithm called “Luhn” or “Luhn check”.
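As a minimal sketch of the Luhn check mentioned above (not RSI's actual implementation), the checksum can be written in a few lines of Python. The function name `luhn_valid` is hypothetical and chosen only for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk the digits from right to left, doubling every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # equivalent to adding the two digits of the product
        total += d
    # A valid number has a digit sum that is a multiple of 10.
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # True
print(luhn_valid("4111111111111112"))  # False
```

Because the checksum only constrains the last digit, roughly one in ten random digit strings of the right length will pass it, which is why the scanner combines it with the additional checks described below.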

Scanner Process:

1. Reads through files potentially containing PII

2. Checks whether a set of digits could be a legitimate PII number

3. Runs a Luhn check on the number just scanned

4. Reports the number as a possible PII if it passes the checksum

While scanning the numbers (Step 2), to minimize false positives and improve efficiency, our scanner utilizes 1) Pattern Checking and / or 2) Noise / Delimiter Checking techniques.
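The four-step process above can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the scanner's real code: it handles only plain-text input, treats any standalone run of 13 to 19 digits as a candidate in Step 2, and uses hypothetical names (`CANDIDATE`, `scan_text`, `scan_file`).

```python
import re
from pathlib import Path

# Step 2 (assumed rule): a standalone run of 13 to 19 ASCII digits.
CANDIDATE = re.compile(r"(?<![0-9])[0-9]{13,19}(?![0-9])")


def luhn_valid(number: str) -> bool:
    """Step 3: Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch) * (2 if i % 2 else 1)
        total += d - 9 if d > 9 else d
    return total % 10 == 0


def scan_text(text: str) -> list[str]:
    """Steps 2 to 4: return candidate numbers that pass the checksum."""
    return [m.group() for m in CANDIDATE.finditer(text) if luhn_valid(m.group())]


def scan_file(path: str) -> list[str]:
    """Step 1: read a plain-text file, then scan its contents."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return scan_text(text)


print(scan_text("card 4111111111111111, ref 1234567890123456"))
# ['4111111111111111']
```

Binary formats such as docx or pdf would need a text-extraction step before Step 2, which this sketch leaves out.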

Pattern Checking

To assess whether a potential VISA card number is legitimate, our scanner analyzes text files containing “4111 1111 1111 1111” (or any other 16-digit number that passes the Luhn check). A legitimate number would be stored in one of the following formats:

1. Number-only format (e.g., 4111111111111111)

2. Spaced format (e.g., 4111 1111 1111 1111)

3. Dashed format (e.g., 4111-1111-1111-1111)

If the character immediately preceding or following the number is a digit, the chance that the number is a legitimate value is quite low. For example, we would not expect the (4111) VISA number to be stored as “123441111111111111112345”, so the scanner does not treat such a sequence as a match.
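One way to express this kind of pattern check is with a regular expression. The sketch below is an assumption-laden illustration, not the scanner's actual pattern set: it covers only 16-digit numbers beginning with 4, accepts the three formats listed above, and rejects any match whose neighboring character is a digit. The names `VISA_16` and `find_visa_candidates` are hypothetical.

```python
import re

VISA_16 = re.compile(
    r"(?<![0-9])"                    # preceding character must not be a digit
    r"(4[0-9]{15}"                   # 1. number-only format
    r"|4[0-9]{3}(?: [0-9]{4}){3}"    # 2. spaced format
    r"|4[0-9]{3}(?:-[0-9]{4}){3})"   # 3. dashed format
    r"(?![0-9])"                     # following character must not be a digit
)


def find_visa_candidates(text: str) -> list[str]:
    """Return matches in one of the three formats, normalized to digits only."""
    return [re.sub(r"[ -]", "", m.group(1)) for m in VISA_16.finditer(text)]


print(find_visa_candidates("4111 1111 1111 1111"))        # ['4111111111111111']
print(find_visa_candidates("4111-1111-1111-1111"))        # ['4111111111111111']
print(find_visa_candidates("123441111111111111112345"))   # []
```

Candidates found this way would still go through the Luhn check before being reported. The trade-off is that each supported length and format needs its own pattern.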

Noise / Delimiter Checking

Noise is defined as characters that the scanner ignores, such as a space or dash, which could be placed between the digits of a legitimate number. Delimiters are characters at which the scanner resets and restarts its digit search, because they cannot be placed between the digits of a primary account number (PAN), e.g., commas, semicolons, letters, and many others.
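The noise and delimiter definitions above can be sketched as a single pass over the text. This is a simplified illustration under assumptions, not the scanner's actual code: noise is limited to the space and dash named above, every other non-digit character is treated as a delimiter, and the 13 to 19 digit bounds and the names `NOISE` and `digit_runs` are chosen for illustration.

```python
NOISE = {" ", "-"}  # characters to ignore between digits (assumed set)


def digit_runs(text: str, min_len: int = 13, max_len: int = 19) -> list[str]:
    """Collect digit runs, skipping noise and resetting at delimiters."""
    runs = []
    current = []

    def flush():
        # Keep the run only if its length could be a card number.
        if min_len <= len(current) <= max_len:
            runs.append("".join(current))
        current.clear()

    for ch in text:
        if ch in "0123456789":
            current.append(ch)   # digit: extend the current run
        elif ch in NOISE:
            continue             # noise: ignore and keep collecting
        else:
            flush()              # delimiter: close the run and restart
    flush()                      # close any run still open at end of text
    return runs


print(digit_runs("PAN: 4111 1111-1111 1111;"))  # ['4111111111111111']
print(digit_runs("4111,1111,1111,1111"))        # []
```

Note that the first example mixes spaces and a dash and is still detected, which shows the added sensitivity of this approach. Each returned run would then be passed to the Luhn check.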

Employing these techniques improves the detectability & sensitivity of the scanner but leaves it potentially susceptible to false positives: after noise is ignored, if the digit length meets the detection criteria (e.g., 16 for VISA), the scanner runs the Luhn check and reports the number if it passes the checksum. This false-positive issue becomes especially serious when scanning for SSNs, as SSN rules are quite loose and do not involve any checksum.

Our scanner is based on the noise / delimiter checking technique, as it offers better sensitivity in scanning. It is also quite flexible in expanding to detect various PAN formats (e.g., 13-digit to 19-digit), as it does not check the pattern first. However, it showed false-positive issues, especially with SSNs. Therefore, to minimize false positives, we also check some patterns after detecting numbers with the noise / delimiter checking technique. For SSNs, we report only numbers in 1) ddd-dd-dddd, 2) ddddddddd, or 3) ddd dd dddd format.

About RSI Security

RSI is the nation’s premier information security and compliance provider dedicated to helping organizations achieve risk-management success. We work with some of the world’s leading companies, institutions and governments to ensure the safety of their information and their compliance with applicable regulation. We are also a security and compliance software ISV and stay at the forefront of innovative tools to save assessment time, increase compliance and provide additional safeguard assurance. With a unique blend of software-based automation and managed services, RSI can assist organizations of all sizes in managing IT governance, risk management and compliance (GRC) efforts.

858.999.3030

4370 La Jolla Village Drive • Suite 200 • San Diego, CA 92122

[email protected] | www.rsisecurity.com