RecordLinkage I: Evaluation ofCommercially Available Record … · 2017. 2. 24. · the Fellegi-Sunter record linkage theory and sophisticated string comparison functions in its matching

,

/.;,,=,,;;, United States;£W~\Department of,:Z_~j Agriculture

NationalAgriculturalStatisticsService

Research Division

ST8 Research ReportNumber ST8-95-02

October 1995

Record Linkage I:Evaluation of CommerciallyAvailable Record LinkageSoftware for Use in NASS

Charles Day

RECORD LINKAGE I: EV ALUA TION OF COMMERCIALLY AVAILABLE RECORDLINKAGE SOFTWARE FOR USE IN NASS, by Charles Day, Technology Research Section,Survey Technology Branch, Research Division, National Agricultural Statistics Service, U.S.Department of Agriculture, Washington, DC 20250-2000, October 1995, Report No. STB-95-01.

ABSTRACT

Record linkage is an important technique in NASS for minimizing the presence of duplicate nameson its list sampling frame of farm operators and agribusinesses. In the late 1970' s, NASSdeveloped an automated record linkage system which runs on an IBM mainframe for this purpose.With changes in technology, the need has arisen for portability between platforms, integrationwith client/server technology, and interactive operation. Also, NASS desires to reduce resourceexpenditures on record linkage while maintaining the quality of the process.

The growing availability of commercial record linkage solutions has made unnecessary thedevelopment of a new record linkage system or an expensive and difficult rewrite of the oldsystem. This report evaluates six commercially available record linkage software packages fortheir suitability for NASS's purposes. The report starts with a brief discussion of record linkagein NASS, then discusses the statistical theory behind the most popular probabilistic record linkagesolution, that of Fellegi and Sunter. Next, the report discusses the requirements for a NASSrecord linkage system. Detailed reviews of the six software packages follow. Except for thereview of AUTOMA TCH, which NASS has tested extensively, these reviews are based oninformation provided by the software manufacturers. The report concludes that, for NASS'spurposes, AUTOMA TCH is the best choice. The report ends with a glossary of record linkageterminology and a checklist for the evaluation of record linkage software packages.

KEY WORDS

Record linkage; List sampling frame; Duplication; Software

This report was prepared for limited distribution to the research community outside the U.S.Department of Agriculture. The views expressed herein are not necessarily those ofNASS orUSDA.

ACKNOWLEDGEMENTS

The author would like to thank the members of the List Sampling Frame Section for theirassistance in developing the requirements and explaining NASS's list sampling frame maintenanceprocedures, Kara Broadbent of the Ohio Research Unit for her work in testing AUTOMA TCH,and the representatives of the manufacturers of the software reviewed in this report for theirassistance in understanding their products.

Table of Contents

SL ':\.fl\..1AR Y v

:INTRODUCTION 1

Uses of Record Linkage in NASS 1

A Brief History of Record Linkage in NASS 2

Purpose of This Report 3

BACKGROUND 4

Comparison Outcomes and Methods of Comparing Character Strings 4

The Fellegi-Sunter Record Linkage Theory 6

Blocking 9

Standardization of Names and Addresses 10

RECORD LINKAGE REQUIREMENTS 10

General Requirements 11

Hardware and Software Environment 11

Record Linkage Methodology 12

Data Handling 13

Post-Linkage Functions 14

REVIE\VS OF SOFTWARE PACKAGES 15

AUTOMATCH/AUTOSTAN (MatchWare Technologies) 15

Generalized Record Linkage System (Statistics Canada) 19

SSA-Name3 and Extensions (Search Software America (SSA» 20

Smart PID (Advanced Linkage Technologies of America (ALTA» 22

Merge/Purge Plus and Code-l (Group 1 Software) 23

Merge/Purge 4.3, Address Correction and Encoding (ACE), and True Name (Postalsoft) 24

CONCLUSIONS 25

RECOJ\tmfENDA TIONS 26

111

REFERENCES 27

APPE~l>IX A: GLOSSARY 30

APPENDIX B: A CHECKLIST FOR EVALUATING RECORD LINKAGE 33SOFTW ARE

IV

SUl\1MARY

Record linkage is a technique for associating records representing the same unit from one or morefiles representing the same population by comparing identifiers. The National AgriculturalStatistics Service (NASS) uses record linkage to minimize duplication of names of farm operatorsor operations during construction and maintenance of its list sampling frame (list frame) offarmers and agricultural operations. NASS's current record linkage solution, which NASS useswith the Real Time Mail Maintenance System (RTMMS), is a set of programs written in COBOLand FORTRAN running on an IBM mainframe computer at Lockheed Martin Corporation. TheEnhanced List Maintenance Operations (ELMO) system is replacing the RTMMS. Like theRTMMS, ELMO will need a record linkage system. Over the years, NASS has becomeconcerned with the high level of resources associated with using its current record linkage system.Also, NASS needs to integrate applications with client/server technology, and to have applicationsthat are portable from platform to platform. The agency also anticipates a need to perform recordlinkage interactively. NASS anticipates that porting its current record linkage system to newplatforms would be difficult and expensive. With the advent of affordable, commerciallyavailable record linkage software, these packages have become the best candidates for the coreof any replacement system. For all of these reasons, NASS decided to explore the use ofcommercially available record linkage software. This report is an evaluation of severalcommercially available packages. A second report, entitled "Record Linkage II: Experience UsingA UTOMATCH for Record Linkage in NASS," detailing NASS's experience with the AUTO-MATCH record linkage software package, is also available. This second report coversimplementation issues which are not covered in this report.

NASS's current record linkage solution was an early application of the Fellegi-Sunter recordlinkage theory. Based on a 1969 JASA paper by Ivan Fellegi and Alan Sunter, this theory hasbecome the most widely accepted record linkage method. The best of the commercially availablepackages use it. The current linkage system uses the New York State Intelligence InformationSystem (NYSIIS) phonetic code to block records into groups for comparison to improveefficiency. The commercially available record linkage systems also use blocking, but allow formore blocking factors and multiple passes. They also allow comparison of alphabetic stringsusing NYSIIS and sophisticated string comparison metrics. Another important task of a recordlinkage system is parsing free-formatted name and address fields into their component parts. Thecurrent system does this using a complicated set of data manipulation and reformatting programs.The commercially available systems offer varying degrees of ability to perform this task. Someof the systems have extremely powerful, flexible systems; others have no standardizationcapability.

At the beginning of this project, many people were involved in developing requirements for theuse of new linkage software in NASS. The new record linkage system must be statisticallydefensible. It must be low in both initial cost and development costs and user friendly for endusers in State Statistical Offices (SSOs) or Headquarters. The software must be well documentedand supported by a reliable vendor. The software must run on the hardware platform and under

v

the operating system chosen for linking applications. In addition, it must be compatible with theSybase database package, and must run in batch and interactive modes. The software needs toallow for flexibility in blocking and in the use of different linking variables. It must be able tomake efficient use of the information in NASS list frame records. The software should allow thechoice of appropriate comparison functions for the different types of matching data available onthe list frame. If the system uses the Fellegi-Sunter method. it must provide utilities forestimating m- and u-probabilities, and for allowing the user to set weight cutoffs. (Terms of art,like m-probability, are defined in the glossary in Appendix A.) The software must also handlefiles at least as large as NASS's largest list frame file. Any software chosen for NASS to useneeds to handle the parsing and standardization of free-formatted name and address information,such as is found in NASS's list sources. Finally, the software must produce any desired reportsor extract files of matches and nonmatches.

Six software packages were chosen for evaluation. Due to budgetary and resource constraints,only AUTOMA TCH was purchased for testing. Reviews of specifications and discussions withdevelopers were considered sufficient for evaluating the remaining packages. AUTOMA TCH isa generalized record linkage system developed by MatchWare Technologies. The package usesthe Fellegi-Sunter record linkage theory and sophisticated string comparison functions in itsmatching process. It is available for all of the platforms considered by NASS. It also comes witha powerful, customizable package for parsing and standardizing names and addresses. TheGeneralized Record Linkage System (GRLS) is a product of Statistics Canada. It also is ageneralized, Fellegi-Sunter based record linkage package. Statistics Canada developed thispackage to be used with ORACLE databases, and GRLS derives much of its power and flexibilityfrom the capabilities of ORACLE. This reliance on ORACLE makes it inappropriate for use witha Sybase database. Smart PID. a product of Advanced Linkage Technologies of America, is aset of PASCAL and C language modules that performs record linkage tasks using an "enhanced"version of the Fellegi-Sunter methodology. These modules are primarily designed for use inhospital Management Information Systems (MIS) environments and do not offer the flexibilityneeded for NASS's purposes. SSA-Name3 and Extensions are similarly a set of softwaremodules. SSA-Name3 is a package for constructing efficient database searching keys based onNYSIIS phonetic codes and frequencies of the codes' occurrence. The Extensions packageprovides tools to score records selected by the SSA-Name3 product based on a scheme of variablecomparisons and weighting constructed by the user to select matches. SSA-Name is not astatistically based matching system, and is expensive. It would, in addition, be expensive to createa system from its components. Both Merge/Purge Plus from Group 1 Software and Merge/Purge4.3 from Postalsoft are nonstatistical matching systems best suited to their intended use ofdeduplicating mailing lists in turn-key direct mail solutions. Figure 3 on page 16 summarizes thereviews of the software packages.

The report recommends that AUTOMA TCH be chosen as the core component of the agency'snext record linkage system. AUTOMATCH is the least expensive of the packages, yet it offersthe most functionality. It will require the fewest resources to meet NASS's specifications, andit is based on the proven Fellegi-Sunter record linkage methodology. No other package has a

VI

name and address standardization capability as powerful as that offered by the AUTOST ANsoftware. It is available on all of the platforms that NASS is considering for record linkage.Finally, it is available in a form that NASS can use to create an interactive record linkagecapability.

This report is not a general comparison of the six record linkage software packages. Therecommendation for AUTOMATCH is based solely on NASS's particular requirements.This report does not endorse one software package over any other for uses outside NASS.

VB

INTRODUCTION

What is record linkage? Record linkage, ormatching, does not have a single definition.In their seminal 1959 paper in Science [1],Newcombe, et al. quote H. L. Dunn and].T. Marshall, saying, "The term record link-age has been used to indicate the bringingtogether of two or more separately recordedpieces of information concerning a particularindividual or family. Defined in this broadmanner, it includes almost any use of a fileof records to determine what has subse-quently happened to people about whom onehas some prior information." Fellegi andSunter, in their 1969 Journal of the Ameri-can Statistical Association (JASA) article[2], state, "The necessity for comparing therecords contained in a file LA with those ina file LB in an effort to determine which pairsof records relate to the same population unitis one which arises in many contexts, most ofwhich can be categorized as either (a) theconstruction or maintenance of a master filefor a population, or (b) merging two files inorder to extend the amount of informationavailable for population units represented inboth files." In its 1980 Report on Exact andStatistical Matching Techniques [3], theSubcommittee on Matching Techniques ofthe Federal Committee on Statistical Method-ology (FCSM) said that, "Although the terms'match, , 'exact match,' and 'statisticalmatch' have been used frequently in theliterature, the Subcommittee knows no gen-erally agreed upon definitions of these terms.For purposes of this report, the Subcommit-tee has defined a match as a linkage of re-cords from two or more files containing unitsfrom the same population. It has defined anexact match as a match in which the linkageof data for the same unit (e. g., person) fromthe different files is sought; linkages for units

that are not the same occur only as the resultof error. Exact matching normally requiresthe use of identifiers, for example, name,address, social security number. "

In this report we will borrow from all thesedefinitions. As used here, "record linkage"means the association of records representingthe same unit from one or more files repre-senting the same population by comparingidentifiers for construction and maintenanceof a master file for a population. Note thatthis definition allows for a linkage betweenrecords in the same file; this special case willbe called unduplication. This definition alsofits the FCSM's definition of an exact match,meaning that the linkage of records that donot represent the same unit in the populationis an error; similarly, the failure to link tworecords that do represent the same unit in thepopulation is also an error. The first type oferror will be called a false link, the secondtype a false nonlink. A glossary of recordlinkage terminology can be found in Appen-dix A.

Uses of Record Linkage in NASSNASS uses record linkage primarily for themaintenance of the list sampling frame (re-ferred to after this as simply the list frame).To make accurate estimates, it is importantthat no unit in the population be representedmore than once on the sampling frame, thatis, that no duplication be present on theframe. If undetected duplication is present,then the probabilities of selection for eachunit in the population will be calculatedincorrectly and estimates produced usingthese probabilities will be biased.

NASS constructs its list frame from sourcelists of agricultural operators and operations.These source lists have various origins;

among the most important are the FarmService Agency (FSA) lists of operators whohave signed up for agricultural price supportprograms, lists of members of agriculturalproducers' associations, lists from StateDepartments of Agriculture, and lists fromveterinarians associations. Operators areoften listed on more than one of these lists, acommon source of duplication on the listframe. NASS also updates the list frame bythe addition of individual records represent-ing operators or operations and by the alter-ation of existing records. These updates canalso result in the creation of two records onthe frame representing the same unit in thepopulation. NASS uses record linkage todetect these duplicate records.

Before NASS adds a new list source to its listframe, it uses record linkage to compare thesource to the list frame and detect any dupli-cation between the two lists. The operator oroperation names not already on the frame arethen added to the frame. To detect duplica-tion created by the alteration of recordsalready on the list frame, NASS uses recordlinkage to compare the list frame to itself (anexample of unduplication). This is doneannually with a simplified program whichmatches records on Social Security Number(SSN), Employer Identification Number(EIN), or telephone number before samplingfrom the frame. States request a completeunduplication using the current recordlinkage system every few years or so.

,4 Brief History of Record Linlwge in NASSNASS first began using lists of agriculturaloperators and operations for probabilitysurveys in the early 1960's; at that time itused different lists for different surveys. Inthe early 1970's, NASS began to developsoftware for a single list frame for each state

2

for use in all probability surveys. By themid-1970's. an effort to develop computersoftware to support that work was underwaywithin NASS. Among the systems developedduring that time was the record linkagesystem [4]. Incorporating the work of adozen people over a period of four years,the record linkage system was based on therecord linkage theory of Ivan Fellegi andAlan Sunter (the Fellegi-Sunter theory,described in a later section of this report).Some 20 years later this is still the mostwidely accepted probabilistic method forlinking records. The record linkage systemconsists of a set of COBOL and FORTRANprograms \vhich runs on an IBM mainframeat Lockheed Martin Corporation under atime-sharing arrangement. NASS built thefirst state list frames in 1977 from sourcelists using this system. The system was thenenhanced, and several states had list framesbuilt in 1979. By 1982, all states had a listframe built using the record linkage system.In 1983, NASS put a new database, the RealTime Mail Maintenance System, in place tohandle the list frame. The record linkagesystem is used now to update this database.

As conceived, the record linkage systemrequired several passes through the data,with manual review steps between. TheReformat 1. Reformat 2, Reformat 3, andData Manipulation stages put the names andaddresses on the list in a standard format,and tested the state, place name, ZIP Code.This standardization process is critical to anyrecord linkage method. It allows the com-puter to "find" all of the parts of a name oran address, so that similar parts may becompared between records. Without such aprocess, it is not possible for the software tocompare the records in any meaningfulfashion. After the user had successfully

standardized the records, an attempt wasmade to perform what was referred to as anidentical match. This procedure did astraightforward comparison of the identifierson each record and, if they were identical,linked the records. This was done to im-prove processing efficiency by eliminatingidentical duplicate records before the remain-der of the processing. Next, the records thatremained after the identical match werelinked using the Record Linkage Subsystem(RECLSS). RECLSS used the Fellegi-Suntermethodology. It employed the NYSIIS pho-netic coding system to compare names and todo blocking (described in a later section ofthis report). The system then made an at-tempt to associate the groups of records thatlinked into larger, related groups. Finally,the results of the linkage were reviewed byState Statistical Office personnel.

This original procedure, with its many stepsand manual review requirements proved toocostly to implement, as too many labor re-sources were required. As it is now used,the system makes a single pass through thedata. Records that have standardizationerrors are discarded. The remainder is runthrough identical match and RECLSS. If thelinkage is being done to add a new list sourceto the list frame, any records that do not linkare added to the list frame as inactives, thatis, records not subject to sampling.Additional information is collected on theserecords and they are made active (subject tosampling), or a decision is made that therecord should be deleted after this informa-tion has been reviewed. If the linkage isbeing done to unduplicate a single file, thenthe results are sent to the State StatisticalOffice as a (usually very long) computerprintout for review. Streamlining the pro-

3

cess in this fashion has led to a higher levelof linkage errors than originally envisioned.

Purpose of This ReportIn the early 1990's NASS decided that it wastime to replace the Real Time Mail Mainte-nance System. The new system, dubbed theEnhanced List Maintenance Operationssystem (ELMO), is to be maintained in aSybase database, using client/server tech-nology. At the time the project documentedin this report was begun, the ELMO Teamhad not decided whether each State StatisticalOffice would have a UNIX server with itsown list frame or if there would be a cen-trally located UNIX server on which all 50states' frames would reside.

The replacement of the RTMMS precipitateddiscussion of the need for replacement of thecurrent record linkage system. The need forportability between platforms, integrationwith client/server technology, and interactivelinkage, combined with a desire for a systemwhich could be used as designed with fewerpersonnel resources, led to the decision toexplore new record linkage solutions. Theprojected cost and difficulty of either devel-oping a new system or porting the existingsystem to new platforms, led to the explora-tion of commercially available record linkagesoftware.

The purpose of the project documented inthis report is to locate, evaluate, and developsoftware to replace NASS' s current recordlinkage system. This report documents thefirst stage of that project, the location andpreliminary evaluation of commerciallyavailable record linkage software for currentand future NASS needs.

A second report, entitled "Record LinkageII: Experience Using A UTOMATCH inNASS," covers the evaluation of AUTO-MATCH's suitability for NASS's recordIinkage tasks, and recommendations foradditional research and development workprior to implementation.

BACKGROU~

Comparison Outcomes and Methods ofComparing Character StringsOne of the key concepts in the Fellegi-Suntertheory is the idea of a comparison outcome.Simply stated, a comparison outcome is theresult of comparing the values for the sameidentifier on two different records. Figure 1summarizes the Fellegi-Sunter theory. InFigure 1, y(firSl nJrne) is an example of a com-parison outcome. Fellegi and Sunter discussvectors of such outcomes, with each elementof the vector corresponding to a differentlinking variable. y n is an example of avector of comparison outcomes, with ele-ments corresponding to first name, surname,address, SSN, phone number, etc. Butbefore we discuss these vectors, it is impor-tant to understand how richly Fellegi andSunter define the idea of a single comparisonoutcome.

At first it might seem that only two outcomesare possible for each variable, that the valueson the two records agree or they disagree,but, in reality, the set of possibilities is muchmore varied. Consider the identifier gender.Only two outcomes seem possible, that thetwo records agree on the value of gender orthat they disagree. But what if gender ismissing on one of the records? A missingvalue does not imply either agreement ordisagreement, but is simply a third possibil-ity.

4

Now, think about a variable like surname.Agreement between Smith on one record andSmith on a second record gives some confi-dence that the records represent the sameindividual in the population, but there maybe many Smiths in a file of names, and suchan agreement may occur readily by chance.On the other hand, agreement on the sur-name Fellegi gives much higher confidencethat the records represent the same individ-ual; it would be unlikely for such an unusualname to occur by chance on both records.

The Fellegi-Sunter method is a probabilisticone. It is concerned with estimating, insome rigorous way, the likelihood that tworecords represent the same individual. If onewishes to use the outcomes of his or hercomparisons 1:0 estimate such a likelihood,he or she must distinguish the outcome"surname is Smith and agrees," from theoutcome "surname is Fellegi and agrees."These have 1:0 be considered distinct out-comes, since they infer different probabilitiesthat the records represent the same unit in thepopulation. So, values of a comparisonoutcome, like y(first namel, look like "name isJohn and it agrees," rather than simply"agrees" or "disagrees."

But the complexities do not end there. Sup-pose one compares alphabetic strings, likenames of people or streets. Do Fellegi andFelegi really disagree, or is one of them aspelling error? A couple of solutions havebeen proposed to this problem. One is toconvert the strings using a phonetic codingscheme (one that converts "sound-alike"strings to the same string) and then comparethe results of the conversion. Two suchphonetic coding schemes are called Soundexand NYSIIS [5]. Using these codes formatching, one converts each name to its

Figure 1. The Fellegi-Sunter Theory

Y n = (y (first name), Y(surname), y (addresS), y (ssn), y (phone number), .•• ), where y (first name) is the result ofcomparing the value of first name on two records, one from List A and one from List B,y (surname) is the result of comparing the surname on those two records, and so forth. y (first name)

is called a comparison outcome and Y n is called an outcome vector.

r = {y 1 , ... ,YN} is the set of all possible vectors of outcomes of comparing two records.In practice, this set contains millions of elements. r is sometimes called the comparisonspace.

Each Y n has two conditional probabilities associated with it:m-probability = P( Y n I the records being compared represent the same population unit)u-probability = P(y n I the records don't represent the same population unit)

To achieve a false link rate of p. and a false nonlink rate of A with a minimum number ofrecord pairs for human review, first, order the y's in descending order of m-probability/u-probability. Then sum u-probabilities, beginning with y] until the sum of the u-probabilities equals p. (at y 1'). Next, beginning with y N' sum the m-probabilities until theytotal A (at y>). This partitions r, and yields the following decision rule:

Y1 (highest m/u)

Y2Y3

-y}J.

YN-3

YN-2

YN-]

YN (lowest m/u)

5

Links ([ u-prob = I-L)(Pairs generating these outcomevectors are linked.)

Possible links(Pairs generating these outcomevectors are reviewed manually.)

Nonlinks ([ m-prob = A)(Pairs generating these outcomevectors are not linked.)

code, and compares the codes to see if theyagree. In this kind of a scheme, agreementor disagreement on each of the differentpossible codes are all considered distinctoutcomes. This is the approach used in thecurrent record linkage system.

A newer approach addresses the question,..How sure are we that two strings are thesame?" Considering Fellegi and Felegi, onemight be nearly certain that they are thesame. The NYSIIS code for both of thesenames is FALAGA. What about Faleggi, orFulleggi? The NYSIIS code for these namesis FALAGA as well, but they seem, in somesense, farther from Fellegi. A string com-parison function assigns a "weight" (not tobe confused with a match weight in theFeIlegi-Sunter theory) to the comparison oftwo strings based on the lengths of thestrings, and the number of insertions, dele-tions, transpositions, and replacementsrequired to turn one string into the other.One can use this "weight" to adjust theprobabilities that two records represent thesame unit based on an agreement. In otherwords, these weights can be used to definepartial agreements. If the two strings areidentical, we use the full agreement probabil-ity, but if the two strings vary, we usesomething less than the full agreementprobability, and the more the strings dis-agree, the more the probability is reduced.This type of character string comparison alsohas the advantage of being insensitive to thestructure of the strings; for example, it is asuseful for Oriental names, where most of thediscriminating power is in the vowels, as it isfor Western European names, where more ofthe discriminating power is in consonants.

To review, a comparison outcome is theresult of comparing the value of an identifier

6

on one record to the value of the same identi-fier on another record. Agreement or dis-agreement on different values representdistinct outcomes, and, for character strings,it is possible to differentiate degrees ofagreement and disagreement into differentoutcomes. This leads to the conclusion that,for anyone identifier, such as y(first name) inFigure 1, there may be many thousands ofdifferent possible outcomes.

The Fellegi-Sunter Record Linkage TheoryIn 1969, Ivan Fellegi and Alan Sunter ofStatistics Canada published a paper in theJournal of the American Statistical Associa-tion (JASA) entitled "A Theory for RecordLinkage." This paper provided the theoreti-cal underpinning for much of NASS's cur-rent record linkage system. The theorycontained in it (closely related to one sug-gested by Benjamin Tepping in a 1968 JASAarticle entitled "A Model for Optimum Link-age of Records" [6]) has become the mostwidely accepted probabilistic linkage meth-od. (Other, information-theoretic approach-es to record linkage have been suggested [7,8, 9], but have not gained popularity.)Several of the software packages evaluated inthis report use the FeIlegi-Sunter method.

FeIlegi and Sunter began by assuming theexistence of two files of computer records.They referred to the files as List A and ListB. They assJmed that both files containedrecords representing units from the samepopulation. The authors developed a theoret-ical approach for using a computer to associ-ate each record in List A with the record orrecords in List B which represent the sameunit in the population. This approach has theproperty that a minimum number of recordpairs have to be reviewed by human beings

to achieve given levels of false link and falsenonlink.

The authors assumed that each record in FileA would be compared to each record in FileB using a set of identifiers (like name, ad-dress, social security number, telephonenumber). For each pair of records, therewould be a comparison outcome for eachidentifier. In Figure 1, yefirSl name)is an exam-ple of a comparison outcome. Taken to-gether, these outcomes would form a vector(an ordered n-tuple) of comparison out-comes. In Figure 1, Yn is an example ofsuch a vector. Next, they considered all ofthe possible outcome vectors; that is, all ofthe possible combinations of outcomes ofcomparing the identifiers on two records.The set r in Figure 1 is an example of sucha set. For each vector, that is, each combi-nation of outcomes (each of the Yn's in Fig-ure 1), they considered two conditionalprobabilities. The first, which they calledthe m-probability, is the probability that thisoutcome vector occurs given that the recordsbeing compared represent the same unit inthe population (that is, the pair of recordsbelongs to the matched set). The second, theu-probability, is the probability that thisoutcome vector occurs given that the recordsbeing compared don't represent the sameunits in the population (that is, the recordpair belongs to the unmatched set).

Fellegi and Sunter then suggested a generalform for a decision rule, which divides theset of outcome vectors into three subsets.The first of these is the linked subset, forwhich the computer makes the inference thatany record pair associated with one of theseoutcome vectors belongs in the matched set(that is, based on the outcomes of the com-parisons of the linking variables, the com-

7

puter infers that the records represent thesame unit in the population). This subsetcorresponds to the subset marked "Links" inthe diagram at the bottom of Figure 1.

The second group is the unlinked subset, forwhich the computer makes the inference thatany record pair associated with one of theseoutcome vectors belongs in the unmatchedset (that is, based on the outcomes of thecomparisons of the linking variables, thecomputer infers the records do not representthe same unit in the population). This subsetcorresponds to the subset marked "Nonlinks"in Figure 1.

For the third subset, which Fellegi andSunter called the possible links, the computermakes no inference; record pairs with theseoutcome vectors are referred for humanreview. This subset corresponds to thesubset marked "Possible links" in Figure 1.

The heart of the Fellegi-Sunter theory is atheorem that states that, for given levels offalse link and false nonlink (designated JJ. andA in Figure 1), the decision rule can bedefined in terms of the ratios of the m- andu-probabilities so that the set of possiblelinks is as small as possible, that is, thesystem automatically makes decisions aboutas many record pairs as possible. (This opti-mality criterion, a minimum size for the setof records to be reviewed by humans, makessense when one considers that the goal ofautomating record linkage is to let the com-puter do as much of the work as it can.)Fellegi and Sunter began by forming theratio of the m-probability to the u-probabilityfor each vector of outcomes (each Yn)' Theythen ordered the set of outcome vectors onthis ratio. To form the linked subset, theybegan at the top (highest mlu ratio--the

outcome vectors most likely to represent atrue match) and, in descending order of mlu,summed the u-probabilities (the probabilitiesthat, in fact, record pairs with these compari-son vectors do not represent the same unit)until they reached the given false link rate.All of the vectors which went into this sum(the vectors from Y1 to y I' ) fall into thelinked subset; record pairs that generate thesevectors when compared are linked.

To form the unlinked subset. they began atthe bottom (lowest mlu ratio--the outcomevectors least likely to represent a true match)and. in ascending order, summed the m-probabilities (the probabilities that, in fact.record pairs with these comparison vectorsdo represent the same unit) until theyreached the given false nonlink rate. All ofthe vectors which went into this sum (thevectors from YA to y]\;) fall into the unlinkedsubset; record pairs that generate these vec-tors when compared are not linked. Every-thing left over (everything in the middle, allof the vectors from YI' -r 1 to YA -I) representedthe possible links. Pairs that generated thesevectors require human intervention to deter-mine their link status.

It may not be intuitively obvious to the read-er that this decision rule results in the mini-mum size set of possible links for the givenlevels of error. However. the goal of thispaper is to present only as much of the the-ory as necessary to discuss its application insoftware packages. The ambitious readerwill find a rigorous proof in Fellegi andSunter's JASA article.

At this point, an attentive reader might stillask, "All of this is fine, but. to get the m-and u-probabilities. don't you already have toknow the answer--that is. to know what

8

records are in the matched and unmatchedsets, so that these probabilities can be calcu-lated? ... And if many variables are used andthey can each have thousands of possibleoutcomes. isn't the set of possible outcomevectors unmanageably large--so large thatestimating the m- and u-probabilities from areasonable sized data set is impossible?"

These are both reasonable criticisms; how-ever. Fellegi and Sunter suggested a methodfor making the theory operational thataddressed these concerns. Fellegi and Sunterbegan with the idea that, for each recordpair, they could reasonably approximate themlu ratio for that pair's outcome vector.

First, the m- and u-probabilities associatedwith each possible comparison outcome canbe estimated For example, m- and u-proba-bilities can be estimated for each possiblevalue of YISufnJrnel ; the value "name is Smithand agrees" might have an u-probability of0.05, while the value "name is Rybczinskiand agrees" might have a u-probability of0.001. But how can these probabilities beestimated without knowing the results?

These estimates can be made as follows.Consider the idea of an m-probability, theprobability that, given that the two recordsrepresent the same unit in the population, thegiven outcome occurs. The only way identi-fiers, like name, address, social securitynumber, phone number, and so forth, wouldnot agree on two records representing thesame population unit is if they were in error.So, the m-probability of the outcome "thevalue of the variable is "x" and agrees" is justl-q, where q is the joint probability of anerror in the value "x" on either record.Therefore, the m-probability for an agree-ment on these records is l-q; likewise, the

m-probability for the outcome "value is "x"and disagrees" is q. One can easily see thatm-probability is really a measure of accu-racy.

Ideally, independent estimates of q will beavailable, but this is usually not the case.Some of the software packages reviewed inthis report have utilities that allow the esti-mation of m-probabilities using iterativemethods, beginning with an approximation ofq made from experience. These methodsusually perform fairly well even when theinitial guess for q is poor.

Now consider the u-probabilities. A u-probability is the probability of getting anoutcome given that the record pair is in theunmatched set. If two records are in theunmatched set, the only way they agree onan identifier is by random chance. So, the u-probability for an agreement is just p, wherep is the proportion of records in the file withthat value for the identifier. The u-probabil-ity for a disagreement is just the complementof that for an agreement, I-p. Since the u-probability varies with the distinctiveness ofa value, it is a measure of discriminatingpower. Fellegi-Sunter systems usually haveutilities for estimating u-probabilities bymaking frequency counts of the files to belinked.

We now have estimates of m- and u-proba-bilities for individual comparison outcomes.How do we get from this to an estimate forthe whole vector? The answer is to make asimplifying assumption, one that is importantto remember when applying the Fellegi-Sunter theory. The identifiers are assumedto be independent. Using this assumption,the mlu probabilities for the individuallinking variables can be multiplied together

9

(since the outcomes are independent events)to get an mlu ratio for the entire vector. Inpractice, these probabilities are converted tologarithms and the logs are added. Theresulting log of the mlu ratio is called alinkage weight.

In practice, rather than ordering the mluratios and setting cutoffs, one assigns cutoffweights to form a decision rule, and assumesrecord pairs with weights above the linkedcutoff are matches. Record pairs withweights below the unlinked cutoff are as-sumed to be unmatched. Weights betweenthe cutoffs are mapped to the possible linkedsubset.

BlockingAnother important linkage concept is block-ing. Theoretically, it would be ideal tocompare all possible record pairs, but,unfortunately, this problem is com-putationally intractable. Blocking is a meth-od for reducing the number of comparisonsto be made by using a highly accurate vari-able with moderate discriminating power todivide the files being compared into smallergroups, or blocks. Records are then com-pared only within blocks that have the samevalue for the variable. For example, onemay use the NYSIIS code of the surname toblock two files. (In addition to its use incomparing character string linkage variables,NYSIIS coding is also used for blocking.Research at NASS in the 1970's confirmedthe superiority of NYSIIS over Soundex forblocking purposes [10].) Records in file Awith a particular NYSIIS code for theirsurnames would then be compared only torecords in file B with that same surnameNYSIIS code. This greatly reduces thenumber of comparisons to be made, with theresult that the linkage problem is then com-

Figure 2.--ResuIts of Standardizing Names in Records A and B

Title:First Name:Surname:Suffix:

Record AMrRobertSmithJr

Record B

RobertSmith

putationally tractable. A simple arithmeticexample will suffice to show why. If twofiles of 10,000 records each are to be linked,the number of possible comparison pairs is10" x let or 100 million pairs. If each filecan be blocked into 100 blocks of 100records each, then the number of compari-sons to be made is 100 x (100 x 100) or onemillion comparisons. If the latter problemtook three hours to solve on a computer, thenthe former would take two weeks.

Standardization of Names and .4ddressesIn the record linkage context. standardizationrefers to the ability to parse a characterstring, like a name or an address, into itscomponent parts, place these parts intoidentifiable fields, and then convert each ofthese parts to standard forms or abbrevia-tions. This is necessarily done before com-paring the records for linkage purposes. Theimportance of this transformation cannot beoverestimated. Usually, several parts composean individual's name or address. These partsmay occur in different orders, be abbreviateddifferently, or be omitted. For example thenames "Smith, Mr. Bob, Jr" and "RobertSmith" may appear in two records (call them Aand B respectively). "Smith, :Mr. Bob, Jr." iscomposed ofa surname "Smith," a title, "Mr.,"an abbreviation of a first name, "Bob," and asuffix, "Jr." "Robert Smith" is composed of an(unabbreviated) first name, "Robert," and asurname "Smith." Without the ability toidentify the different parts of a name and to

10

place standard abbreviations for those partsinto fixed fields, one could only compare thetwo character strings to see if they were thesame Comparing R to S, 0 to m, b to i, etc.would cause the computer to believe that thetwo name fields disagreed. With the ability tostandardize names, the two names wouldappear as shown in the box above. The com-puter could then compare each correspondingpart. It would discover that it had two miss-ing values and two perfect agreements, ratherthan a disagreement on a single long string.

RECORD LINKAGE REQUIREMENTS

To evaluate record linkage software, it wasnecessary first to develop a set of require-ments based on NASS I S specific needs. Inlate 1992 and early 1993 the TechnologyResearch Section drafted a set of require-ments based on their knowledge of recordlinkage, the current NASS record linkagesystem, and the requirements of the ELMOsystem. This document was shared with sixstate list frame statisticians [11], the ListFrame Section, the ELMO Team, and threepeople who had been heavily involved in thedevelopment of NASS's current recordlinkage system, Bill Arends (Director,Statistical Standards Staff), Dick Coulter(Wyoming State Statistician), and Ben Klugh(Arkansas State Statistician). The require-ments were then revised, based on theircomments. Key topics from this require-ments document are discussed below. A

more general framework for comparingrecord linkage software packages, not spe-cific to NASS's needs, is contained inAppendix B.

General RequirementsThis section covers general requirements,including statistical defensibility, cost, andvendor support.

Statistical Defensibility-- When NASS makesan estimate it must be statistically defensible.Whatever definition one adopts for statisticaldefensibility, it must require, at a minimum,that NASS be able to describe what was doneat each stage of the process of forming anestimate. Since using record linkage toeliminate duplication from the samplingframe is a part of ensuring unbiasedness, it isimportant for NASS to be able to describe indetail how each stage of record linkageworks. This makes "black box" or propri-etary linkage systems unsuitable for NASSt spurposes. It also makes "ad hoc" systems,whose statistical properties are not under-stood, unsuitable.

Cost and Simplicity--A record linkage systemthat is too costly, complicated, or difficult touse is of little value, no matter its technicalsophistication. Initial affordability, lowoperational cost, and ease of use are designrequirements for any new linkage system.

Documentation and Training--Any newsoftware should come with complete userdocumentation. In addition, training shouldbe provided by the vendor, at least to a coregroup of users who can train others.

Maintenance and Suppon--Any commer-cially available software incorporated intothe system needs to be well documented and

11

supported by a reliable vendor. Updates tothe software should be available at reason-able cost as they are developed.

Readiness for Use--Resources for additionaldevelopment are scarce. Whatever com-mercial off-the-shelf (COTS) software ischosen, it should be as nearly ready to meetNASS's requirements as possible withoutmuch additional systems development.

Hardware and Software EnvironmentThis section covers the impact of the hard-ware and software environment in whichrecord linkage software will run and themodes in which the software will be used.

Computer Hardware-- The choice of a hard-ware platform for record linkage is depend-ent on the choice of a platform for ELMOand on a decision by the ELMO Team re-garding where record linkage is to run.Ideally the agency should choose recordlinkage software that is available on a varietyof platforms to allow flexibility in implemen-tation strategies. Any new record linkagesoftware should run on a minicomputer,workstation, or desktop computer.

Operating System and Database Software--The ELMO Team has chosen Sybase data-base software running under a UNIX ope rat -ing system. Ideally, any new record linkagesoftware can use embedded Structured QueryLanguage (SQL) code to directly access thedatabase, interface with Sybase, run underUNIX or DOSlWindows. (Note: Sybaseoffers a product called "Open Client" whichconsists of a library of C functions which canbe used by a C program to generate SQL foraccessing the database.)

Real-Time Duplication Checking--A long-term goal of any new software is to be avail-able to link single-record additions or modi-fied existing records to the database to checkfor duplication in real time. The systemmust run fast enough to make such real-timeduplication checking realistic. If software ispurchased from an outside vendor, it mustsupport interactive real time duplicationchecks.

Capability of Running in Batch Moden Therecord linkage system should run in batchmode. This is necessary to accomplishIinkages using large files \vith a reasonableexpenditure of resources.

Record Linkage MethodologyThis section covers methodological require-ments for record linkage software.

Fler:ible Blocking-- The software should allowusers to choose blocking variables. Further,blocking should be done so that the addition-al linkage error due to the blocking can beestimated, to minimize block size withoutincreasing false nonlinks to an unacceptabledegree.

Flexibility in Use of Linking Variables--Somelist sources may be richer than others inpotential linking information. Beyond a coreset of identifiers (which should be the sameas the required set of identifiers for databaseintegrity on the ELMO database) the systemshould have some flexibility as to what fieldsto attempt a match on. Also, the systemshould handle missing linking variables.Ideal Iy, users will choose the treatment theywant for missing variables.

Efficient Use of Available Informatz'on--Anysoftware used by NASS should use all of the

12

identifying variables available. Further, itshould partition the variables to extract themaximum amount of discriminating power.(For example, a name should be partitionedinto all of its parts, as discussed in the sec-tion on standardization above, or a phonenumber should be partitioned into the areacode, exchange, and last four random digits.)Information about resolution of previouslinkage runs on the same file(s) should beavailable. If a record pair is considered tobe a possible link by the record linkagealgorithm, but has previously been deter-mined to be a link or nonlink in earlierresolution work. and no changes have beenmade to either record, then the system shouldact on this information to eliminate the pairautomatically from the manual review group.

Choice of An Appropriate ComparisonFunctz'on--A key element of the record link-age problem is recognizing when two valuesfor equivalent variables that are being com-pared are the s.ame. This may not be trivial,due to problems such as finding equivalentvariables in files from two different sources,recognizing equivalent values (say, a propername and a nickname), or recognizing thatone variable contains the same value asanother variable. but with some sort of errorintroduced (like a transposition or a droppeddigit). A new record linkage system musthave an efficient and flexible way of makingcomparIsons.

Use of Prior Linkage Results in DevelopingAccurate m-Probabilities--Results of priorlinkages of similar files could be used to helpestimate more accurate m-probabilities forsubsequent linkages. New linkage softwareshould have a utility for estimating m-proba-bilities.

Estimation of Linkage Cutoff Weights inFellegi-Sunter Systems--Identifying the cutoffpoints for the linked and non linked decisionis one of the most important tasks in doing aIinkage using the Fellegi-Sunter method.These cutoffs depend on knowledge about theaccuracy of variables and the frequency ofoccu rrence of values of those variables.Current technology does not allow preciseestimation of these cutoffs; they are usuallyestimated by performing a pass of a linkageand examining the results. The pass can thenbe rerun with the cutoffs adjusted as needed.These cutoffs will vary from file to file, andthe cutoff points should be reestimated foreach new pair of files to be linked. (It maybe possible, when several linkages are per-formed with files from the same sources touse a predetermined set of cutoffs withoutlosing too much efficiency.) Any new re-cord linkage software must accomplish thetask of estimating these cutoff points in anefficient, user-friendly fashion.

File Size ConstraintsILimitations-- The soft-ware must be able to handle files of 250,000(or preferably more) NASS list framerecords.

Data HandlingThis section covers issues regarding differenttypes of data that will be used in recordlinkage and the requirements for handlingthose data.

Handling of Names--Many issues must bedealt with when handling names. NASSreceives list sources with names in manyformats. There is the usual split betweensources with names in signature format(Charles Day) and surname on the left (Day,Charles). In addition, there are businessnames and names of partners. Some of these

13

may appear in unusual formats. For exam-ple, for a few states NASS receives listsfrom FSA with the name in the format"Smith/Jones/Williams jt vent." These nameforms are mixed with others in more conven-tional formats. Name standardization soft-ware for use in NASS must be powerful andcustomizable so that unusual name forms arehandled properly. Not the least of these isthe format (and possible variations in theformat) of the name. In addition to format,there are the usual problems of nicknames,multiple names, initials only, titles (bothpresent and omitted), suffixes (such as Jr.),"care of's," abbreviations, and misspellings.Standardization software must overcome allbut the last of these challenges. Misspellingis usually handled either by phonetic codingor by a string comparison function of somekind.

Handling Different Kinds of Operating Ar-rangements--Different types of operatingarrangements require different procedures.Linkage software needs to be able to treatbusiness and partnership names differentlythan individual names, either by using multi-ple passes, or by providing for the use ofdifferent kinds of name comparisons in asingle pass.

Addresses--Addresses present a similar prob-lem to names. Again, the same address maybe presented in different formats. It maycontain abbreviations, building names, roomnumbers, names of intersecting streets, ruralroutes, highway contract routes, box num-bers, apartment numbers, abbreviations,inconsistencies (e.g., Sawyer, VA whenthere is no place named Sawyer in Virginia)and misspellings. This will be complicatedfor the next few years by the planned conver-sion of rural route addresses to locatable

("911 ") addresses. The same considerationswith respect to the power to discriminatebetween different sampling units with similaraddresses while retaining the ability to asso-ciate records representing the same unit thathave dissimilar addresses apply here as withnames.

Verification of Place Names and ZIP Codes--The current record linkage system uses adictionary of place names with a range oflegal 5-digit ZIP Codes to verify place namesand ZIP Codes. Some system for accom-plishing this task needs to be part of the newrecord linkage system. This may be accom-plished using an ELMO utility.

Assignment of Longitude and Latitude-- Thecurrent record linkage subsystem assigns alongitude and latitude to each farm recordbased on the city in the address and thencalculates distance between locations basedon this information for linkage purposes.New linkage software should use such ameasure of geographic "nearness" in thelinking process.

Other Data--When other identifiers (such asSSN. EIN, telephone number, or controldata) are available, a linkage system shoulduse them. Linkage software should be flexi-ble enough to use whatever combination ofcommon identifiers is found on two files.

Post-linkage FunctionsThis section covers the functions needed afterrecord linkage itself has been performed.

ResoLution of Record Linkage Outputn Thecurrent RECLSS generates printed outputand requires batch input of corrections. Anynew record linkage system or modification ofthe RECLSS should provide an online (ide-

14

ally graphical) interface for resolution of"possible" links. The resolution system musthave sufficient functionality to do all of thetasks that are currently handled with theresolution printouts, including recordingcomments, splitting the review betweenmultiple reviewers, and keeping track ofcalls to farmers. An online resolution capa-bility should be available even where thelinkage itself is performed in batch mode.The option of generating paper output shouldbe available.

Generation of Repons and Extract FiLes--Record linkage software should includeutilities for the generation of reports and thecreation of files of linked and unlinked re-cords. Reports can be used to evaluate theresults of changes in linkage specifications,including the application of different block-ing schemes, the employment of differentlinking variables, the use of multiple passes,and the setting of different cutoff weights.The software also needs to be able to extractfiles of linked and unlinked records in aformat that makes them easy to use to updatethe ELMO database. The software shouldprovide a utility for doing this.

AppLy Database Integrity Standards to Nev.'List Sources--Reformatting of new listsources should be separate from the linkagestep. Updating the database with new re-cords or updating existing records will bedone using ELMO utilities that enforce theactive data dictionary in order to ensuredatabase integrity.

Reponing of Statistics on the LinkingProcess--Record linkage software mustreport relevant statistics at each stage of thelinking process. Ideally, the software shouldalso allow estimation of linkage error rates.

REVIEWS OF SOFTWAREPACKAGES

This section presents reviews of the sixsoftware packages. Figure 3 summarizesthese reviews.

AUTOMATCHIAUTOSTAN (Match WareTechnologies)GeneralnAUTOMATCH is a generalizedrecord linkage system, applicable to a varietyof uses. AUTOST AN, its companion pro-gram, is a highly customizable system for thestandardization of name and address informa-tion. The methodology employed in AUTO-MATCH is based on the popular Fellegi-Sunter theory. None of the techniques usedin the system is considered proprietary, thus,the system meets the minimum standards forstatistical defensibility. Costs for AUTO-MATCH vary, from $3,000 for a singlecopy for a DOS-based personal computer to$9,995 for a license for a UNIX server. (Allprices are as of August 1995.) Completedocumentation is provided with AUTO-MATCH, as is a one or two day trainingperiod for one user. One year of mainte-nance and support (including upgrades to anynew versions issued during the year) is in-cluded in the price. Additional years ofsupport are available at a reasonable fee($600 a year for a DOS copy). One concernwith this package is that it is the product of asmall firm that is highly dependent on itsowner. If the agency adopted AUTO-MA TCH as its record linkage solution, asoftware escrow, in which MatchWare agreesto allow NASS access to source code shouldMatchWare cease operations, would need tobe arranged, to protect the agency against thecompany's ceasing operations. It is possible,although unlikely, that the agency might have

15

to do without outside support for the software.

Hardware and Software Environment--Afinal decision on a hardware platform forELMO has not been made yet; a dedicatedminicomputer at the Lockheed Martin facilityin Orlando, Florida, now appears the mostlikely alternative. AUTOMATCH is avail-able to run on a number of hardware andsoftware platforms, including: IBM-compati-ble desktops running MS-DOS, OS/2, andWindows NT; minicomputers and work-stations running UNIX; and custom main-frame computer versions. The software isalso available as Windows dynamic linklibraries (DLLs) and UNIX callable librariesfor construction of interactive systems.Callable libraries for a UNIX server cost$5,000 in addition to the batch price quotedearlier. ELMO will be implemented usingSybase. Sybase Corporation itself hassuccessfully used AUTOMA TCH in aninteractive mode with Sybase. An ASCII filewould need to be extracted to run in a batchmode.

Record Linkage Methodology--AUTO-MATCH is based on the Fellegi-Sunterrecord linkage theory. The system is de-signed for linking in multiple passes, withthe unlinked records from each pass proceed-ing to the next pass. Passes may vary onblocking variables or linking variables.Allowing multiple passes with the samelinking variables, but different blockingvariables, is valuable, since it reduces missedlinkages due to errors in the blocking vari-ables. (For example, if the probability oferror in two blocking variables were inde-pendent, and were 1% for each variable ineach file, then the expected percentage ofmissed linkages from blocking would be

Figure 3.--Comparison of Selected Features of Record Linkage Software Packages

MergelMergel PurgePurge 4.3

AUTO- 5SA- Smart Plus (Postal-Feature MATCH GRLS Name3 PID (Group 1) soft)

Nego-Cost ($US as of 8/95) $9.995 $21.650 $33,000 tiable $10,000 $20,000

-Gener- Gener - Com- Com- Direct Direct

Type of Package alized alized ponents ponents Mail Mail

Runs under UNIX X X X X X X-

1Compatible with Sybase X X X X X

Interactive capability X2 X X X X X

Uses probabilistic link-ing methodology X X X

Choice of methods forcomparing variables X X3 X4

Standardizes names X X X X5 X6

Standardizes addresses X X5 X7

Verifies ZIP Codes X8 X7-

Interactive "possible10 10

link" review X X

X9 ]0 10Generates reports X X X

Extracts files of linkedX9 ]0 10

data X X XIGRLS requires the files to be linked to be in an ORACLE database. 2 AUTOMATCH is capable ofinteractive operation with the purchase of callable libraries at an additional cost of $5,000. 3In addition toa rich library of standard comparison types, custom types can be programmed in C. 4Standard routinesoffer no choice of comparison types, but custom types can be programmed in C. 5Name and addressstandardization requires the List-Conversion package at an additional cost of $7,500. 6Name standardizationrequires the True-N ame package at an additional cost of $20,000. 7Address standardization and ZIP Codeverification require the Code-l package at an additional cost of $20,000. 8ZIP Code verification requiresthe Address Correction and Encoding (ACE) package at an additional cost of $42,000. 9Repon generationand data extraction are accomplished through the use of the ORACLE database package. lCYfhesefunctionswould be accomplished by a system built around the SSA-Name3 or Smart PID modules.

16

reduced from 1.99 % for one pass to0.0396% in the two-pass case.)

AUTOMA TCH can handle numeric identifi-ers and name and address variables. Theuser may use these variables in any com-bination. The user may specify the treatmentof missing values. Since the user defines thevariables as he wishes, numeric identifiers,like phone numbers and SSNs (which havenonrandom components, see Jabine [12]) canbe partitioned and the parts treated as sepa-rate variables. With AUTOSTAN, name andaddress variables can be partitioned into theirseparate parts as well. Information about theresolution of previous passes can be appliedto subsequent passes automatically, so thereis no need for a user to resolve the samepossibly linked or duplicate pair more thanonce during a particular linkage. However,no provision exists for carrying this informa-tion from one linkage to another. (Forexample, it would be quite useful in und-uplicating a file that is checked annually tobe able to apply the results of previouslinkage determinations to pairs of unchangedrecords which appear year after year aspossible links. AUTOMATCH has no suchcapability. )

AUTOMATCH offers many options forcomparing different kinds of variables. Forcomparing name variables, AUTOMATCHoffers both an information-theoretic stringcomparison function and support for the twomost popular phonetic coding systems,Soundex and a modified version of NYSIIS.In addition to a straight character for charac-ter comparison, AUTOMA TCH offers manyspecialized types of comparison. Theseinclude allowing an absolute number ofcharacters (digits) to be different, allowingfor percentage differences, comparing dates

17

and times, and a function for comparingdistances using latitude and longitude.

AUTOMA TCH calculates u-probabilitiesfrom the frequency of occurrence of differentvalues for each linking variable from the filesbeing linked. AUTOMA TCH also has autility for calculating m-probabilities fromcompleted passes. This utility could be usedto estimate starting m-probabilities forsimilar linkages from previous years' results.With this utility, a user can iterativelyestimate m-probabilities from current link-ages. AUTOMATCH also allows the user toproduce a histogram of linkage weights toguide estimation of weight cutoffs. Perhapsmore useful is the report generation feature,which allows custom report specification andspecification of reports including only pairswith weights in a particular range. Examina-tion of these reports is useful in setting cutoffweights. In several test projects, includingone for Texas, which has NASS's largest listframe, no difficulties were encounteredhandling files of any size.

Data Handling--AUTOSTAN, AUTO-MATCH's companion software for name andaddress standardization, is a powerful,customizable system for standardizing thesefields. AUTOSTAN is used by associating adata field, such as primary name, with a"process." Each process consists of threefiles. The first is a "dictionary" whichspecifies the format of the output record; thatis, it specifies the variables into which thefield being standardized will be parsed. Thesecond file, called a "classification" file,consists of a list of words (like surnames ortitles or street types), each of which isassociated with a particular class or type.The third file is a "pattern" file. This fileconsists of patterns of the various classes of

variables in the classification file, along withsome generic classes, and rules for parsingeach pattern into one or more of the outputfields specified in the dictionary file.AUTOST AN comes with some standardprocesses; users can use these processes asthey are, customize them, or create their ownprocesses from scratch. Unfortunately, noprocess that conforms to U. S. Postal Service(USPS) addressing standards is included(although a user could create such a process).Because no such pattern has been certified bythe postal service, there is some questionabout the availability of postal discountsusing AUTOST AN -standardized addresses.

AUTOST AN is capable of handling the typesof name and address problems encountered inNASS. In addition, it is possible to createseparate fields for handling different types ofoperating arrangements. One function whichAUTOST AN does not perform is the verifi-cation of place names and ZIP Codes andaddition of latitudes and longitudes to re-cords. If NASS chooses to use AUTO-MATCH, then the agency must write orpurchase additional software to perform thisfunction. The Group 1 and Postalsoftproducts reviewed later are capable ofperforming this function.

Post-Linkage Functions--AUTOMATCH is a"mixed bag" when it comes to performingpost-linkage functions such as possible linkresolution, report generation, and data fileextraction. Possible link resolution is donewith the clerical review program. Thisprogram allows review of possible links andduplicate pairs. Actions can be taken tomake the possible link pair a link, a nonlink,or leave it for later review, when moreinformation can be obtained. Additionalactions that swap a "duplicate" record for the

18

linked record from the same file and allowthe user to move backward and forwardwithin the file of possible links are available.This utility falls short of NASS I S needs insome critical areas. First, only one user at atime can work on the file of possible links.Second, there is no provision for the creationof hypertext comments in the same way thata person working with a printout could makeremarks on the printout. Third, there is noaccess to the data records on the database.There is no provision for accessing commentfields on the database at the time of resolu-tion. There is also no provision for resolvingsome records for which a decision on theirstatus is easy. and returning later, after moreinformation has been gathered, to resolveothers. Finally, the interface is characterbased and requires some skill in specifyingthe report format to get a useful format forpossible link resolution. (The software forreviewing possible links uses the customreport format. but it displays these data fieldsas three 80-character lines. Care has to betaken to design a format that does not breakfields. The report generation utility uses thisformat as well. which is inconvenient.)

As mentioned above, AUTOMATCH has areport generation function which allows theuser to specify custom report formats. Thesereports are generated as ASCII text files,with printer control characters embedded.One feature not included is the ability to adda line of space between groups of linkedrecords. Besides the report generationfunction, AUTOMA TCH also has a fileextraction function. It is possible to specifyan extract of any combination of variablesfrom either File A or File B for linkedrecords and for unlinked records. Theseextracts are necessary for updating ELMO.An ELMO utility would be used for actually

adding records to the database, or for elimi-nating duplicate records. AUTOMATCHalso can generate a number of different levelsof statistics about the linking process. Theuser may specify as little or as much infor-mation about each pass as he or she wishes[13] .

Generalized Record Linkage System (GRLS)(Statistics Canada)General--GRLS is also a generalized recordlinkage system, applicable to many uses.Statistics Canada developed the GRLS systemusing ORACLE and custom C languageroutines, creating a powerful and highlycustomizable record linkage system. Thispower comes at a price: GRLS requiresORACLE database software and the UNIXoperating system package to run. WhileELMO remains a Sybase database, thislimitation all but prohibits the use of GRLSin NASS, as it would require the purchase oftwo enterprise database packages, and a portof the data from one system to another eachtime record linkage was run. UnlikeAUTOMATCH, which has companion nameand address standardization software, GRLSdoes not offer this capability. Like AUTO-MATCH, GRLS is based on the statisticallydefensible Fellegi-Sunter record linkagemethodology.

As of August, 1995, the cost for a sitelicense for GRLS is Can$30,OOO (aboutUS$21,650 at current exchange rates). Thisprice includes installation, technical support,and one week of training. This price repre-sents a significant increase over earlierprices, and Statistics Canada suggests thatmost of this increase represents funding forproduct improvements and for better techni-cal support. Statistics Canada is committedto GRLS as its official, in-house record

19

linkage solution, so there is little need forconcern that the product will be unsupportedin the future.

Hardware and Software Environment--Asmentioned earlier, users of GRLS are limitedto UNIX workstation and minicomputerenvironments. In addition, ORACLE and aC compiler must be present, and the files tobe linked must be in an ORACLE database.Interactive linkage is a natural mode ofoperation for GRLS. Linkages can, ofcourse, also be performed in batch mode.

Record Linkage Methodology--GRLS is aFellegi-Sunter system. It performs multiplepass linkages. Like AUTOMA TCH, passesmay vary on blocking variables or linkingvariables. GRLS can handle both name andaddress and numeric identifiers. It is possi-ble to specify the weighting treatment ofmissing variables. The user can partitionvariables in any way needed. For example,whether a phone nwnber is treated as a singlenwnber or whether area code, exchange, andrandom digits are treated as different vari-ables depends upon how the user definesthem in his or her ORACLE database.GRLS assumes that the name and addressvariables have been partitioned beforeloading them into the database. GRLS hasno independent facility for parsing names andaddresses. Possible matches are resolved atthe end of the linkage process rather thanbetween passes. Unlike AUTOMATCH,GRLS uses the capabilities of ORACLE tomark record pairs so that the system candetect and automatically resolve possiblematches which have been resolved in aprevious linkage or unduplication effort.This eliminates the need to resolve the samepairs time after time. Also, GRLS allows theuser to write the linkage rules in a more

English-like format than AUTOMATCH,which uses a parameter file approach. BothGRLS and AUTOMATCH allow the user toinclude blocking variables as matchingvariables so that weights properly reflectagreement on common or rare values of theblocking variable.

Many different comparison options fordifferent kinds of variables are available inGRLS. These include an improved versionof the string comparison metric used inAUTOMA TCH (by setting options, thisfunction can duplicate AUTOMATCH'scomparison function). In addition, customcomparison functions can be programmed bythe user in C and linked to the GRLS system.This is a powerful capability, as it allows theuser to build any special knowledge he or shemay have about a particular variable into thecomparison function used for it. Also, itallows the sophisticated user to do any kindof comparison; he or she is not limited to thecomparison types contained in the package.

GRLS uses an Excel spreadsheet to estimateu-probabilities from the relative frequenciesof the values of the fields. M-probabilitieswould have to be estimated iteratively froma sample of the linked pairs. Like AUTO-MATCH, GRLS can produce a histogram oflinkage weights for linked and unlinkedpairs; however, Statistics Canada recom-mends using the iterative method for settingcutoff weights. Since GRLS has all of thereport generation capabilities of ORACLE atits command, it is easy to create reports forpurposes of evaluating cutoff weights.GRLS can handle files of any size thatORACLE can handle.

Data Handling--GRLS requires that data becontained in an ORACLE database. GRLS

20

does not have a powerful name and addressparsing capability built in. The assumptionis that the variables are parsed into theircomponent parts when they are entered intothe database. Some standardization may beaccomplished using the Automated Codingby Text Recognition (ACTR) facility. Thissystem allows the user to define standardcodes or abbreviations for descriptive phrasesor words. The lack of powerful name andaddress parsing is a serious limitation forNASS: many of NASS's lists have free-formatted name and address fields whichrequire parsing. Other than this limitation,GRLS can perform any sort of manipulationon the data that can be done with ORACLE.

Post-Linkage Functions--GRLS shines in itsimpressive set of post-linkage capabilities.Again, the pO\ver of the ORACLE databasepackage give~ GRLS its advantage. GRLScan make any kind of extract or create anykind of report that can be created usingORACLE, giving the users virtually unlim-ited options. On top of this, the softwareincludes a Graphical User Interface (GUI)based utility for resolving possible matches.Multiple users can resolve records at onetime, and comments from the database can beaccessed at the time of possible link resolu-tion [14].

SSA-Name3 and Extensions (Search Soft-ware America (SSA))Unlike AUTO MATCH and GRLS, which aregeneralized systems, ready to performlinkages with a minimum of installation andsetup time, SSA-Name3 and Extensions aresets of software components for building acustom record linkage system. This is adifferent paradigm, and assumes that thedevelopers and users of a record linkagesystem have special needs, not met by the

generalized programs, which justify theexpenditure of development resources.Companies with large databases, such asFederal Express, Visa, GEICO, and AT&T,use SSA-Name3 to generate keys for efficientdatabase searches. The Extensions productincludes routines for "scoring" agreements toperform matches. Costs for SSA-Name3vary; as of August, 1995, a license for asmall UNIX box with 45 concurrent users is$24,000 for SSA-Name3 and $9,000 for theExtensions product. The price for SSA-Name3 for a high end UNIX system is$46, 000. The purchase price includesdocumentation and one year of unlimitedtelephone technical support. Search SoftwareAmerica is a subsidiary of SPL WorId GroupSoftware, which has 550 employees world-wide.

Hardware and Software Environment--SSAwill create a version of SSA-Name3 to runon any platform. A developer can createeither a batch system or an interactive systemfrom the software modules. Depending onthe type of system constructed, the linkagecould be performed directly against thedatabase.

Record Linkage Methodology--SSA-Name3does not use the Fellegi-Sunter recordlinkage methodology. In effect, SSA treatsthe record linkage problem like a databasesearch problem. The software uses com-pressed, fixed-length 5-byte keys, based onan enhanced NYSIIS name coding system, toefficiently locate potential matches. Aftercleaning, reformatting, and standardizing thename, SSA-Name3 computes statistics for thefrequency of occurrence of standardizedwords. Based on the frequency of thesewords, SSA-Name3 builds keys. Each keycan contain information from up to four

21

words in the name. It then isolates the mostproductive key for search purposes. It thenestablishes "search sets and ranges" based onthis key. Multiple keys are used to allowfor differing word sequences in the name.Using these keys, a system can be built thatestablishes candidate sets for linkage.

The Extensions package provides comparisontools that allow the members of the candidatesets to be "scored" on a scale of 0 to 100 todetermine whether two records fall into thematch, suspect match, or no match catego-ries. The developer can decide on cutoffs forthese categories or allow a user to set thecutoffs. The scores are calculated accordingto user-defined schemes for combining theresults of comparisons between identifierssuch as names, addresses, phone numbers,and other numeric identifiers. The use of thecomparison tools does not provide a statisti-cal basis for comparing the records. Vari-abIes may be partitioned as needed. Thesystem is not inherently a multiple passsystem. Any capability for the system to useprevious linkages' information in resolvingcurrent possible links would have to be builtinto the system surrounding the SSA-Name3and Extensions modules.

Data Handling--SSA-Name3 accepts up to256 characters in a free-formatted name fieldas input. SSA parses this name into at mosteight words made up from a limited characterset. The character table can be customized.The name is then formatted by processing itusing a rule base of up to 64,000 rules whichcan be applied to improve recognition ofcommon variations by "[removing] noisewords, concatenat[ing] or attach[ing] prefixesand suffixes, replac[ing] abbreviations andnicknames, translat[ing] logical equivalents,etc. [15]." The rules can also be customized.

SSA does not provide address standardiza-tion.

Post -Linkage Functions--SSA does notprovide routines for doing post-linkagefunctions. These are the developer's respon-sibility. This allows maximum flexibility,but also means that development resourceswill be required to build the post-linkagefunctions into the system [16].

Smart PID (Advanced LinkageTechnologies of America (ALTA»ALTA is owned and operated by JerryWeber and Max Arellano. The companyfocuses on providing record linkage functionsto be integrated into hospital managementinformation system (MIS) applications. LikeSSA-Name3, Smart PID is a collection ofsoftware modules for the construction ofrecord linkage systems. These modules arewrinen in Pascal and C, and, while originallydesigned for use in an IBM MVS environ-ment. have been successfully recompiled torun on other machines, including UNIXmachines. Visual Basic has been usedsuccessfully in a Windows environment tocreate systems around the Smart PID mod-ules. The methodology employed in SmartPID is based on the Fellegi-Sunter recordlinkage theory. Costs for Smart PID arenegotiable, and are based on the number ofrecords in the system. The costs are usuallyspread over several years, but a single up-front payment can be negotiated (based onthe present-value of the yearly costs).Documentation is provided with Smart PID,as are a couple of days of Max Arellano'stime to help with installation and training.The purchase price includes telephonesupport for development and use of thesoftware. The same concem that exists withMatchWare exists with this company. It is a

22

very small firm, and a software escrowwould need to be negotiated to protect NASSagainst the firm's ceasing operations.

Hardware and Software Environrnent-- TheSmart PID modules can be recompiled to runin many environments. The software hassuccessfully been used with a Sybase data-base. One drawback of this software forNASS purposes is that ALTA designed it foruse almost exclusively in an interactivemode. While batch mode operation may bepossible, it i5 not how the software is de-signed to be used.

Record Linkage Methodology--Smart PID isbased on the Fellegi-Sunter record linkagetheory, along with some enhancements thatimprove performance in the medical recordlook-up setting. The system is designed fora minimum of user intervention. This can bevaluable in a setting where the user is un-likely to have expertise either in recordlinkage or in the use of Smart PID. Thesystem attempts a match using up to fourblocking strategies. The first is blocking onSSN (ideally, this should yield uniquerecords); the second is blocking on theSoundex code of the first name and thebirthday; the third is blocking on NYSIIScode of the last name. (The third scheme isequivalent to the NASS RECLSS.) There isalso the capability for a system developer tospecify a fourth, custom blocking strategy.

As a part of the strategy of allowing the userlimited options. Smart PID allows the use ofthe following linking variables: Name,Address, Telephone Number, SSN/EIN, andbirth date. It is possible to add other identifi-ers. The user also chooses the matched,unmatched, and possibly matched cutoffweights. Smart PID can parse and standard-

ize a name field using look-up tables that theuser can customize. Address "standardiza-tion" is rudimentary, consisting of removinga substring from the beginning of the addressfield. Smart PID differs from the otherFellegi-Sunter systems discussed in thisreport in the use of two additional codes(besides the Fellegi-Sunter weight) to decideif a record pair is a match. These are thevalidity code, used to decide whether specialcircumstances exist which might invalidate ahigh Fellegi-Sunter weight, and the confir-matory code, used to help decide whether alow-weight possible match should be consid-ered a match. For example, ALTA has usedthe validity code in cases where an individ-ual has used his or her spouse's SSN. Amatch on an accurate, unique identifier suchas SSN, combined with an identical address,surname, and phone number would normallyyield a high Fellegi-Sunter weight, resultingin a match; however, use of the first name togenerate a validity code would catch thisfalse match. ALTA uses the confirmatorycode in cases of sparse information, wherethe Fellegi-Sunter weight might be too low tofind a match because of a large number ofmissing variables. Address, SSN, and phonenumber are generally used in forming thiscode. Both near-identical records that arenot matches and sparse records can causeproblems in NASS's files. Any provisionfor using information from previous linkagesto resolve a current linkage would have to bemade using the system in which the SmartPID modules were embedded. Smart PIDdoes not offer optional methods for compar-ing identifiers. Smart PID calculates u-probabilities using frequency counts and m-probabilities based on empirical estimates ofthe accuracy of the variables. Any limita-tions on file size would come from theembedding system, not from the Smart PID

23

modules. (Two hundred fifty thousandrecords is not a large file for Smart PID.)

Data Handling--Smart PID can standardizeand parse name fields based on a customiz-able look-up table. No real address standard-ization is available. A concern with theSmart PID system is the limited number ofvariables that it uses. Also of concern is itsreliance on birth date, which is not availablein NASS's list frame files, and on SSN,which is not available for some records.

Post-Linkage Functions--Possible links aredisplayed in order by their Fellegi-Sunterweights, along with their validity and confir-matory codes. The system that is to bedeveloped around the Smart PID moduleshandles resolution of possible links andcreation of extract files and reports. Usershave successfully used nonprogrammabledatabases, like dBase, to download theresults of Smart PID comparisons andproduce reports [17].

Merge/Purge Plus and Code-l (Group 1Software)The Merge/Purge Plus and Code-l productsfrom Group 1 Software represent yet anothertype of record linkage software. Rather thanbeing components from which a recordlinkage system is built or generalized recordlinkage packages, these packages have beendeveloped for the specific purpose of un-duplicating mailing lists and standardizingaddresses for direct mail marketing andsubscription organizations. Group 1 claimsan installed base of over 3,000 users. As ofAugust, 1995, on the GSA schedule,Merge/Purge Plus costs $10,000 for a singleUNIX box (any number of users); its com-panion product for standardization forlinkage purposes, List-Conversion, costs

$7,500. Code-I, the postal name and ad-dress standardization software costs $21,000.These prices include: installation; training atone of Group l' s eight training centers (oneis in Washington, D.C.); maintenance; and,for Code-I, a new copy of its database ofdeliverable U. S. addresses every threemonths for two years. The price of thesoftware includes unlimited telephone hotlinesupport.

Hardware and Software Environment--Merge/Purge Plus runs in many UNIXenvironments, including Sun Sparc, HewlettPackard, NCR, AT&T, and IBM AIX. It isalso available in a Windows NT version.The software can operate either as an on-linesystem or in a batch mode.

Record Linkage Methodology--Merge/PurgePlus is not a Fellegi-Sunter system. It usesa simple approach, comparing the linkingvariables specified by the user using acharacter-for-character approach. Loose,tight, or medium agreements can be specifiedby the user. The system then uses counts ofagreements and disagreements and weightsprovided by the user to compute an overallweight. Priorities can also be set so thatrecords can match on a single variable orcombination of a few variables. The systemmakes a single, unblocked pass through thedata, relying on sorting the actual data file(or, at least, a file of match variables and anidentifier) to aid efficiency.

Merge/Purge Plus can use a combination ofname, address, and numeric variables formatching. List-Conversion parses andstandardizes names and addresses and pro-duces a file similar to that produced byAUTOST AN, with the original data ap-pended to the standardized name and address

24

data. There is no provision for using infor-mation from prior linkages in new linkages.There are no options for different methods ofcomparing variables.

Data Handling--List-Conversion parses andstandardizes free-formatted name and addresslines for matching purposes. The Code-lproduct reformats, corrects, and enhancesaddress information. It performs this func-tion bv matching the address against a- ~ ~database of all the deliverable addresses inthe United States. This is a reformattedversion of the USPS file. The system cannotreformat an address it cannot find in thedatabase [18]. The software meets USPScertification standards, and, unlike thepackages reviewed above, performs a verifi-cation of ZIP Code and place names.Merge/Purge Plus does not directly accessdatabase products.

Post-Linkage FunctionsnMerge/Purge Pluscan extract many different types of files,including any files NASS would need. Inaddition, many different kinds of linkagereports are available. Since Merge/PurgePlus does not produce possible matches, nopossible match resolution software is needed.It is possihle to list record pairs with acombined weight (overall weight for allvariables) above a certain level [19].

Merge/Purge 4.3, Address Correction andEncoding (ACE), and True Name (Postal-soft)Like Merge/Purge Plus, Merge/Purge 4.3has been developed for the purpose ofunduplicating mailing lists and standardizingaddresses for direct mail marketing. Postal-soft has an insr.alledbase of about 500 copiesof Merge/Purge 4.3. The company hasapproximately 160 employees. As of Au-

gust, 1995, Merge/Purge 4.3 costs approxi-mately $20,000-$25,000 for installation on aUNIX system with lists about the size ofNASS's list frame (for the whole UnitedStates). Its companion software for addressstandardization purposes, ACE, costs$42,000, and Postalsoft's name standardiza-tion software, True Name, costs $20,000.These prices include documentation and 90days of technical support and updates. Oneyear of extended support and updated ver-sions costs 16.5 % of the purchase price ofthe product. Postalsoft generally produces 2-3 updates per year. Updated databases ofdeliverable addresses for ACE cost $4,500-$6,000 per year.

Hardware and Software Environment--Merge/Purge 4.3 runs in DOS, UNIX, andIBM mainframe operating system environ-ments. The software operates in a batchmode and exists as callable libraries forbuilding interactive systems. The packagecan access Sybase directly.

Record Linkage Methodology--Merge/Purge4.3 is not a Fellegi-Sunter system. Thesoftware compares as many as 35 variableson a pair of records. For each variable theoutcome of the comparison is a score be-tween 0 and 100. The user can set the cutofflevel that will be considered an agreement.Similarly, for the overall agreement score(also on a scale of 0-1(0) the user can set theminimum score for a pair to be considered aduplicate. Tools are available to output theraw scores. There is no statistical approachtaken to determining the likelihood of amatch.

Merge/Purge 4.3 uses a combination ofname, address, and numeric variables. ACEand True Name can parse address and name

25

information for matching. There is noprovision for using information from priorlinkages in new linkages. There are nooptional routines for different methods ofcomparing variables included with thesoftware, but Merge/Purge 4.3 Customallows the use of user-programmed (in C)custom comparison routines.

Data Handling--ACE and True Name parsefree-formatted name and address lines. ACEalso attempts to correct addresses and addZIP Codes besides putting addresses in USPSformat. ACE uses a database of deliverableaddress information to which it matches theincoming address and corrects and supple-ments the address if necessary. As is truefor Code-I, this product cannot parse anaddress it does not have on its database. Onthe other hand, it can probably flag undeliv-erable or nonexistent addresses. The soft-ware is USPS CASS certified. The packagehas been used successfully with a Sybasedatabase and in interactive mode.

Post-Linkage Functions--Merge/Purge 4.3can extract many different kinds of files andproduce many reports. The extraction andreport generation features are customizable.Since Merge/Purge 4.3 does not producepossible matches, no resolution software isneeded [20].

CONCLUSIONS

The packages reviewed in this report fall intothree categories. The fIrst, including AUTO-MATCH, and GRLS, contains generalizedrecord linkage packages. These packages areready to perform linkages as they come fromthe manufacturer. They are both powerfulpackages that give a systems developer oruser a flexible tool, allowing him or her

maximum flexibility in specifying howIinkages will be performed and how theresults will be displayed. As a generalizedsystem, they require the least additionalsoftware development resources to meetNASS's needs. Both employ the Fellegi-Sunter record linkage theory.

The next category contains packages termed"components" in this report. This categorycontains SSA-Name3 and Smart PID. Bothpackages require significant additionaldevelopment resources to build a systemaround the linking subroutines provided bythe manufacturer. In addition. both areaimed at specific target markets differentfrom government statistical agencies. SmartPID uses the Fellegi-Sunter method, alongwith some enhancements. There does notappear to be statistical justification (otherthan the justification for using NYSIIS inname searching) for the methods used in the"scoring" routines of SSA-Name3' s Exten-sions package.

The final category is that of mailing listmanagement software. Merge/Purge Plusand Merge/Purge 4.0 fall into this category.Both packages provide the advantage ofneeding little additional development tocreate a usable system, but neither has astatistical foundation for its matching meth-ods.

RECOMMENDA TIONS

NASS should adopt AUTOMATCH andAUTOST AN as its new record linkagesolution and should commit the resources toadding the functionality needed to meet all ofNASS's requirements. AUTOMATCH is theleast expensive of the packages. yet it offersthe most functionality for NASS applications.

26

It will require the fewest resources to meetNASS's specifications. It is based on theproven Fellegi -Sunter record linkage method-ology. No other package has a combinedname and address standardization capabilityfor record linkage purposes as powerful asthat offered by the AUTOSTAN software. Itis available on all of the platforms that NASSis considering for record linkage. Finally, itis available in a form that NASS can use inlater versions of ELMO to create an interac-tive record linkage capability.

Two alternatives exist to the adoption ofAUTOMA TCH. The first is to develop in-house, either from scratch or by conversionof mainframe legacy code, NASS's owncompletely proprietary record linkage sys-tem. This is an expensive and time-consum-ing alternative which would only be justifiedby a belief that such a system would signifi-cantly outperform AUTOMATCH. There isno evidence that an in-house system woulddo so. AUTOMA TCH has compared favor-ably to the current mainframe record linkagesystem in empirical tests [21]. While somemethodological improvements to the currentsystem might be possible, the return from thelarge investment of resources would likelynot be great. Indeed, most of the method-ological improvements would involve incor-poration of methods available in AUTO-MATCH or GRLS.

The second alternative is to choose anotherpackage that is commercially available as thebasis for a new system. The most likelyalternative package is Statistics Canada'sGeneralized Record Linkage System. It is acomplete generalized solution for linkage,which offers a great deal of power andflexibility because of its close tie to theORACLE database package. Alas, it is this

very strength that is GRLS' s undoing forNASS. NASS's ELMO is a Sybase data-base, and GRLS simply does not work withSybase. In addition, if GRLS were chosen ,a name and address standardization packagewould still be needed, since GRLS has nostandardization capability.

Smart PID is another Fellegi-Sunter alterna-tive, but it would require more developmentresources to build a working system to meetNASS's needs than AUTOMATCH. Inaddition, its linkage approach is limited withrespect to choice of variables, and dependson variables NASS does not have on its listframe files. Finally, it is more expensive.

SSA-Name3 and Extensions is also expensiveand does not offer the assurance of theFellegi-Sunter methodology. Again, itwould cost more to build into a workingsystem that met NASS's needs than AUTO-MATCH, with no assurance that the finalsystem would be as effective.

Both Merge/Purge Plus from Group 1 andMerge/Purge 4.3 from Postal soft are notsuitable to NASS because they rely on ad hocmethodologies. They are also expensive.They are best suited to their intended purposeof serving direct mail clients who do nothave the expertise in record linkage tech-niques to effectively use more sophisticatedpackages, and need a package tied into amailing list management system.

Taken together, these considerations all pointto AUTOMATCH as the most appropriatechoice as the core component for NASS'snext record linkage solution.

This recommendation is based on NASS'sexisting and planned applications and is

27

not in any manner a general recommenda-tion on record linkage software outside ofNASS.

REFERENCES

[I] Newcombe, H. B., Kennedy, J. M.,Axford, S. 1., and James, A. P."Automatic Linkage of Vital Rec-ords," Science, Vol. 130, pp. 954-959.

[2] Fellegi, Ivan P. and Sunter, Alan B.,"A Theory for Record Linkage,"Journal of the American StatisticalAssociation, Vol. 64, pp. 1183-1210.

[3] Federal Committee on StatisticalMethodology, Statistical PolicyWorldng Paper Number 5: Report onExact and Statistical Matching Tech-niques, U.S. Department of Com-merce, 1980, p. 1.

[4] As used in this report to refer tocurrent practice in NASS, the termrecord linkage system refers not onlyto the Record Linkage Subsystem(RECLSS) but also to the attendantreformatting, data manipulation,identical match, and resolution pro-grams necessary to complete the taskof linking two sets of records andproducing a single unduplicated file.For a description of this system, seeCoulter, Richard W., and Mer-gerson, James W., "An Adaptationof a Record Linkage Theory in Con-structing a List Sampling Frame,"Computing Science and Statistics:Proceedings of the Tenth AnnualSymposiwn on the Inteiface , National

Technical Information Service Spe-cial Publication Number 503, NTIS,1978, pp. 416-420.

[5] A good description of the NYSIISand Soundex phonetic coding systemscan be found in Appendix H of New-combe, Howard B., Handbook ofRecord Linkage, pp. 181-187, Ox-ford University Press. 1988.

[6] Tepping, Benjamin J .... A Model forOptimum Linkage of Records," Jour-nal of the American Statistical Asso-ciation, Vol. 63, pp. 1321-1332.

[7] Yu, C. T., Lam, K.. and Salton, G.,"Term Weighting In InformationRetrieval Using the Term PrecisionModel," Journal of the Associationfor Computing Machine!}.:, Vol. 29,pp. 152-170.

[8] Cooper, W. S. and Maron, M. E.,"Foundations of Probabilistic andUtility Theoretic Indexing," JACM,Vol. 25, pp. 67-80.

[9] Van Rijsbergen, C. J .. Harper, D.1., and Porter, M. F., "The Selectionof Good Search Terms," /nfonnationProcessing & Management, Vol. 17,pp. 77-91.

[10] Arends, William, "Selection of aSurname Coding Procedure for theSRS Record Linkage System," Sam-ple Survey Research Branch, Re-search Division, Statistical ReportingService (NASS), USDA, 1977.

[11] The state list frame statisticians whoparticipated were Joe Ross (WA),

28

Jerry Ramirez (VA), Dale Turnip-seed (TX), Charles Ruckman (KS),Aubrey Bordelon (FL), and RogerOn (Il.).

[12] Jabine. Thomas B., "Properties ofthe Social Security Number Relevantto Its Use in Record Linkage," Re-cord Linkage Techniques--1985, pp.213-225.

[13] This review of the AUTOMA TCHand AUTOST AN software packagesis based on the author's two year's ofextensive experience with the pack-ages, along with the experience ofKara Broadbent in the Ohio ResearchUnit.

[14] The evaluation of the GeneralizedRecord Linkage System is based onreviews of documentation and con-versations with Ted Hill and MarthaFair of Statistics Canada. The prod-uct reviewed will not be availableuntil December 1995.

[15] Personal communication from GeoffHollmvay of Search Software Amer-Ica.

[16] The review of SSA-Name3 and Ex-tensions is based on a review of doc-umentation, letters exchanged withGeoff Holloway of SSA, and conver-sations with Suyen Lyn of SSA.

[17] The review of Smart PID is based onan early DOS demo, review of docu-mentation, and conversations withJerry Weber of ALTA.

[18] Empirical tests at the Census Bureauindicated that the Code-I productperformed less well than their in-house address parsing software andAUTOSTAN. Note, however, thatthe latter two systems do not attemptaddress correction or ZIP coding, animportant part of Code-I 's capabili-ties for direct mailers.

[19] The review of Merge/Purge Plus andCode-I is based on review of docu-

29

mentation and conversations withTyrone Singh of Group 1.

[20] The review of Merge/Purge 4.0 andACE is based on review of documen-tation and conversations with EdSugg of Postal soft.

[21] Day, Charles, "Record Linkage II:Experience with AUTOMA TCH inNASS," Survey Technology Branch,National Agricultural Statistics Ser-vice, USDA, forthcoming.

APPENDIX A--GLOSSARY

The following is a short glossary of recordlinkage and related terminology used in thisreport. The first occurrence in the text ofeach word in the glossary is italicized.

Batch ModenA mode of operation in whichthe computer operates without input from theuser.

Blockingn The division of one or more fileswhich are to be linked into groups, orblocks, which agree in value for a set ofvariables (the blocking variables) with theintent that only record pairs within blockswith the same values will be compared.Blocking is done to reduce the number ofcomparisons to be made to a computationallytractable number.

Comparison Outcome--The result of compar-ing the values of the same linking variable oneach of the records in a record pair. Forexample. if the first name on each of therecords in a pair is John, then "first name isJohn and agrees" would be the outcome ofcomparing the two records on first name.

Duplication-- The presence in a single file ofmultiple records representing the same unit inthe population.

False link-- The association of two recordswhich do not represent the same unit in thepopulation. Sometimes loosely called anonmatch.

False nonlink-- The failure to associate tworecords which do, in fact, represent the sameunit in the population.

30

Fellegi-Sunter Theory--A theoretical ap-proach to the problem of probabilistic recordlinkage. Put forward in a 1969 JASA articleby Ivan Fellegi and Alan Sunter, the theoryshowed how to obtain the smallest number ofrecords to review manually (possible links)in order to achieve given levels of false linkand false nonlink using the probabilities ofobtaining various comparison outcomes giventhat the records being compared do or do notrepresent the same unit in the population.

Identifters-- Words or codes, such as names,addresses. phone numbers, or Social Securitynumbers which can be used to discriminatebetween units in a population.

Interactive ModenA mode of operation inwhich the user actively participates, in realtime, by entering data and responding toprompts from the system.

LinkuA link is said to occur if two recordsare associated by the record linkage methodbeing used. The inference is that theserecords represent the same unit in the popu-lation, although the inference may be false inthe event of a linkage error. Except whenspeaking strictly in the terminology used inthe Fellegi-Sunter theory, link is a synonymfor match.

Linkage Weight--In the Fellegi-Sunter the-ory, the log on the base two of the ratio ofthe m- and u-probabilities. Also used torefer to the sllm of these logs for all of thevariables involved in a particular linkage.

Linking Vanables--Identifiers which are usedin the process of linking records.

List Frame--In NASS, a list of farms andagricultural operations from which samplesare chosen for NASS's estimation programs.

m-probability-- The probability of an outcomeor vector of outcomes occurring given thatthe pair of records being compared belongsin the matched set.

MatchnA comparison pair whose recordsrepresent the same unit in the population.Except when speaking strictly in the termi-nology used in the Fellegi-Sunter theory, linkis a synonym for match.

Matched set-- The set of pairs of recordswhich represent the same unit in the popula-tion; that is, the set of records which shouldbe matched.

NonlinknA nonlink is said to occur if tworecords are not associated by the recordlinkage method being used. The inference isthat these records do not represent the sameunit in the population, although the inferencemay be false in the event of a linkage error.Except when speaking strictly in the termi-nology of the Fellegi-Sunter theory, nonlinkis a synonym for nonmatch.

Nonmatch--A comparison pair whose recordsdo not represent the same unit in the popula-tion. Except when speaking strictly in theterminology used in the Fellegi-Sunter the-ory, nonlink is a synonym for nonmatch.

NYSIIS Code--A phonetic coding system usedfor overcoming spelling errors in alphabeticidentifiers. (NYSIIS stands for New YorkState Intelligence Information System.)

Parsingn The process of dividing a name oraddress into its component parts.

31

Possible LinknA possible link is said tooccur if no positive decision on the linkstatus of a record pair can be made by therecord linkage method being used.

Record LinkagenThe association of recordsrepresenting the same unit from one or morefiles representing the same population bycomparing identifiers.

Record Pair--Two records. one from each oftwo files being linked.

SoundexnA phonetic coding system used forovercoming spelling errors in alphabeticidentifiers.

Standardizationn The process of substitutingstandard abbreviations for words in a nameor address and of putting the name or addressin a standard format. Sometimes also usedto include the process of parsing the name oraddress.

Statistical DefensibilitynNASS policy states,"An estimate ... is statistically defensiblewhen ... the estimate is the product of awell-documented estimation strategy that isbased on reasonable and clearly articulatedassumptions. " It further states that, " ...estimation strategy includes ... determiningthe frame or frames," and "Where possible,unnecessary assumptions should be avoided. "In a record linkage context, this means thatthe methodology used should be clear andshould be grounded in statistical theoryrather than ad hoc assumptions.

String Comparison Function--An arithmeticfunction that produces an agreement weightfor the comparison between two strings ofalphabetic characters based on the differencesand similarities between the strings.

u-probability-- The probability of an outcomeor vector of outcomes occurring given thatthe pair of records being compared belongsin the unmatched set.

32

Unmatched setn The set of pairs of recordswhich do not represent the same unit in thepopulation: that is, the set of records whichshould not he matched.

APPENDIX B--A CHECKLIST 1.5 Can the software run under theFOR EV ALVA TING following operating systems:

RECORD LINKAGE SOFTWARE MS/PC DOS?

General OS/2?

1.1 Is the software a generalized systemWindows 3. 1/95?

or specific to a given application? Windows NT?

1.2 Is the software a:UNIX?

Complete system, ready to performVMS?

linkages "out of the box?" Mac OS?

Set of components, requiring that a Novell NetWare?system be built around them? If so, Mainframe OS (e.g. IBM MVS)?how complete are the components?

Part of a larger system for perform- 1.6 For PC based systems, what level ofing integrated mailing list functions? processor is required? How much

memory? How much hard drive1.3 What types of linkages does the space?

software support?

Unduplication (one file linked to1.7 Can the system perform linkages

interactively (in real time)? Can ititself)? operate in batch mode?Linking two files?

Simultaneously linking multiple1.8 How fast is the software on the

user's hardware and files the size offiles? the user I s files? If the software isLinking one or more files to a interactive, is its performance ade-reference file (multiple-pass systems quate?only)?

1.9 If the software is to be used as part1.4 Can the software be used on the of a statistical estimation system,

following computers: are the methods used in the software

Mainframes?statistically defensible?

Mini-computer? 1.10 Is the cost of developing a system

Workstation?for the intended purposes using thesoftware within the available bud-

IBM -compati ble microcomputer? get?

Macintosh?

33

1.11 Is the vendor reliable'? Can the ven-dor provide adequate technical sup-port? Will they continue to existfor the projected life of the soft-ware? If this is in question, is asoftware escrow available? Is theuser prepared to support the soft-ware him/herself?

1.12 How well is the software doc-umented? Can a new user reason-ably be expected to sit down withthe manual and begin using the soft-ware, or will training be necessary?Does the vendor provide training?At what cost?

1. 13 What features does the vendor planto add in the near future (e. g., inthe next version)?

1. 14 Is there a user group? Who else isusing the software? Whae featureswould they like to see added? Havethey developed any custom solutions(e. g., front ends, comparison func-tions) they would be willing toshare?

Linkage Methodology

2.3

2.4

2.5

Does (he software require any pa-rameter files? If so, is there a util-ity provided for generating thesefiles? How effectively does it auto-mate tje process? Can the utility becustomized?

Does the user specify the linkingvariables and types of comparisons?

What kinds of comparison functionsare available for different types ofvariables? Do the methods giveproportional weights (that is, allowdegrees of agreement)?

Character -for -character?

Phonetic code comparison (Soundexor NYSIIS variant)?

Information theoretic stringcomparison function?

Specialized numeric comparisons?

Distance comparisons?

Time/Date comparisons?

Ad hoc methods (e.g., allowing oneor more characters different be-tween strings)?

2.1

2.2

What record linkage method is thesoftware based on?

Fellegi -Sunter?

Information-Theoretic methods?

How much control does the userhave over the linkage process? Isthe system a "black box," or can theuser set parameters to control thelinkage process?

34

2.6

2.7

Can the user specify critical vari-ables that must agree for a link totake place?

How does the system handle miss-ing values for linkagevariables?

Computes a weight like any othervalue?

Uses a median between agreementand disagreement weights?

Uses a zero weight? any utilities for determining weight

Allows user the option to specify cutoffs?

treatment? 3.3 Does the software allow linkage

2.8 Does the system allow array-valuedweights to be fixed by the user?

variables (e.g., multiple values for What about weights for missing val-

phone number)? How do array-values?

ued comparisons work? What is the Data Managementmaximum number of values in anarray? 4.1 In what file formats can the

2.9 What is the maximum number of software use data?

linking variables? Flat file?

2.10 How does the software block re- SAS Dataset?

cords? Do users set blocking vari- Database? If yes, what kind ofabies? Can a pass be blocked on database?more than one variable? Dbase?

2.11 Does the software support multiple Fox Pro?linkage passes with different block-ing and different linkage variables? Xbase?

2.12 Does the software contain or sup-Informix?

port routines for estimating linkage Sybase?

errors? ORACLE?

2.13 Does the matching algorithm use Other database package?

techniques that take advantage ofdependence between variables? 4.2 What is the maximum file size

FeIlegi-Sunter Systems(number of records) that the soft-ware can handle?

3.1 How does the system determine m-and u-probabilities? Can the user

4.3 How does the software manage re-

set m- and u-probabilities? Doescords? Does it use temporary data

the software provide utilities to setfiles or sorted files? Does it use

m- and u-probabilities.pointers?

3.2 How does the system determine4.4 Can the user specify subsets of the

weight cutoffs? Are they set by thedata files to be linked?

user? Does the software provide

35

.+.5

4.6

Does the software provide for "testmatches," of a few hundred recordsto test the specifications?

Does the software provide a utilityfor viewing and manipulating datarecords?

5.5

unlinked records? Can the userspecify the format of such extracts?

Does the software generate statisticsfor evaluating the linkage process?Can the user customize the statisticsgenerated by the system?

Post-linkage Functions Standardization

5.1

5.2

5.3

5.4

Does the software provide a utilityfor review of possible links? If so,what kind of functionality is pro-vided for? What kind of interfacedoes the utility use. character-basedor GUI? Does the utility allow forreview between passes. or only atthe end of the process? Can morethan one person work on the recordreview simultaneously? Can rec-ords be "put aside" for later review?Is there any provision for addingcomments to the reviewed recordpairs in the form of hypertext?

Does the software provide for re-sults of earlier linkages (particularlyreviews of possible links) to be ap-plied to the current linkage process?

Does the software provide a utilityfor generating reports on the linked,unlinked. duplicate. and possiblelink records? Can the report formatbe customized? Is the reportviewed in character mode, or is thereport review done in a graphicalenvironment? Can the report beprinted? If so, what kind of printeris required?

Does the software provide a utilityfor extracting files of linked and

36

6.1

6.2

6.3

6.4

6.5

6.6

6.7

Does the software provide a meansof standardizing (parsing out thepieces of) name and address fields?

Does the software allow for parti-tioning of variables to maximize theuse of the information contained inthese variables (for example, parti-tioning a phone number into areacode, exchange, and the last fourrandom digits)?

Can name and address standardiza-tion be customized? Can differentprocesses be used on different files?

Does address standardization meetU.S. Postal Service standards?

Does standardization change theoriginal data fields, or does it ap-pend standardized fields to the orig-inal data record?

How well do the standardizationroutines work on the types of namesthe user wishes to link?

Ho\', well do the standardizationroutines work on the addresses theuser will encounter? (E.g., howwell does it handle rural addresses?Foreign addresses?)

Documents

RecordLinkage I: Evaluation ofCommercially Available Record … · 2017. 2. 24. · the Fellegi-Sunter record linkage theory and sophisticated string comparison functions in its matching