
American Journal of Epidemiology
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved

Vol. 145, No. 4
Printed in U.S.A.

Constructing Reproductive Histories by Linking Vital Records

Melissa M. Adams,1 Hoyt G. Wilson,1 Dale L. Casto,1 Cynthia J. Berg,1 Jeanne M. McDermott,1 James A. Gaudino,1,2,3 and Brian J. McCarthy1

Certificates of 1,449,287 live births and fetal deaths filed in Georgia from 1980 through 1992 were linked to create chronologies that, excluding induced abortions and ectopic pregnancies, constituted the reproductive experience of individual women. The authors initially used a deterministic method (whereby linking rules were not based on probability theory) to link as many records as possible, knowing that some of the linkages would be incorrect. They subsequently used a probabilistic method (whereby evaluation of linkages was developed from probability theory) to evaluate each linkage, and they broke those that were judged to be incorrect. Of the 1.4 million records, 38% did not link to another record. From the remaining records, 369,686 chains of two or more events were constructed. The longest chain included 12 events. Of the chains, 69% included two events; 22% included three events. Longer chains tended to have lower scores for probable validity. The probability-based evaluation of chains affected 3.0% of the records that had been in chains at the end of the deterministic linkage. A greater percentage of records in longer chains were affected by the evaluation. Unfortunately, the small subset of records that were the most difficult to link tended to overrepresent groups with the greatest risk of adverse pregnancy outcomes. Researchers contemplating a similar linkage can anticipate that, for the majority of records, linkage can be accomplished with a relatively straightforward, deterministic approach. Am J Epidemiol 1997;145:339-48.

birth certificates; epidemiologic methods; fetal death; medical record linkage; reproductive history

Received for publication December 29, 1995, and accepted for publication September 27, 1996.

From the World Health Organization Collaborating Center Institutions: Emory University Regional Perinatal Center, Centers for Disease Control and Prevention, and Division of Public Health, State of Georgia.

1 World Health Organization Collaborating Center in Perinatal Care and Health Services Research in Maternal Child Health, Division of Reproductive Health, Atlanta, GA.

2 Office of Perinatal Epidemiology, Epidemiology and Prevention Branch, Division of Public Health, Department of Human Resources, State of Georgia, Atlanta, GA.

3 Epidemic Intelligence Service, Division of Field Epidemiology, Epidemiology Program Office, Centers for Disease Control and Prevention, Atlanta, GA.

Reprint requests to Dr. Melissa M. Adams, World Health Organization Collaborating Center in Perinatal Care and Health Services Research in Maternal Child Health, Division of Reproductive Health, MS K-23, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, Public Health Service, US Department of Health and Human Services, 4770 Buford Highway NE, Atlanta, GA 30341-3724.

Population-based data on women's reproductive experiences over successive pregnancies have not been available in the United States until recently. Cross-sectional studies have provided most of our current understanding of the relations of outcomes across pregnancies. For example, analyses of cross-sectional data have led researchers to conclude that perinatal mortality increases with each subsequent birth (1). When a woman's total number of births is taken into account, however, a different pattern emerges: The risk of perinatal mortality decreases with each subsequent birth (1). Thus, cross-sectional studies can lead to false conclusions. In contrast, analyses of longitudinal data (i.e., data on an individual woman's successive pregnancies) offer many advantages: They can better elucidate patterns across pregnancies, and they are not biased by selective fertility (2, 3). Finally, longitudinal data are important for evaluating the success of public health interventions, such as progress toward reducing the rate of repeated cesarean section.

Compiling longitudinal data in the United States has been difficult because of the absence of unique identifiers that facilitate linking records. This report describes the methods used to link birth and fetal death certificates filed in Georgia from 1980 through 1992 and gives an overview of the results.

MATERIALS AND METHODS

We linked live birth and fetal death certificates into chronological chains of events that, excluding induced abortions and ectopic pregnancies, constituted the reproductive experience of individual women. Each baby or fetus corresponded to a reproductive event that was represented by a record in the computer file. Thus a twin birth, for example, corresponded to two events.


Induced abortions were excluded because the vital records for these events lacked personally identifying information (hereafter referred to as personal identifiers). Certificates for ectopic pregnancies were not reliably recorded. We use the term link to mean designation of two records as belonging to the same mother. A chain is a set of two or more records that have been linked. If a record is linked to any record in an existing chain, that record becomes a member of the chain and is considered to be linked to all other records in the chain. If two records belonging to different chains are linked, the two chains are combined into a single chain. A record cannot be a member of more than one chain.
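The chain rules above (a link joins a record to a whole chain, linking records in two chains merges the chains, and no record belongs to more than one chain) behave like a disjoint-set structure. The following Python sketch is only an illustration of those rules; the class name `ChainIndex` and the record identifiers are hypothetical, and it is not the authors' SAS implementation, which relied on sorting and sequential passes.

```python
class ChainIndex:
    """Disjoint-set (union-find) over record ids; each set of 2+ records is a chain."""

    def __init__(self):
        self.parent = {}

    def find(self, record):
        # A record never seen before starts as its own unlinked singleton.
        self.parent.setdefault(record, record)
        root = record
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every record on the walk directly at the root.
        while self.parent[record] != root:
            self.parent[record], record = root, self.parent[record]
        return root

    def link(self, a, b):
        # Linking two records merges whatever chains they currently belong to.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def chains(self):
        # Group records by root; unlinked singletons are not chains.
        groups = {}
        for record in list(self.parent):
            groups.setdefault(self.find(record), set()).add(record)
        return [members for members in groups.values() if len(members) >= 2]


# Hypothetical record ids: two separate chains are merged by a third link.
index = ChainIndex()
index.link("rec1", "rec2")
index.link("rec3", "rec4")
index.link("rec2", "rec3")
merged = [sorted(chain) for chain in index.chains()]
```

Because membership is tracked through a single root per record, a record can never end up in two chains at once, which mirrors the last rule in the paragraph.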

Theoretically, each person in the United States has a unique identifier: his or her Social Security number. As described below, we found that Social Security numbers alone were inadequate for complete linkage. Instead, we used a combination of variables, including maternal date of birth, first name, maiden name, and Social Security number.

The linkage process entailed two multistep procedures. First, we used a deterministic method to link as many records as possible, knowing that some of the linkages would be incorrect. We then used a probabilistic method to evaluate each linkage, and we broke linkages that were judged to be incorrect. The initial record-linking method employed linking rules based on our judgment and intuition. It was deterministic in that the linking rules were not developed or justified on the basis of probability theory. In contrast, the method used for breaking linkages was probabilistic in that the rules for delinking were based on probability theory.

We apply the terms match and matching both to pairs of records and to pairs of values of a variable contained in two different records. If a pair of records matches, it means that the two records (events) correspond to the same mother. When we say that the values of a variable in two different records match, we mean that the two values are enough alike that we conclude that the correct values are identical. In other words, the values would be identical if they had been reported, recorded, and keyed correctly. For example, one of the variables used in the deterministic linking process was mother's first name; the two values "Sally" and "Sallie" were considered a match and were awarded the same score as if the two values had been identical. We use the term exact match to signify identical pairs of values. All processing of records consisted of sorting and sequential processing using SAS programs (4).

Data available for linking

The database included vital records for 104,102 fetal deaths and 1,345,185 live births from 1980 through 1992 that occurred in Georgia, regardless of maternal state of residence, or that occurred in other states to Georgia residents and for which certificates were sent to Georgia. Georgia law requires the filing of a fetal death certificate for all spontaneous terminations of pregnancy not resulting in a live birth, regardless of the length of gestation before the termination occurs. We used 27 variables to create and evaluate the linkages (table 1).

Deterministic linkage

The deterministic linkage consisted of phase I, which entailed six processing steps during which chains were formed and individual (previously unlinked) records were added to chains. Next followed phase II, which entailed multiple passes through the file to combine chains belonging to the same mother.

TABLE 1. Percentage of records with data for personal identifiers by record type, Georgia, 1980-1992

Variable by type                        Fetal deaths (%)   Live births (%)
Maternal
  Maiden name                                 87                100
  First name                                 100                100
  Initial of middle name                      67                 89
  Date of birth                               99                 99
  Social Security no.                          0*                94
  Race                                        98                 99
  Zip code of residence                        0*               100
  Years of education                          57                 99
  State of birth                               0*                96
  Year of most recent
    Live birth                                85                 99
    Fetal death                               84                 97
  Month of most recent
    Live birth                                78                 99
    Fetal death                               66                 67
  Previous live births, <2,500 g
    Now living                                95†                95†
    Now dead                                  95†                95†
  Previous live births, ≥2,500 g
    Now living                                99†                99†
    Now dead                                  94†                94†
  Previous live births, now living            99*               100*
  Previous live births, now dead              99*               100*
  Previous fetal deaths, <20 weeks            91†                96†
  Previous fetal deaths, ≥20 weeks            90†                95†
Paternal
  Surname                                     65                 84
  First name                                  65                 84
  Initial of middle name                      62                 75
  Date of birth                               30*                83
Infant
  Surname                                      0*               100
  Date of birth                              100                100

• Data not recorded for 1980-1992.
† Data not recorded for 1989-1992.
* Data not recorded for 1980-1988.

At each step in phase I, the file was first sorted on one or more key variables; then all pairs of records having the same value of the sort key were compared and considered for linking. For example, in the first step, all pairs of records having the same mother's date of birth (month, day, and year) were compared and considered for linking. This process required sufficient computer memory (real or virtual) to allow all records having the same sort key to be held in memory at one time. If the data had been complete and error free, all matching pairs of records could have been identified based on a single key variable. Because some data were missing or inaccurate, however, we repeated the process for multiple key variables to identify as many matching pairs of records as possible. Appendix 1 lists the key variables used in the six processing steps.
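The sort-then-compare step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the field names (`id`, `mother_dob`) are hypothetical, and skipping records whose key is missing is an assumption, since the text does not say how missing keys were handled at this stage.

```python
from itertools import combinations, groupby


def candidate_pairs(records, key_fields):
    """Yield every pair of records that shares the same (non-missing) sort key."""

    def key_of(record):
        return tuple(record.get(field) for field in key_fields)

    # Assumption: records with a missing key value are left out of this pass.
    usable = [record for record in records if None not in key_of(record)]
    usable.sort(key=key_of)
    # groupby only groups adjacent items, so the sort above is required.
    for _, block in groupby(usable, key=key_of):
        # All records with the same key are held in memory together.
        yield from combinations(list(block), 2)


# Hypothetical records; the first step in the text keys on mother's date of birth.
records = [
    {"id": 1, "mother_dob": "1960-03-14"},
    {"id": 2, "mother_dob": "1958-11-02"},
    {"id": 3, "mother_dob": "1960-03-14"},
    {"id": 4, "mother_dob": None},
    {"id": 5, "mother_dob": "1960-03-14"},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, ["mother_dob"])]
```

Repeating the call with a different `key_fields` list corresponds to one further processing step; a pair missed under one key (because of a keying error in that field) can still be compared under another.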

In each step, record matches were determined by comparing pairs of records on 10 variables (table 2) and computing a deterministic score for each pair. A match on any of the variables added one point to the deterministic score; no points were given when data were missing. Points were subtracted for nonmatches on some variables (table 2). Criteria for declaring a match are listed in table 2. Mother's last name was not used in the scoring because it was not recorded on birth certificates before 1989 and thus was not available for the bulk of our records.

TABLE 2. Points assigned for comparisons between two records and linkage criteria of 1980-1992 Georgia live births and fetal death records

                                                    Points assigned
Type of variable compared
between two records                         Match   Disagree   Data missing
Numeric
  Maternal date of birth*                    +1       -1           0
  Paternal date of birth                     +1        0           0
  Date of most recent event (month/year)*    +1        0           0
  Maternal Social Security no.*              +1       -1           0
Name
  Maiden name*                               +1       -1           0
  Maternal first name*                       +1       -1           0
  Paternal surname†                          +1        0           0
  Paternal first name                        +1        0           0
  Child's surname†                           +1        0           0
  Child's surname and paternal surname
    on most recent event                     +1        0           0

* Linkage criteria include matching this variable as one of at least two maternal variables (to avoid linkage on paternal information only) and one of the following: 1) a deterministic score of 4 or greater, or 2) a deterministic score of 3 and matching on two name variables and one numeric variable.

† Only 1 point can be given for matches between child's and father's surnames, regardless of whether the match derived from a comparison of father's surname on record A and child's surname on record B or vice versa.
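The scoring and linkage criteria of table 2 can be illustrated with a short Python sketch. It takes the per-variable comparison outcomes as given and only applies the points and the decision rule. The variable names are hypothetical, and two readings are assumptions rather than statements from the text: that "at least two maternal variables" means at least two of the starred variables match, and that the score-of-3 rule requires at least two name matches and at least one numeric match.

```python
# variable: (group, starred in table 2, points when the values disagree)
POINTS = {
    "maternal_dob": ("numeric", True, -1),
    "paternal_dob": ("numeric", False, 0),
    "recent_event_date": ("numeric", True, 0),
    "maternal_ssn": ("numeric", True, -1),
    "maiden_name": ("name", True, -1),
    "maternal_first_name": ("name", True, -1),
    "paternal_surname": ("name", False, 0),
    "paternal_first_name": ("name", False, 0),
    "child_surname": ("name", False, 0),
    "child_vs_paternal_surname": ("name", False, 0),
}


def deterministic_decision(outcomes):
    """outcomes maps variable -> 'match', 'disagree', or 'missing'."""
    score = 0
    matched = []
    for variable, outcome in outcomes.items():
        _, _, disagree_points = POINTS[variable]
        if outcome == "match":
            score += 1
            matched.append(variable)
        elif outcome == "disagree":
            score += disagree_points
        # 'missing' adds nothing to the score.
    starred = sum(POINTS[v][1] for v in matched)
    names = sum(POINTS[v][0] == "name" for v in matched)
    numerics = sum(POINTS[v][0] == "numeric" for v in matched)
    # Assumed reading of the table 2 footnote (see lead-in).
    linked = starred >= 2 and (
        score >= 4 or (score == 3 and names >= 2 and numerics >= 1)
    )
    return score, linked


maternal_pair = deterministic_decision({
    "maternal_dob": "match",
    "maiden_name": "match",
    "maternal_first_name": "match",
    "maternal_ssn": "missing",
    "paternal_surname": "disagree",
})
paternal_only_pair = deterministic_decision({
    "paternal_dob": "match",
    "paternal_surname": "match",
    "paternal_first_name": "match",
    "child_surname": "match",
})
```

The second pair reaches a score of 4 but is not linked, which is the situation the starred-variable requirement is meant to prevent.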

To allow for spelling and keying errors, we did not require exact matches on first names, surnames, or Social Security numbers. For first names, we required that only the first three letters match exactly. For surnames, we applied an algorithm that took into account the length of the name and the number of matching letters. For Social Security numbers, we permitted as many as two transpositions or incorrect digits.
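Two of these tolerance rules are specific enough to sketch in Python. The function names are hypothetical, and the way discrepancies are counted for Social Security numbers (a left-to-right scan in which an adjacent swap counts once and any other differing digit counts once) is an assumed reading of "two transpositions or incorrect digits". The surname algorithm is not described in enough detail here to reproduce, so it is omitted.

```python
def first_names_match(a, b):
    """First names match when their first three letters agree exactly."""
    a, b = a.strip().lower(), b.strip().lower()
    return a != "" and a[:3] == b[:3]


def ssn_discrepancies(a, b):
    """Count adjacent transpositions and incorrect digits (greedy scan)."""
    if len(a) != len(b):
        raise ValueError("expects digit strings of equal length")
    count, i = 0, 0
    while i < len(a):
        if a[i] == b[i]:
            i += 1
        elif i + 1 < len(a) and a[i] == b[i + 1] and a[i + 1] == b[i]:
            count += 1  # two neighboring digits were swapped
            i += 2
        else:
            count += 1  # a single incorrect digit
            i += 1
    return count


def ssns_match(a, b, tolerance=2):
    return ssn_discrepancies(a, b) <= tolerance


# Made-up digit strings, not real Social Security numbers.
names_ok = first_names_match("Sally", "Sallie")
swap_ok = ssns_match("123456789", "123456798")
too_different = ssns_match("123456789", "103450780")
```

A tolerance of this kind trades some false matches for fewer missed ones, which fits the stated strategy of linking generously first and breaking doubtful links later.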

Appendix 1 includes a description of rules that constrained the linkage at each step in phase I. These rules were designed to avoid the formation of separate chains corresponding to the same mother (chain fragments). In spite of these measures, the method resulted in some chain fragments, so that after the six steps were completed, additional passes were made through the file (phase II) to consolidate any fragments. Because of the sequential nature of our processing and the potentially complex linkage patterns required for combining multiple chains, the phase II process of combining chains was a nontrivial task, requiring multiple passes through the file.

Probabilistic evaluation of linked records

To identify linkages that were likely to be incorrect, we evaluated each potential chain using a probabilistic approach, wherein linkage (or delinkage) decisions were made according to rules based on probability theory. The methods were based on the general approach widely ascribed to Fellegi and Sunter (5) and refined by many other authors (6-8). Our process consisted of the following eight elements: 1) establishment of the probability-based linkage scoring methods; 2) selection of variables and definition of outcome sets for record-record comparisons on all variables; 3) estimates of required probabilities from the database; 4) computation of scores for all record pairs in each of the chains that had been formed by the deterministic linkage; 5) from the record pair scores, computation of a quality index (called the weakest path score) for each chain; 6) manual review of randomly selected chains to establish the cutoff score for breaking chains; 7) application of the cutoff rule to all chains, breaking those that fall below the cutoff; and 8) manual review of the few remaining chains having marginal scores, breaking those that appear to be incorrectly linked. Although these elements are listed in logical sequence, there was considerable overlap and iteration, especially in steps 2-4, as we worked through the process of defining variables and assessing their discriminating power. These three steps required regular visual review of distributions of values of the variables, outcomes of record-record comparisons on the variables, and linked records. Step 6 also required a visual review of many chains. The visual review of chains in step 8 required, in comparison, only a moderate amount of labor.

Figure 1 lists the 18 maternal, paternal, and other variables used in computing the probability-based scores for pairs of linked records. For each of the variables, X_i, i = 1, 2, ..., 18, the linkage score component, r_i, was based on the outcome, x_i, of comparing X_i values in two records. For some variables, such as maternal middle initial, we observed only whether the two fields matched exactly; thus the corresponding x_i could take only the values "match" and "nonmatch." For other variables, the outcome set was more complex, as described below.

The 18 scores, r_i, computed from the results of the comparisons of the 18 variables, were added to obtain the composite, probability-based score r for each record pair. This scheme is intuitively appealing in that the score r_i for a variable is positive if the corresponding fields match, and it is negative if they do not match. Details of the computations used in probabilistic evaluation of linkages are given in Appendix 2. Adding the 18 variable scores r_i, which are logarithms of probability ratios, to obtain an overall score r is appropriate as long as the 18 outcome variables are independent. This assumption of independence among the X_i appeared generally to be a reasonable working assumption, with a few notable exceptions, discussed below.

Fetal Death and Birth Certificate Variables

Maternal
1. Maiden name
2. First name
3. Middle initial
4. Date of birth
5. Years of education
6. State of birth
7. Race
8. Zip code of residence
9. Social Security number

Paternal
10. First name
11. Last name
12. Middle initial
13. Date of birth

Other
14. Date of most recent live birth (later record) versus event date of earlier record
15. Date of most recent fetal death (later record) versus event date of earlier record
16. Number of previous live births
17. Number of previous fetal deaths
18. Event date (elapsed time between the two events; used as an indicator of biologic plausibility).

FIGURE 1. Variables used in probabilistic evaluation of vital record linkages, Georgia, 1980-1992.

The outcome sets (values of x_i) reflected multiple degrees of matching for the following variables: maternal Social Security number and years of education; dates of most recent live birth and fetal death; numbers of previous live births and fetal deaths; and event dates (to measure the biologic plausibility of the interval between two events). The purpose was to allow for comparisons that, although not perfect matches, suggested that the records belonged to the same mother. For example, when "mother's years of education" was compared for two records, the following outcomes were considered: Years of education matched exactly, the chronologically later record had 1 more year of education than the earlier record, the later record had 2 more years of education than the earlier record, and so forth. As expected, scores (r_i values) were higher for closer matches.

The outcome sets for the dates of the most recent live birth and fetal death and the numbers of previous live births and fetal deaths (figure 1, variables 14-17) were determined as follows. We first checked for whether the date (month and year) of the most recent live birth recorded on the chronologically later record was the same as the date of the earlier live birth. If the dates did not match exactly, we checked whether the date on the later record was after the earlier live birth, suggesting a failure to link a live birth that occurred between the two records. Thinking that data for the most recent live birth could have been inaccurately recorded in the field for most recent fetal death, we also checked to see whether the date for the most recent fetal death recorded on the chronologically later record was the same as the date of the earlier live birth. We repeated these steps for the date of the most recent fetal death, checking the date of the most recent fetal death for possible incorrect recording in the field for the most recent live birth. Finally, we separately checked fetal deaths and live births for consistency between the number of previous events of each type recorded on the certificate and the number of preceding records in the chain.

We used the interval between two events (figure 1, variable 18) as an indicator of biological plausibility of a record match, judging that very short intervals were unlikely. The outcome set consisted of the following three categories: 1) less than 140 days between an event (live birth or fetal death) and a subsequent live birth, 2) less than 55 days between an event and a subsequent fetal death, and 3) any other interevent interval. For outcomes 1 and 2, we required further that the event pair under consideration not be twins. Specific categories used for other variables are available from the authors.
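The three interval categories can be expressed as a small Python function. This is a sketch: the argument names and the `"live_birth"` / `"fetal_death"` labels are hypothetical, and the `twins` flag is passed in by the caller because the text does not say how twin pairs were identified.

```python
from datetime import date


def interval_outcome(earlier_date, later_date, later_type, twins=False):
    """Return outcome category 1, 2, or 3 for the interval between two events."""
    days = (later_date - earlier_date).days
    if not twins:
        if later_type == "live_birth" and days < 140:
            return 1  # implausibly short interval before a live birth
        if later_type == "fetal_death" and days < 55:
            return 2  # implausibly short interval before a fetal death
    return 3  # any other interevent interval


# Hypothetical event dates: 90 days apart, then 31 days apart.
short_before_birth = interval_outcome(date(1985, 1, 1), date(1985, 4, 1), "live_birth")
same_gap_fetal_death = interval_outcome(date(1985, 1, 1), date(1985, 4, 1), "fetal_death")
short_before_fetal_death = interval_outcome(date(1985, 1, 1), date(1985, 2, 1), "fetal_death")
twin_pair = interval_outcome(date(1985, 1, 1), date(1985, 1, 1), "live_birth", twins=True)
```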

The scoring method accounted for the specific values of the following variables: maternal first name and maiden name, date of birth, zip code of residence, race, and state of birth; and paternal date of birth, first name, and last name. For each of these variables, the set of outcomes, x_i, reflected not only the degree of match but also the particular values for those situations when fields matched exactly. Thus greater weight was given to matches on less common values. For example, two records with the maiden name of "Adams," which is relatively common in Georgia, had a lower score than two records with the name "Gaudino," which is much less common. Likewise, two records with the mother's state of birth as Georgia had a lower score than two records with the mother's state of birth as Delaware.

In preparation for the scoring of record pairs, we had to estimate from the database probabilities from which matching scores, r_i, were computed. For each possible comparison outcome value, x_i, for each variable X_i, we estimated the following two conditional probabilities: 1) P(x_i|m), the probability of observing outcome x_i given that the record pair is a match (i.e., the records correspond to the same mother); and 2) P(x_i|m'), the probability of observing outcome x_i given that the record pair does not match. (See Appendix 2 for additional notational definitions and development of the probabilistic scoring method.) For those variables for which the comparison outcome did not reflect the specific value of the variable (such as Social Security number), we estimated probabilities by taking advantage of the completed deterministic linkage. From existing chains, we used pairs of records that met stringent matching criteria as a set of "true" matches to estimate values of P(x_i|m). Similarly, we used a large pool of record pairs that were clearly nonmatches to estimate values of P(x_i|m').

For those variables (such as names) for which outcome (and thus P(x_i|m')) varied according to the particular value of the variable, we used the frequency distributions of values in the database to estimate probabilities associated with each individual value (see Appendix 2 for details). By this means, matches on uncommon names received higher scores than matches on common names, as described above.

When data were missing for a variable in either or both of the records under consideration, the corresponding r_i was set to zero, so that points were neither added to nor subtracted from the score.
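The general shape of this scoring can be sketched in Python. The text states only that each score component is the logarithm of a probability ratio and that missing data contribute zero; the base-2 logarithm, the probability values, and the variable names below are illustrative assumptions. The value-specific function uses a common textbook simplification (agreement on a value with relative frequency f is about 1/f times more likely among matches than nonmatches) and is not taken from Appendix 2.

```python
import math


def component_score(p_given_match, p_given_nonmatch):
    """r_i = log of P(x_i|m) / P(x_i|m'); base 2 is an assumption."""
    return math.log2(p_given_match / p_given_nonmatch)


def composite_score(outcomes, prob_table):
    """Sum the component scores; a missing comparison (None) contributes zero."""
    total = 0.0
    for variable, outcome in outcomes.items():
        if outcome is None:
            continue
        p_match, p_nonmatch = prob_table[variable][outcome]
        total += component_score(p_match, p_nonmatch)
    return total


def value_specific_score(relative_frequency):
    """Assumed simplification: rarer values earn larger agreement scores."""
    return math.log2(1.0 / relative_frequency)


# Invented probabilities, for illustration only: outcome -> (P(x|m), P(x|m')).
prob_table = {
    "maternal_middle_initial": {"match": (0.5, 0.125), "nonmatch": (0.5, 0.875)},
    "race": {"match": (0.5, 0.125), "nonmatch": (0.5, 0.875)},
}
pair_score = composite_score(
    {"maternal_middle_initial": "match", "race": None}, prob_table
)
common_name = value_specific_score(0.25)
rare_name = value_specific_score(0.0625)
```

With these invented numbers an agreement scores positive and a disagreement negative, and the rarer value scores higher than the common one, matching the qualitative behavior the text describes.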

In applying these scoring methods to the delinking of existing chains, we first computed the composite, probability-based score r for each pair of records in a chain (not just adjacent pairs). We then computed a "weakest path" score (described below) for the chain, and chains whose weakest paths were less than 16 were broken. This step yielded some shorter chains and some unlinked records. Finally, we reviewed the 20 chains representing 133 events that met either of the following two criteria: 1) the chain contained 10 or more events, or 2) the chain contained five or more events and had a weakest path score of 16-19. Those that were judged to be incorrectly linked were manually split. This final step was undertaken because of the low likelihood that a woman had 10 or more infants during the 13 years of the study and because of the observation that some of the longer chains with low scores were incorrectly linked.

The weakest path score is the lowest probability-based score in the "best" path (not necessarily the sequential, chronological connection) between any two records in the chain. For example, consider a three-record chain consisting of records A, B, and C, with probability-based scores shown in figure 2. The weakest path score is 25 because any two records (events) in the chain can be connected by a path that is never lower than 25. For example, the best path between records B and C is B-A-C, in which the probability-based scores are 30 and 25.
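One way to compute a score with this property is sketched below in Python. It adds record pairs from the highest score downward and reports the score at which all records first become connected, which equals the smallest link on the best (maximin) paths. This is an equivalent formulation chosen for brevity, not necessarily the procedure the authors programmed; the function name and the dictionary layout are hypothetical.

```python
def weakest_path_score(records, pair_scores):
    """Lowest score needed so that every pair of records is connected by a path."""
    parent = {record: record for record in records}

    def find(record):
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    components = len(parent)
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    for (a, b), score in ranked:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            components -= 1
            if components == 1:
                # The last link needed to connect the chain is its weakest path.
                return score
    raise ValueError("pair scores do not connect all records")


# The three-record example from the text: A-B = 30, A-C = 25, B-C = 15.
example = weakest_path_score(
    ["A", "B", "C"], {("A", "B"): 30, ("A", "C"): 25, ("B", "C"): 15}
)
CUTOFF = 16  # chains scoring below this value were broken
keep_chain = example >= CUTOFF
```

The low B-C score of 15 does not pull the chain below the cutoff, because B and C are connected through A by links of 30 and 25.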

We established the weakest path cutoff value of 16 for breaking chains by manually reviewing many chains that had a wide range of record-record probability-based scores, r. We judged most record pairs having link scores of 16 as valid matches. Scores higher than 16 indicated even greater likelihood of match validity. To put the value of 16 in perspective, the score component, r_i, for an exact match on the Social Security number variable was 18. The range of score components, r_i, for exact matches on other key variables included the following: maternal maiden name, 6-13; maternal first name, 4-12; and maternal date of birth, 8-10. Recall that score components, r_i, were negative for nonmatching fields, resulting in the subtraction of points from the composite probability-based score r.

[Figure 2: records A, B, and C joined pairwise, with r = 30 for A-B, r = 25 for A-C, and r = 15 for B-C.]

FIGURE 2. Weakest path in a three-record chain used in the construction of reproductive histories in Georgia, 1980-1992. r, probability-based score.

RESULTS

Data used for linking

In Georgia, the reported annual number of live births increased from 95,640 in 1980 to 114,235 in 1992; and the reported annual number of fetal deaths increased from 6,556 in 1980 to 10,636 in 1992. Assuming that approximately 15 percent of clinically recognized pregnancies end in spontaneous loss (9) and using the number of reported live births and fetal deaths to approximate the number of clinically recognized pregnancies, we estimated that 217,393 spontaneous pregnancy losses (fetal deaths) occurred. Only 104,102 fetal death certificates were filed, suggesting approximately 52 percent underreporting of fetal deaths.
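The underreporting estimate can be reproduced from figures given in the text. The short Python calculation below is a worked check; the variable names are hypothetical, and the assumption that the 15 percent was applied to the sum of reported live births and fetal deaths is inferred from the fact that it yields the stated 217,393.

```python
live_births = 1_345_185           # reported live births, 1980-1992
fetal_deaths_reported = 104_102   # fetal death certificates filed
loss_fraction = 0.15              # share of recognized pregnancies ending in loss

# Reported events stand in for clinically recognized pregnancies.
recognized_pregnancies = live_births + fetal_deaths_reported
expected_losses = round(loss_fraction * recognized_pregnancies)
underreporting = 1 - fetal_deaths_reported / expected_losses
underreporting_percent = round(underreporting * 100)
```

Here `expected_losses` comes to 217,393 and `underreporting_percent` to 52, in agreement with the figures in the paragraph.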

Some variables were not collected in all years or were collected in different formats on fetal death and live birth certificates (table 1). In general, personal identifiers were more completely recorded on certificates for live births than for fetal deaths. For live births, completeness of reporting was consistent across the 13 years of the study except for two variables. First, for the mother's Social Security number, completeness increased from 87 percent in 1980 to 96 percent in 1988 and subsequent years. Second, for the number of previous live births less than 2,500 g and the number of previous spontaneous abortions beyond 20 weeks, completeness increased from approximately 88 percent in the early 1980s to 99 percent in 1984 and later years.

For fetal deaths, completeness of recording for several variables changed over time. Completeness for maiden name declined from 96 percent in the early 1980s to 77 percent in 1992. Completeness for maternal education declined from 64 percent in the early 1980s to 51 percent in the early 1990s. Completeness for the dates of the most recent live birth and fetal death increased from about 80 percent in the early 1980s to 98 percent in the early 1990s. Completeness for the father's first and last names decreased from about 72 percent in the early 1980s to 55 percent in 1992. Completeness for information on fetal death certificates did not appear to vary with length of gestation at delivery.

When initially planning the linkage approach, we considered but ultimately decided against basing it exclusively on the mother's Social Security number. The mother's Social Security number was not collected on fetal death certificates from 1980 through 1988 and tended to be missing in a nonrandom manner on other records. It was less likely to be recorded on certificates for babies born to younger women, women with lower levels of education, and women who were not born in the United States (table 3). A third consideration was that the mother's Social Security number was not always accurately recorded. More than 7,100 pairs of records were observed that had the same maternal Social Security number but had different

TABLE 3. Percentage of birth certificates with missing data for maternal Social Security number by selected maternal and infant characteristics, Georgia, 1980-1992

| Variable | No. of records (n = 1,345,185) | Records with missing Social Security no. (%) |
| --- | --- | --- |
| Maternal age (years) | | |
| <18 | 94,383 | 21.4 |
| 18-19 | 137,979 | 6.7 |
| 20-24 | 424,687 | 3.8 |
| 25-29 | 378,667 | 3.2 |
| 30-34 | 217,521 | 3.3 |
| >34 | 76,845 | 4.5 |
| Unknown | 15,103 | 95.6 |
| Maternal education (years) | | |
| <12 | 353,187 | 11.7 |
| 12 | 556,711 | 4.5 |
| 13-15 | 229,105 | 3.2 |
| >15 | 203,516 | 3.7 |
| Unknown | 2,666 | 50.9 |
| Marital status | | |
| Married | 967,875 | 5.3 |
| Other | 376,182 | 8.0 |
| Unknown | 1,128 | 57.9 |
| Maternal race | | |
| White | 858,842 | 5.8 |
| Black | 469,462 | 6.3 |
| Other | 2,753 | 12.6 |
| Missing | 14,128 | 15.3 |
| Maternal state of birth* | | |
| Georgia | 810,428 | 5.9 |
| Other US | 427,794 | 5.1 |
| Other country | 10,009 | 16.7 |
| Unknown | 2,909 | 6.2 |
| Infant birth weight (g) | | |
| <1,500 | 21,740 | 10.5 |
| 1,500-2,499 | 91,224 | 7.8 |
| >2,499 | 1,231,346 | 5.9 |
| Unknown | 875 | 29.6 |

* Data available for live birth only.


maternal dates of birth, first names, and maiden names. Examination of the personal identifiers for these pairs suggested that the events corresponded to different mothers and that the matching Social Security numbers were erroneous.
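A consistency check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the record layout (dictionaries with `id`, `ssn`, `mother_dob`, `first_name`, and `maiden_name` keys) and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names for the three maternal identifiers compared in the text.
IDENTIFIERS = ("mother_dob", "first_name", "maiden_name")

def conflicting_ssn_pairs(records):
    """Return (id, id) pairs that share an SSN but disagree on every identifier."""
    by_ssn = defaultdict(list)
    for rec in records:
        if rec.get("ssn"):              # skip records with a missing SSN
            by_ssn[rec["ssn"]].append(rec)

    pairs = []
    for group in by_ssn.values():
        for a, b in combinations(group, 2):
            # Flag the pair only when all three identifiers differ.
            if all(a[f] != b[f] for f in IDENTIFIERS):
                pairs.append((a["id"], b["id"]))
    return pairs

example = [
    {"id": 1, "ssn": "111", "mother_dob": "1960-01-01", "first_name": "ANN", "maiden_name": "LEE"},
    {"id": 2, "ssn": "111", "mother_dob": "1971-05-09", "first_name": "JOY", "maiden_name": "COX"},
    {"id": 3, "ssn": "111", "mother_dob": "1960-01-01", "first_name": "ANN", "maiden_name": "LEE"},
    {"id": 4, "ssn": None,  "mother_dob": "1965-03-03", "first_name": "SUE", "maiden_name": "RAY"},
]
flagged = conflicting_ssn_pairs(example)
```

Pairs flagged this way are candidates for an erroneous Social Security number; whether they truly belong to different mothers still requires review of the remaining identifiers.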

Results of linkage

Of the 1.4 million records in the database, 38 percent (551,391) did not link to another record. From the remaining 897,896 events, 369,686 chains of two or more events that occurred to the same woman were constructed (table 4). The longest chain included 12 events. The preponderance of chains contained two events. For most chains, the weakest path had a score of 30 or more. Chains with greater numbers of events, however, tended to have lower scores for their weakest paths.

Impact of probability-based evaluation

The probability-based evaluation of chains resulted in delinkages affecting approximately 27,000 records, representing 3.0 percent of the records that had been in chains at the end of the deterministic linkage. Proportionately greater numbers of records in longer chains were affected by the assessment. For example, of the 5,768 records that were in chains of seven events after the deterministic step, 29.1 percent changed to shorter chains after the probability-based evaluation. In contrast, among the 247,275 records that were in chains of three events, 2.4 percent changed to shorter chains; and among the 506,012 records in chains of two events, 0.7 percent were split apart. The probability-based evaluation also affected proportionately more records of women whose marital status was unknown, who were of races other than white, who had 12 or fewer years of education, and who were born before 1950. Infant outcome (fetal death or live birth) and birth weight did not affect the likelihood that a record was delinked.

DISCUSSION

Major strengths of our linkage approach were identifying as many potentially correct linkages as possible and evaluating these linkages using a wide range of ancillary data. When we attempted linkage initially, we observed that women with less complete personal identifiers tended to have characteristics associated with increased risks of adverse pregnancy outcomes. For example, women whose certificates lacked their Social Security numbers tended to be younger or have less education, factors previously associated with adverse pregnancy outcomes (10, 11). Similarly, women whose certificates lacked paternal information or who had different fathers for successive pregnancies were often not married and thus at increased risk for adverse pregnancy outcomes (10, 11). Because of the public health importance of women with these characteristics, we avoided basing the linkages on only a few personal identifying variables. When evaluating the linkages, we attempted to compensate for potential overlinkage by using a probabilistic approach based on a wide range of personal attributes. In addition, every

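TABLE 4. Reproductive history chains constructed, by number of events in the chain and the score for the weakest path between events, Georgia, 1980-1992

| No. of events in chain | Score 16-19: No. | Score 16-19: % | Score 20-29: No. | Score 20-29: % | Score ≥30: No. | Score ≥30: % | Total: No. | Total: %* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 1,262 | 0.5 | 11,412 | 4.5 | 243,189 | 95.0 | 255,863 | 69.2 |
| 3 | 736 | 0.9 | 7,994 | 9.7 | 73,527 | 89.4 | 82,257 | 22.3 |
| 4 | 267 | 1.2 | 3,394 | 15.1 | 18,876 | 83.8 | 22,537 | 6.1 |
| 5 | 80 | 1.3 | 1,369 | 21.9 | 4,815 | 76.9 | 6,264 | 1.7 |
| 6 | 32 | 1.7 | 494 | 26.5 | 1,338 | 71.8 | 1,864 | 0.5 |
| 7 | 17 | 2.8 | 166 | 27.2 | 427 | 70.0 | 610 | 0.2 |
| >7 | 8 | 2.7 | 87 | 29.9 | 196 | 67.4 | 291 | 0.1 |
| Total | 2,402 | 0.7 | 24,916 | 6.7 | 342,368 | 92.6 | 369,686 | 100 |

The score columns refer to the score for the weakest path.

* Rounded.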


linkage was assigned a score that corresponds to its validity, permitting an analyst to select only linkages with high validity scores.

Throughout the linkage process, we were concerned that linkages would be driven by paternal information, resulting in the linking of events that had the same father but different mothers. Creation of links of this type was avoided in the deterministic part by requiring matches on two or more maternal variables. Additionally, when evaluating linkages, we considered only a limited number of paternal variables, thereby restricting the impact of paternal information on the scores. Thus, linkages occurring solely on the basis of paternal information were judged to be very unlikely, and the manual review of records supported this impression.

A related consideration was the possibility that we failed to link events that had different fathers but were experienced by the same mother. An evaluation of this linkage, reported elsewhere (12), showed rates of accurate linkage only slightly lower when paternity differed among a mother's births or when paternal information was not stated for one or more events. Thus, we believe that this linkage methodology yielded data appropriate for assessing the impact of changes in paternity on pregnancy outcome.

Potential weaknesses in the approach included a lack of independence among some of the variables used in the probabilistic evaluation of linkages and the occurrence of incorrect linkages between family members, especially mothers who were twins. The probabilistic evaluation of the linkages was based on the assumption that the variables used were statistically independent of each other. Generally, this appeared true. A few instances were observed in which this assumption did not hold, such as within ethnic groups for whom a small number of first names and surnames were used, thus violating the independence between first name and maiden name. This problem was exacerbated by the rarity of these ethnic names, which received high point values for matches.

In reviewing the linkages, we observed a few that appeared to occur between mothers who were twins. These mothers had identical information for date of birth, state of birth, maiden name, and race and often had very similar information for their first name (e.g., Mary and Martha), years of education, Social Security number, and zip code. Despite the use of a wide range of identifying information, some incorrect linkage of the offspring of mothers who were twins may be unavoidable, remedied only by manual review of many records. Because this type of error appeared rare, we did not undertake this review.

One cost of the approach was the substantial programming and computer resources needed to accomplish the linkage and the probability-based evaluation. The substantial resources were necessitated by the large number of records that were used and variables that were considered. Many time-consuming computer runs were required to build the tables of frequencies needed to assign probabilistic scores associated with individual variables. When the linkage was started, no commercially distributed software was available that met the needs of the project.

Beyond our methodological approach, the available data also influenced the success of the linkages. By limiting the database to certificates filed in Georgia, we excluded from the linkage events of women who had a delivery in another state and subsequently moved to Georgia. Because a national data set of fetal deaths and live births that contains personal identifiers is not available, there was no good alternative to using Georgia data. Limiting the database to events of 1980-1992 meant that there probably were not enough data to create lifetime pregnancy histories for many women. The likely underregistration of fetal deaths probably caused the linkage of these events to be incomplete. Finally, inaccuracies and omissions in personal identifying data inevitably limited the ability to link records.

Probabilities were used not to link records, but only to evaluate chains that had already been constructed. Developing the probabilities needed for linkage scoring required a set of records that were assumed to be correctly linked; thus, a probabilistic linkage could not have been done without first performing a deterministic linkage.

These linked data are being used to investigate a number of relations, such as the accuracy of the vaginal birth after cesarean section delivery method on the birth certificate. Analyses are in progress to evaluate the patterns of maternal behaviors across pregnancies, such as smoking and delayed entry to prenatal care. The data are also being used to examine the association between length of interpregnancy interval and pregnancy outcome, adequacy of prenatal care and risk of intrauterine growth retardation, and the impact of changes in paternity on adverse pregnancy outcome.

Researchers contemplating a similar linkage may be encouraged to know that, for the majority of records, linkage can be accomplished with a relatively straightforward, deterministic approach. Evaluation of our initial linkages shows that nearly all of them are accurate and that failure to link births correctly was rare (12). Unfortunately, the small subset of records that are the most difficult to link tends to overrepresent groups at highest risk of adverse pregnancy outcomes. For these records, evaluation of a wide range of identifying information may be needed. Future research is needed to evaluate alternate approaches for linking these records.

ACKNOWLEDGMENTS

The authors thank the following individuals: Cynthia Mervis, who identified differences in the reporting formats of birth and fetal death certificates over the study period; staff of the Information Resources Management Office, Centers for Disease Control and Prevention, who provided computational resources; Michael Lavoie, Director, Vital Statistics Office, Center for Health Information Branch, Division of Public Health, State of Georgia, who provided vital records data and consulted on their interpretation; Virginia Floyd, Director, Maternal and Child Health Branch, Division of Public Health, State of Georgia, who facilitated the process of conducting the linkage; Hani Atrash, Chief, Pregnancy and Infant Health Branch, Division of Reproductive Health, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, who provided administrative and technical support.

REFERENCES

1. Golding J. The epidemiology of perinatal death. In: Kiely M, ed. Reproductive and perinatal epidemiology. Boca Raton, FL: CRC Press, 1991:406-8.

2. Skjaerven R, Wilcox AJ, Lie RT, et al. Selective fertility and the distortion of perinatal mortality. Am J Epidemiol 1988;128:1352-63.

3. Bakketeig LS, Hoffman HJ. Perinatal mortality by birth order within cohorts based on sibship size. Br Med J 1979;2:693-6.

4. SAS Institute, Inc. SAS language: reference, version 6, 1st ed. Cary, NC: SAS Institute, Inc., 1990.

5. Fellegi IP, Sunter AB. A theory of record linkage. J Am Stat Assoc 1969;64:1183-210.

6. Newcombe HB. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford, United Kingdom: Oxford University Press, 1988.

7. Winkler WE. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of Survey Research Methods Section. Alexandria, VA: American Statistical Association, 1988:667-71.

8. Thibaudeau Y. The discrimination power of dependency structures in record linkage. Surv Methodol 1993;19:31-8.

9. Hertz-Picciotto I, Samuels SJ. Incidence of early pregnancy loss. N Engl J Med 1988;319:1483-4.

10. Committee to Study the Prevention of Low Birthweight, Institute of Medicine. Preventing low birthweight. Washington, DC: National Academy Press, 1985:51.

11. Berkowitz GS, Papiernik E. Epidemiology of preterm birth. Epidemiol Rev 1993;15:414-43.

12. Adams MM, Berg CJ, McDermott JC, et al. Evaluation of reproductive histories constructed by linking vital records. Paed Perinat Epidemiol (in press).

APPENDIX 1

Details of the Deterministic Linkage

Each step in phase I consisted of a sort followed by a processing run that linked records. The sort keys of the six steps are the following:

1. mother's date of birth,
2. mother's Social Security number,
3. mother's maiden name and the first initial of her first name,
4. mother's Social Security number,
5. mother's maiden name and the first initial of her first name,
6. mother's maiden name and the first initial of her first name.

We denoted the linkage status of a record as "U" if it was unlinked (i.e., not linked to any other record) and as "L" if it had been linked to another record. At any point in the linking process, then, we designated the linkage status of any pair of records as U-U, L-L, or U-L, depending on whether neither, both, or only one of the records had previously been linked to another record. For each of the six steps, we specified which statuses were eligible for linking and which, if any, were eligible to be "noted." Linking two records was done by setting the variable for chain number to the same value in both records. Noting a linkage was done by setting the value of an auxiliary variable in one of the records to the value of the chain number variable of the other record; this information was later used for consolidating chains in phase II. The categories of linkages we permitted in the six steps were as follows:

1. link: U-U, U-L, L-L; note: none;
2. link: U-L; note: L-L;
3. link: U-L; note: L-L;
4. link: U-L, U-U; note: none;
5. link: U-L; note: none;
6. link: U-L, U-U; note: none.

Thus, in the first step, any pair of records that satisfied the matching criteria was linked. Linkages of the L-L type were consolidated into single chains within this step. It was possible to consolidate chains in the first step because there were no linkages to records outside the block of records having the same sort key. Such consolidation of L-L linkages was not possible in steps 2-6 because the members of a chain could be spread throughout the file. In steps 2, 3, and 5, only pairs of records wherein one of the records was previously linked and the other was unlinked were eligible for linking. The idea was to try to add records to existing chains where possible, rather than starting new chains. Because the rules disallowed U-U and L-L links in steps 2, 3, and 5, the process included later steps using the same sort keys to identify any remaining valid links that had not been permitted in earlier steps.
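The per-step rules can be summarized as a small lookup table. The following is a minimal sketch, not the original program: the names `STEP_RULES`, `pair_status`, and `action_for_pair` are hypothetical, step 4 is read as permitting U-L and U-U links, and the pair is assumed to have already satisfied the step's matching criteria.

```python
# Sort key and permitted pair statuses for each of the six phase I steps.
MAIDEN_KEY = "mother's maiden name and first initial of first name"
SSN_KEY = "mother's Social Security number"

STEP_RULES = {
    1: {"sort_key": "mother's date of birth", "link": {"U-U", "U-L", "L-L"}, "note": set()},
    2: {"sort_key": SSN_KEY,    "link": {"U-L"},        "note": {"L-L"}},
    3: {"sort_key": MAIDEN_KEY, "link": {"U-L"},        "note": {"L-L"}},
    4: {"sort_key": SSN_KEY,    "link": {"U-L", "U-U"}, "note": set()},
    5: {"sort_key": MAIDEN_KEY, "link": {"U-L"},        "note": set()},
    6: {"sort_key": MAIDEN_KEY, "link": {"U-L", "U-U"}, "note": set()},
}

def pair_status(a_linked, b_linked):
    """Classify a record pair as U-U, U-L, or L-L from each record's link status."""
    if a_linked and b_linked:
        return "L-L"
    if a_linked or b_linked:
        return "U-L"
    return "U-U"

def action_for_pair(step, a_linked, b_linked):
    """Return 'link', 'note', or 'skip' for a pair that met the step's match criteria."""
    rule = STEP_RULES[step]
    status = pair_status(a_linked, b_linked)
    if status in rule["link"]:
        return "link"   # set the chain number to the same value in both records
    if status in rule["note"]:
        return "note"   # record the other chain number for consolidation in phase II
    return "skip"
```

For example, `action_for_pair(2, True, True)` returns `"note"`, reflecting that two existing chains matched in step 2 are not merged until the later consolidation.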

APPENDIX 2

Details of Probabilistic Evaluation of Linkages

For any pair of records, the event of having the same mother was denoted as $m$, the event of having different


mothers as $m'$, and the vector of outcomes of the 18 variables as $x = (x_1, x_2, \ldots, x_{18})$. The probability, $P(m \mid x)$, that two records belong to the same mother, given the observed pattern of field-by-field outcomes, can be computed as follows from Bayes' theorem:

$$P(m \mid x) = \frac{P(m)\,P(x \mid m)}{P(x)}$$

and

$$P(m \mid x) = \frac{P(m)\,P(x \mid m)}{P(m)\,P(x \mid m) + [1 - P(m)]\,P(x \mid m')} = \frac{1}{1 + K/R}$$

where

$$K = \frac{1 - P(m)}{P(m)}, \qquad R = \frac{P(x \mid m)}{P(x \mid m')}.$$

The quantity $P(m)$ is the unconditional probability that two randomly selected records match (i.e., belong to the same mother). $P(m)$ and $K$ are constants for our data set. As the ratio $R$ increases, $P(m \mid x)$ increases. $P(m)$ and $K$ could have been estimated, at least roughly, from our data, but it was convenient to use

$$r = \log(R) = \log[P(x \mid m)] - \log[P(x \mid m')]$$

as the working index of the likelihood of a match between two records, rather than computing $P(m \mid x)$ per se. Thus, it was unnecessary to estimate values of $P(m)$ and $K$.
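The reason the index can replace the posterior probability is that, for a fixed prior, the posterior is a strictly increasing function of the log ratio. The sketch below illustrates this under two assumptions that are not stated in the text: natural logarithms are used, and the prior value passed in is an arbitrary illustrative number. The function name is hypothetical.

```python
import math

def posterior_match_probability(r, prior_match):
    """Return P(m|x) = 1 / (1 + K/R), with K = (1 - P(m)) / P(m) and R = exp(r).

    r is the log likelihood ratio (natural log assumed here);
    prior_match is P(m), the unconditional match probability.
    """
    k = (1.0 - prior_match) / prior_match
    return 1.0 / (1.0 + k * math.exp(-r))

# With a fixed prior, a larger r always gives a larger posterior,
# so ranking record pairs by r is the same as ranking them by P(m|x).
illustrative_prior = 1e-6   # arbitrary value chosen only for illustration
low = posterior_match_probability(10.0, illustrative_prior)
high = posterior_match_probability(20.0, illustrative_prior)
```

Because only the ordering of pairs matters when choosing which linkages to keep, the constants never need to be estimated unless an absolute probability is wanted.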

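If the $x_i$ are mutually independent (conditional on true match/nonmatch status), then

$$P(x \mid m) = P(x_1 \mid m) \cdot P(x_2 \mid m) \cdots P(x_{18} \mid m)$$

and

$$P(x \mid m') = P(x_1 \mid m') \cdot P(x_2 \mid m') \cdots P(x_{18} \mid m')$$

from which

$$R = \frac{P(x_1 \mid m)}{P(x_1 \mid m')} \cdot \frac{P(x_2 \mid m)}{P(x_2 \mid m')} \cdots \frac{P(x_{18} \mid m)}{P(x_{18} \mid m')}$$

$$r = \log(R) = r_1 + r_2 + \cdots + r_{18}$$

where

$$r_i = \log\left[\frac{P(x_i \mid m)}{P(x_i \mid m')}\right].$$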

Thus, computation of the score, $r_i$, requires estimation of the conditional probabilities $P(x_i \mid m)$ and $P(x_i \mid m')$ or the ratio $P(x_i \mid m)/P(x_i \mid m')$ corresponding to the outcome $x_i$. For variables for which the outcome does not reflect the specific value of the variable, we can estimate $P(x_i \mid m)$ from a large pool of correctly matched records. For this purpose, we used those record pairs produced by the deterministic linkage for which we were very confident of the correctness of the linkage. To estimate $P(x_i \mid m')$, it was relatively easy to extract a large set of record pairs that clearly did not match.

For variables whose outcomes reflected the specific value of the variable, such as name variables, we estimated the unconditional probability of each outcome, $P(x_i)$, from the frequency distribution of the values of the variable and then assumed that when $x_i$ reflects identical values of $X_i$ in the two records, $P(x_i \mid m') = [P(x_i)]^2$. For example, the probability of two records both having the maiden name "Adams" when the mothers are known to be different is approximately the same as when the two records are selected totally at random. We assumed further that $P(x_i \mid m) = P(x_i)$, an approximation that causes very little distortion in the score component, $r_i$. This leads to

$$R_i = \frac{P(x_i \mid m)}{P(x_i \mid m')} = \frac{1}{P(x_i)}$$

when $x_i$ represents matching values of $X_i$.
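The scoring described in this appendix can be sketched in a few functions. This is an illustration under stated assumptions rather than the program that was run: the function names are hypothetical, natural logarithms are assumed, and every probability in the example is an invented value chosen only to show the mechanics.

```python
import math

def agreement_component(p_given_match, p_given_nonmatch):
    """r_i = log[P(x_i|m) / P(x_i|m')] for an outcome that does not depend on the value."""
    return math.log(p_given_match / p_given_nonmatch)

def value_specific_component(value_frequency):
    """r_i = log(R_i) with R_i = 1 / P(x_i), used when both records carry the same value."""
    return math.log(1.0 / value_frequency)

def total_score(components):
    """r = r_1 + r_2 + ..., valid when the outcomes are conditionally independent."""
    return sum(components)

# Invented example: agreement on an indicator-type variable that is seen in
# 95 percent of true matches and 10 percent of nonmatches, plus agreement on
# a maiden name carried by 0.2 percent of records.
components = [
    agreement_component(0.95, 0.10),
    value_specific_component(0.002),
]
score = total_score(components)
```

A rare name contributes far more to the score than a common one, which is consistent with the observation in the Discussion that rare names received high point values for matches.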
