41
Comparing SAS Files Laura Schild

Comparing SAS Files

Embed Size (px)

Citation preview

Page 1: Comparing SAS Files

Comparing SAS Files

Laura Schild

Page 2: Comparing SAS Files

Consider using PROC COMPARE the next time you need to…

• prepare to combine two data sets, so you know what variables need to be reformatted, etc.

• evaluate newly collected data in comparison to an existing file (ex. CRANE Study / Healthcore files)

• test whether program revisions have occurred as expected

• examine whether two algorithms for computing certain variables produce comparable results

Page 3: Comparing SAS Files
Page 4: Comparing SAS Files

PROC COMPARE with No Options

proc compare base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201201 ; title1 'PROC COMPARE with No Options';run;

Page 5: Comparing SAS Files

PROC COMPARE with No Options (continued)

Page 6: Comparing SAS Files

PROC COMPARE with No Options (continued)

Page 7: Comparing SAS Files

PROC COMPARE with No Options (continued)

Page 8: Comparing SAS Files

PROC COMPARE with No Options (continued)

Page 9: Comparing SAS Files

PROC COMPARE with No Options (continued)

Page 10: Comparing SAS Files

Comparing Contents OnlyWhen all you really care about is the contents, add the NoValues and

ListVar options. • NOVALUES suppresses the report of the value comparison results. • NOSUMMARY suppresses the data set, variable, observation, and values comparison

summary reports. • LISTVAR lists all variables that are found in only one data set. • WARNING displays a warning message in the SAS log when differences are found.

proc compare novalues nosummary listvar warning base = TrackingSystem_BR_201112 compare = TrackingSystem_BR_201201 ; title1 'PROC COMPARE with NoValues, NoSummary and ListVar

Options';run;

Page 11: Comparing SAS Files

Comparing Contents Only (continued)

Page 12: Comparing SAS Files

Comparing Contents Only (continued)

Page 13: Comparing SAS Files

Comparing Contents Only (continued)

Page 14: Comparing SAS Files

The Warning Option

Note: The NoValues and NoSummary options suppress printing reports, but SAS still compares the records. Consequently, when you use the Warning option, you will still get warnings even if the contents are identical.

Page 15: Comparing SAS Files

PROC COMPARE Identical Files with Warning Option

*-------------------------------------------------------------------*;

* PROC COMPARE Identical Files with Warning Option *;*---------------------------------------------------------------

----*;data TrackingSystem_BR_201112_Copy;

set TrackingSystem_BR_201112;run;

options pageno=1;proc compare warning

base = TrackingSystem_BR_201112compare = TrackingSystem_BR_201112_Copy;title1 'PROC COMPARE Identical Files with Warning Option';

run;

Page 16: Comparing SAS Files

PROC COMPARE Identical Files with Warning Option (output)

Page 17: Comparing SAS Files

PROC COMPARE Identical FilesSorted Differently

*-------------------------------------------------------------------*;* PROC COMPARE with Warning Option *;* - Identical Files Sorted Differently *;*-------------------------------------------------------------------*;proc sort data=TrackingSystem_BR_201112; by MemberNumber;run;proc sort data=TrackingSystem_BR_201112_copy; by LastName;run;

options pageno=1;proc compare warning

base = TrackingSystem_BR_201112compare = TrackingSystem_BR_201112_copy;title1 'PROC COMPARE with Warning Option - Identical Files Sorted Differently';

run;

Page 18: Comparing SAS Files

PROC COMPARE Identical FilesSorted Differently (Log)

5657 options pageno=1;58 proc compare warning59 base = TrackingSystem_BR_20111260 compare = TrackingSystem_BR_201112_copy61 ;62 title1 'PROC COMPARE with Warning Option - Identical Files Sorted Differently';63 run;

WARNING: Values of the following 18 variables compare unequal: MemberNumber FirstName LastName PhoneNumber Gender BirthDate StreetAddress StreetAddress2 City ZipCode NumberOfMedications_BR DrugName1_BR DrugName2_BR PrescriptionNumber1_BR PrescriptionNumber2_BR TransferNumberLive Schedule

TransferNumberScriptWARNING: The data sets WORK.TRACKINGSYSTEM_BR_201112 and

WORK.TRACKINGSYSTEM_BR_201112_COPY contain unequal values.NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112.NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112_COPY.NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by

the "BEST" format.NOTE: PROCEDURE COMPARE used (Total process time): real time 0.01 seconds cpu time 0.01 seconds

Page 19: Comparing SAS Files

PROC COMPARE Identical FilesSorted Differently (Output)

Page 20: Comparing SAS Files

PROC COMPARE Identical FilesSorted Differently (Output, page 2)

Page 21: Comparing SAS Files

PROC COMPARE Identical FilesSorted Differently (Output, page 19)

• For each variable, you get a list similar to the following. Full output not included due to patient confidentiality and due to length.

Page 22: Comparing SAS Files

PROC COMPARE by IDWith MaxPrint Option (code)

*-------------------------------------------------------------------------------*;* PROC COMPARE by ID with ListObs and MaxPrint Options *;*-------------------------------------------------------------------------------*;libname local 'G:\Investigators\Davis\Davis_CRANE (SJS Antiepileptic)\SAS Programs\Data';libname kpga 'G:\Investigators\Davis\Davis_CRANE (SJS Antiepileptic)\SAS Programs\Data\KPGA';

proc sort data=kpga.CRANE01v3_KPGA_Cohort_NoID; by StudyID Cohort;proc sort data=local.CRANE01v5_KPGA_Cohort_NoID; by StudyID Cohort;run;

options pageno=1;PROC COMPARE ListVar ListObs MaxPrint=(10,500)

BASE = kpga.CRANE01v3_KPGA_Cohort_NoID COMPARE = local.CRANE01v5_KPGA_Cohort_NoIDLISTOBS;ID StudyID Cohort;title1 'PROC COMPARE by ID with ListVar, ListObs, and MaxPrint Options';

RUN;

Page 23: Comparing SAS Files

MaxPrint Option• LISTOBS lists all observations that are found in only one data set. • MAXPRINT=total | (per-variable, total) specifies the maximum

number of differences to print, where – total is the maximum total number of differences to print. The default

value is 500 unless you use the ALLOBS option (or both the ALLVAR and TRANSPOSE options), in which case the default is 32000.

– per-variable is the maximum number of differences to print for each variable within a BY group. The default value is 50 unless you use the ALLOBS option (or both the ALLVAR and TRANSPOSE options), in which case the default is 1000.

– The MAXPRINT= option prevents the output from becoming extremely large when data sets differ greatly.

Page 24: Comparing SAS Files

PROC COMPARE by IDWith MaxPrint Option (output)

Page 25: Comparing SAS Files
Page 26: Comparing SAS Files
Page 27: Comparing SAS Files
Page 28: Comparing SAS Files

Comparing Multiple Files

Page 29: Comparing SAS Files

Run PROC CONTENTS• Run a PROC CONTENTS on each file to be included in the

comparison, keeping the variable name, type, and length.

*-------------------------------------------------------------------------------*;* Compare the variable names/formats on each of the Tracking System BR files. *;*-------------------------------------------------------------------------------*;proc contents data=qa.TrackingSystem_BR_201112 noprint out=TS_BR_201112 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201201 noprint out=TS_BR_201201 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201202 noprint out=TS_BR_201202 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201203 noprint out=TS_BR_201203 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201204 noprint out=TS_BR_201204 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201205 noprint out=TS_BR_201205 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201206 noprint out=TS_BR_201206 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201207 noprint out=TS_BR_201207 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201208 noprint out=TS_BR_201208 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201209 noprint out=TS_BR_201209 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201210 noprint out=TS_BR_201210 (keep=NAME TYPE LENGTH);proc contents data=qa.TrackingSystem_BR_201211 noprint out=TS_BR_201211 (keep=NAME TYPE LENGTH);run;

Page 30: Comparing SAS Files

Get All Variable Names

• Combine all of the files, keeping only the variable name, and then eliminate duplicates.

data TS_BR_AllNames (keep=name);set TS_BR_201112 TS_BR_201201 TS_BR_201202 TS_BR_201203 TS_BR_201204 TS_BR_201205 TS_BR_201206 TS_BR_201207 TS_BR_201208 TS_BR_201209 TS_BR_201210 TS_BR_201211 ;

run;

proc sort data=TS_BR_AllNames nodupkey; by name;run;

Page 31: Comparing SAS Files

Combine All Contents Using SQLproc sql;

create table TS_BR_AllContents asselect a.name , b.type as type_201112

, c.type as type_201201, d.type as type_201202, e.type as type_201203, f.type as type_201204, g.type as type_201205, h.type as type_201206, i.type as type_201207, j.type as type_201208, k.type as type_201209, l.type as type_201210, m.type as type_201211, b.length as length_201112 , c.length as length_201201 , d.length as length_201202 , e.length as length_201203 , f.length as length_201204 , g.length as length_201205 , h.length as length_201206

Page 32: Comparing SAS Files

Combine All Contents Using SQL, i.length as length_201207 , j.length as length_201208 , k.length as length_201209 , l.length as length_201210 , m.length as length_201211

from TS_BR_AllNames as aleft join TS_BR_201112 as b on a.name = b.nameleft join TS_BR_201201 as c on a.name = c.nameleft join TS_BR_201202 as d on a.name = d.nameleft join TS_BR_201203 as e on a.name = e.nameleft join TS_BR_201204 as f on a.name = f.nameleft join TS_BR_201205 as g on a.name = g.nameleft join TS_BR_201206 as h on a.name = h.nameleft join TS_BR_201207 as i on a.name = i.nameleft join TS_BR_201208 as j on a.name = j.nameleft join TS_BR_201209 as k on a.name = k.nameleft join TS_BR_201210 as l on a.name = l.nameleft join TS_BR_201211 as m on a.name = m.name;

quit;

Page 33: Comparing SAS Files

Report Variables Not On All Filesproc print data=TS_BR_AllContents;

where type_201112 = .or type_201201 = .or type_201202 = .or type_201203 = .or type_201204 = .or type_201205 = .or type_201206 = .or type_201207 = .or type_201208 = .or type_201209 = .or type_201210 = .or type_201211 = .;

title3 'Variable Name Not On All Tracking System BR Files';run;

Page 34: Comparing SAS Files

Report Variables Not On All Files

Page 35: Comparing SAS Files

Correct Discrepancies• If one or more files contain variables that they should not have (e.g. HRN

on the 201201 file), drop those variables.• If one or more files are missing variables that they should have, add

them, by:– linking to another source (if available)– hardcoding a value– adding dummy variables w/ missing values– getting a replacement file if necessary

• If there are discrepancies in variable names, you might see both “extra” variables and “missing” variables that need to be fixed by renaming them.

• Make corrections and rerun the PROC CONTENTS and PROC SQL.• You might also want to rerun the discrepancy report to make sure you

fixed the discrepancies correctly.

Page 36: Comparing SAS Files

Report Variable Type Discrepancies*-------------------------------------------------------------------------------*;* Check variable types for consistency. *;*-------------------------------------------------------------------------------*;proc print data=TS_BR_AllContents;

where type_201112 ne type_201211or type_201201 ne type_201211or type_201202 ne type_201211or type_201203 ne type_201211or type_201204 ne type_201211or type_201205 ne type_201211or type_201206 ne type_201211or type_201207 ne type_201211or type_201208 ne type_201211or type_201209 ne type_201211or type_201210 ne type_201211;

title3 'Variable Type Discrepancy On Tracking System BR Files';run;

Page 37: Comparing SAS Files

Report Variable Type Discrepancies

Page 38: Comparing SAS Files

Correct Type DiscrepanciesConverting Character to Numeric

%macro CorrectType (file,var,tempvar);data TrackingSystem_BR_&file (drop=&tempvar);

set TrackingSystem_BR_&file (rename=(&var=&tempvar));&var = &tempvar * 1;

run;%mend CorrectType;

%CorrectType (201201,PrescriptionNumber2_BR,RxNum2_BR_Char);%CorrectType (201203,PrescriptionNumber2_BR,RxNum2_BR_Char);%CorrectType (201204,PrescriptionNumber2_BR,RxNum2_BR_Char);

• Then rerun the PROC CONTENTS and PROC SQL.• You might also want to rerun the discrepancy reports

to make sure you fixed the type discrepancies correctly.

Page 39: Comparing SAS Files

Report Variable Length Discrepancies*-------------------------------------------------------------------------------*;* Check variable lengths for consistency. *;*-------------------------------------------------------------------------------*;data TS_BR_AllContents;

set TS_BR_AllContents; length_max = max(length_201112,length_201201,length_201202,length_201203, length_201204,length_201205,length_201206,length_201207, length_201208,length_201209,length_201210,length_201211);

run;

proc print data=TS_BR_AllContents;where length_201112 ne length_201211

or length_201201 ne length_201211or length_201202 ne length_201211or length_201203 ne length_201211or length_201204 ne length_201211or length_201205 ne length_201211or length_201206 ne length_201211or length_201207 ne length_201211or length_201208 ne length_201211or length_201209 ne length_201211or length_201210 ne length_201211;

var name length_max length_2: type: ;title3 'Variable Length Discrepancy On Tracking System BR Files';

run;

Page 40: Comparing SAS Files

Report Variable Length Discrepancies

Page 41: Comparing SAS Files

Combine Files Keeping Longest Variable Lengths

*-------------------------------------------------------------------------------*;* Combine all of the Tracking System BR files. *;*-------------------------------------------------------------------------------*;data TrackingSystem_BR_All;

length City $16DrugName2_BR $21FirstName $11LastName $17PRODUCTID $14PreviouslyAsked $4StreetAddress $30StreetAddress2 $28;

set TrackingSystem_BR_201112 (in=a)TrackingSystem_BR_201201 (in=b)TrackingSystem_BR_201202 (in=c)TrackingSystem_BR_201203 (in=d)TrackingSystem_BR_201204 (in=e)TrackingSystem_BR_201205 (in=f)TrackingSystem_BR_201206 (in=g)TrackingSystem_BR_201207 (in=h)TrackingSystem_BR_201208 (in=i)TrackingSystem_BR_201209 (in=j)TrackingSystem_BR_201210 (in=k)TrackingSystem_BR_201211 (in=l);