Alternatives to Merging SAS Data Sets … But Be Careful

Michael J. Wieczkowski, IMS HEALTH, Plymouth Meeting, PA

Abstract

The MERGE statement in the SAS programming language is a very useful tool in combining or bridging information from multiple SAS data sets. If you work with large data sets the MERGE statement can become cumbersome because it requires all input data sets to be sorted prior to the execution of the data step. This paper looks at two simple alternative methods in SAS that can be used to bridge information together. It also highlights caveats to each approach to help you avoid unwanted results.

PROC SQL can be very powerful but if you are not familiar with SQL or if you do not understand the dynamics of SQL “joins” it can seem intimidating. The first part of this paper explains to the non-SQL expert the advantages of using PROC SQL in place of MERGE. It also highlights areas where PROC SQL novices can get into trouble and how they can recognize areas of concern.

The second section shows another technique for bridging information together with the help of PROC FORMAT. It uses the CNTLIN option in conjunction with the PUT function to replicate a merge. Caveats to this method are listed as well.

What’s wrong with MERGE?

Nothing is wrong with using MERGE to bridge information together from multiple files. It is usually one of the first options that comes to mind. The general technique is to read in the data sets, sort them by the variable you would like to merge on, then execute the merge. Example 1A below shows a match merge between two data sets. One contains zip codes and states, the other contains zip codes and cities. The goal is to attach a city to each zip code and state.

Example 1A:

  /* Read data set #1 - zip/state file */
  DATA STATES;
    LENGTH STATE $ 2 ZIP $ 5;
    INPUT ZIP $ STATE $;
  CARDS;
  08000 NJ
  10000 NY
  19000 PA
  ;
  RUN;

  /* Sort by variable of choice */
  PROC SORT DATA=STATES;
    BY ZIP;
  RUN;

  /* Read data set #2 - zip/city file */
  DATA CITYS;
    LENGTH ZIP $ 5 CITY $ 15;
    INPUT ZIP $ CITY $;
  CARDS;
  10000 NEWYORK
  19000 PHILADELPHIA
  90000 LOSANGELES
  ;
  RUN;

  /* Sort by variable of choice */
  PROC SORT DATA=CITYS;
    BY ZIP;
  RUN;

  /* Merge by variable of choice */
  DATA CITY_ST;
    MERGE STATES CITYS;
    BY ZIP;
  RUN;

  PROC PRINT DATA=CITY_ST;
  RUN;

The following output is produced:

  OBS  STATE  ZIP    CITY
   1   NJ     08000
   2   NY     10000  NEWYORK
   3   PA     19000  PHILADELPHIA
   4          90000  LOSANGELES

Notice above that the merged data set shows four observations with observation 1 having no city value and observation 4 having no state value. This is because not all of the records from the state file had a zip code in the city file and not every zip code in the city file had a matching zip code in the state file. The way the merge was defined in the data step allowed for these non-matches to be included in the output data set.

To keep only matched observations between the two files, the merge step can be altered using the IN= data set option, which names a variable that is set to 1 if the data set contributes to the current observation, i.e., it contains the current value of the BY variables. This is shown in the example below.

Example 1B:

  DATA CITY_ST2;
    MERGE STATES (IN=A) CITYS (IN=B);
    BY ZIP;
    IF A AND B;
  RUN;

  PROC PRINT DATA=CITY_ST2;
  RUN;

The following output is produced:

  OBS  STATE  ZIP    CITY
   1   NY     10000  NEWYORK
   2   PA     19000  PHILADELPHIA

The two observations above were the only two situations where each data set in the MERGE statement contained common BY values, ZIP=10000 and ZIP=19000. Specifying IF A AND B tells SAS to keep observations that have BY values in both the STATES (a.k.a. A) and CITYS (a.k.a. B) files. Similarly you could include observations with ZIPs in one file but not in the other by saying either

  IF A AND NOT B;

or

  IF B AND NOT A;
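The three conditions can also be applied in a single pass over the data. The following is a minimal sketch rather than code from the examples above; the output data set names MATCHED, STATE_ONLY and CITY_ONLY are assumptions chosen for illustration, and STATES and CITYS are assumed to be sorted by ZIP already, as in example 1A.

```sas
/* Route matches and non-matches to separate data sets.      */
/* MATCHED, STATE_ONLY and CITY_ONLY are hypothetical names. */
/* STATES and CITYS must already be sorted by ZIP.           */
DATA MATCHED STATE_ONLY CITY_ONLY;
  MERGE STATES (IN=A) CITYS (IN=B);
  BY ZIP;
  IF A AND B THEN OUTPUT MATCHED;      /* ZIP found in both files */
  ELSE IF A THEN OUTPUT STATE_ONLY;    /* ZIP only in STATES      */
  ELSE OUTPUT CITY_ONLY;               /* ZIP only in CITYS       */
RUN;
```

With the sample data, MATCHED should receive ZIPs 10000 and 19000, STATE_ONLY should receive 08000 and CITY_ONLY should receive 90000.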

Introduction to PROC SQL

A similar result is possible using PROC SQL. Before showing an example of how this is done, a brief description of SQL and how it works is necessary. SQL (Structured Query Language) is a very popular language used to create, update and retrieve data in relational databases. Although this is a standardized language, there are many different flavors of SQL. While most of the different versions are nearly identical, each contains its own “dialect” which may not work the same from one platform to another. These intricacies are not an issue when using PROC SQL in SAS.

For the purpose of this paper, using PROC SQL does not require being an expert in SQL or relational theory. The more you know about these topics the more powerful things you can do with PROC SQL. In my examples it is only required that you learn a few basic things about the SQL language and database concepts.

In SQL/database jargon we think of columns and tables where in SAS we refer to them as variables and data sets. Extracting data from a SAS data set is analogous, in SQL talk, to querying a table, and instead of merging in SAS we perform “joins” in SQL.

The key to replicating a MERGE in PROC SQL lies in the SELECT statement. The SELECT statement defines the actual query. It tells which columns (variables) you wish to select and from which tables (data sets) you want them selected under certain satisfied conditions. The basic PROC SQL syntax is as follows:

  PROC SQL NOPRINT;
    CREATE TABLE newtable AS
    SELECT col1, col2, col3
    FROM table1, table2
    WHERE some condition is satisfied;

The first line executes PROC SQL. The default for PROC SQL is to automatically print the entire query result. Use the NOPRINT option to suppress the printing. If you need the data from the query result in a data set for future processing you should use the CREATE TABLE statement as shown on line 2 and specify the name of the new data set. Otherwise, the CREATE TABLE statement is not necessary as long as you want the output to print automatically.

Lines 2 through 5 are all considered part of the SELECT statement. Notice that the semicolon only appears at the very end. In the SELECT statement we define the variables we wish to extract, separated by commas. If all variables are desired simply put an asterisk (*) after SELECT. Be careful if the data sets you are joining contain the same variable name. This will be explained more in the example to follow. The FROM clause tells which data sets to use in your query. The WHERE clause contains the conditions that determine which records are kept. The WHERE clause is not necessary for PROC SQL to work in general but as we are using PROC SQL to replicate a merge it will always be necessary. Also note that the RUN statement is not needed for PROC SQL: each statement executes as soon as it is submitted, and a QUIT statement ends the procedure.

The results from the merge in example 1B can be replicated with PROC SQL by eliminating the two PROC SORT steps, the MERGE data step and the PROC PRINT step, and replacing them with the following code:

Example 2A:

  PROC SQL;
    SELECT A.ZIP, A.STATE, B.CITY
    FROM STATES AS A, CITYS AS B
    WHERE A.ZIP=B.ZIP;

This results in (columns appear in the order listed in the SELECT statement):

  ZIP    STATE  CITY
  10000  NY     NEWYORK
  19000  PA     PHILADELPHIA

Notice that no CREATE TABLE line was used since all we did in our example was print the data set. If that’s all that is needed then there is no use in creating a new data set just to print it if PROC SQL can do it. If desired, titles can be defined on the output the same as any other SAS procedure.
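When the joined result is needed for later steps, the same query can write a data set instead of printing. The sketch below is illustrative only: the table name CITY_ST3 is an assumption, and the ORDER BY clause is added because a join by itself does not promise any particular row order.

```sas
/* Same inner join as example 2A, saved to a data set.  */
/* CITY_ST3 is a hypothetical output name.              */
PROC SQL NOPRINT;
  CREATE TABLE CITY_ST3 AS
  SELECT A.ZIP, A.STATE, B.CITY
    FROM STATES AS A, CITYS AS B
    WHERE A.ZIP=B.ZIP
    ORDER BY A.ZIP;      /* request sorted output explicitly */
QUIT;                    /* ends the SQL procedure           */
```

CITY_ST3 can then be used like any other SAS data set, for example in a later DATA step or PROC PRINT.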

Also notice in the SELECT statement that each variable is preceded by A or B and a period. All variables in the SELECT statement can be expressed as a two-level name separated by a period. The first level, which is optional if the variable name selected is unique between tables, refers to the table or data set name. The second is the variable name itself. In the example, instead of using the table name in the first level I used a table alias. The table aliases get defined in the FROM clause in line 3. So the FROM clause not only defines which data sets to use in the query but in example 2A we also use it to define an alias for each data set. We are calling the STATES data set “A” and the CITYS data set “B” (the AS keyword is optional). Aliases are useful as a shorthand way of selecting variables in your query when several data sets contain the same variable name. In our example both data sets contain the variable ZIP. If we selected every column with an asterisk, the printout would show both ZIP variables with no indication of which data set either one came from, and if we asked for ZIP by name without specifying which data set to use, SAS would reject the query as an ambiguous reference. If we used a CREATE TABLE statement and selected both ZIP columns, SAS would keep only the first and write a warning to the log, because it will not allow duplicate variable names on the same data set.

We could also code the SELECT statement several different ways, with and without two-level names and aliases as shown below.

• Two-level without aliases:
  SELECT STATES.ZIP, STATES.STATE, CITYS.CITY

• Two-level where necessary:
  SELECT STATES.ZIP, STATE, CITY

• Two-level with alias where necessary:
  SELECT A.ZIP, STATE, CITY

All accomplish the same result. It’s a matter of preference for the programmer.

Joins versus Merges

In the last example (2A, which replicated the merge in example 1B) we used a type of join called an inner join. Inner joins retrieve all rows that satisfy the WHERE clause. You may specify up to 16 data sets in your FROM clause in an inner join. Graphically (using only two data sets) the inner join looks like this:

[Figure: data sets A and B, inner join]

There are also joins called outer joins which can only be performed on two tables at a time. They retrieve all matching rows plus any non-matching rows from one or both of the tables depending on how the join is defined. There are three common types of outer joins: left joins, right joins and full joins. Each type of join is illustrated below:

[Figure: data sets A and B, captioned Left Join, Right Join and Full Join]

In the merge step from example 1A, the result was equivalent to performing a full join in PROC SQL where we ended up with not only matching records but also all non-matching records from each data set. The PROC SQL code would look as follows:

  PROC SQL;
    SELECT A.ZIP, A.STATE, B.CITY
    FROM STATES A FULL JOIN CITYS B
    ON A.ZIP=B.ZIP;

Outer joins use a slightly different syntax. An ON clause replaces the WHERE clause but both serve the same purpose in a query. On the FROM clause, the words FULL JOIN replace a comma to separate the tables. Similarly for left and right joins, the word FULL would be replaced by LEFT and RIGHT respectively.

Comparing SQL joins with merging, the following IF statements used in the merge example would be equivalent to these types of outer joins:

  IF A;    Left Join
  IF B;    Right Join
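One detail is worth checking when an outer join stands in for a merge. A merge carries the BY variable from whichever data set supplies the observation, whereas a query that selects A.ZIP leaves ZIP missing for rows that exist only in CITYS (ZIP 90000 in the sample data). The sketch below shows one possible way to handle this with the COALESCE function, which returns its first non-missing argument; it is an illustration under that assumption, not part of the original examples.

```sas
/* Full join that keeps the key from whichever table has it. */
PROC SQL;
  SELECT COALESCE(A.ZIP, B.ZIP) AS ZIP,   /* first non-missing ZIP */
         A.STATE,
         B.CITY
    FROM STATES A FULL JOIN CITYS B
      ON A.ZIP=B.ZIP
    ORDER BY 1;                           /* sort by first column  */
QUIT;
```

The same idea applies to a right join; in a left join A.ZIP is always present, so COALESCE is not needed there.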

How Joins Work

In theory, when a join is executed, SAS builds an internal list of all combinations of all rows in the query data sets. This is known as a Cartesian product. So for the STATES and CITYS data sets in our examples, a Cartesian product of these two data sets would contain 9 observations (3 in STATES x 3 in CITYS) and would look conceptually like the table below.

  STATE  A.ZIP  B.ZIP  CITY
  NJ     08000  10000  NEWYORK
  NJ     08000  19000  PHILADELPHIA
  NJ     08000  90000  LOSANGELES
  NY     10000  10000  NEWYORK        *
  NY     10000  19000  PHILADELPHIA
  NY     10000  90000  LOSANGELES
  PA     19000  10000  NEWYORK
  PA     19000  19000  PHILADELPHIA   *
  PA     19000  90000  LOSANGELES

The rows marked with an asterisk (italicized in the original layout) would be the ones actually selected in our inner join example because the WHERE clause specified that the zips be equal. Conceptually, SAS would evaluate this pseudo-table row by row, selecting only the rows that satisfy the join condition. This seems to be a highly inefficient way of evaluation but in reality SAS uses some internal optimizations to make this process more efficient.
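The conceptual table can be reproduced by running the join with no WHERE clause. This is a small sketch for illustration; the column names A_ZIP and B_ZIP are assumptions introduced so that the two ZIP columns can be told apart in the output.

```sas
/* Cartesian product: every STATES row paired with every CITYS row. */
/* A_ZIP and B_ZIP are hypothetical column names.                   */
PROC SQL;
  SELECT A.STATE,
         A.ZIP AS A_ZIP,
         B.ZIP AS B_ZIP,
         B.CITY
    FROM STATES AS A, CITYS AS B;    /* no WHERE clause: 3 x 3 rows */
QUIT;
```

Adding WHERE A.ZIP=B.ZIP back to this query keeps only the two rows in which A_ZIP equals B_ZIP. On large data sets an unrestricted product grows very quickly, so a query like this is best kept to small test data.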

Advantages to Using PROC SQL instead of MERGE

There are several worthwhile advantages to using PROC SQL instead of data step merging. Using PROC SQL requires fewer lines of code. In example 1B we eliminated two PROC SORTs (6 lines of code), an entire data step (5 lines) and a PROC PRINT (2 lines). We replaced them with 4 lines of PROC SQL code for a net reduction of 9 lines of code. It doesn’t sound like much but if you normally use MERGE a lot it can add up to something significant. Also, the fewer lines of code, the better the chance of error-free code.

You also don’t need to have a common variable name between the data sets you are joining. This can come in handy if you are dealing with SAS data sets created by someone who is not aware of how you are using them. If you were merging these data sets you may need to rename one of the variables on one of the data sets. The WHERE/ON clause in PROC SQL lets you specify each different variable name from each data set in the join.
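As a hedged illustration of joining on differently named keys, the sketch below first builds CITYS2, a hypothetical copy of CITYS in which the zip code variable is called ZIPCODE, and then joins it to STATES without renaming anything in the query.

```sas
/* CITYS2 and ZIPCODE are hypothetical names used only to set up */
/* a key whose name differs between the two data sets.           */
DATA CITYS2;
  SET CITYS (RENAME=(ZIP=ZIPCODE));
RUN;

PROC SQL;
  SELECT A.ZIP, A.STATE, B.CITY
    FROM STATES AS A, CITYS2 AS B
    WHERE A.ZIP=B.ZIPCODE;     /* key names differ; values match */
QUIT;
```

A MERGE of STATES and CITYS2 would need a RENAME= option on one of the data sets before the BY statement could be used.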

Another benefit is that you do not need to sort each input data set prior to executing your join. Since sorting can be a rather resource intensive operation when involving very large data sets, replacing a merge with a PROC SQL join can potentially save you processing time. Note that factors such as data set indexes and the size of the data sets can also have effects on performance. Analysis of these different factors was not performed because it is not within the scope of this paper. Any references to potential efficiency improvements are qualitative, not quantitative.

PROC SQL also forces you to be more diligent with variable maintenance. If you MERGE two or more data sets and these data sets have common variable names (not including the BY variable), the variable on the output data set will contain the values from the rightmost data set listed on the MERGE statement. This may not be what you wanted and is a common oversight when using MERGE. There is a tendency in data step merging to carry all variables and to not specify only the necessary variables. In a PROC SQL join, if you are creating a new table from two or more data sets and those data sets have common variable names, SAS will flag the conflict in the log (an ambiguous-reference error for an unqualified column name, or a warning that the variable already exists when both columns are selected) because it will not allow more than one variable with the same name on the output data set. It forces the programmer to either include only the necessary variables in the SELECT statement or rename them in the SELECT statement using the following syntax:

SELECT var1 as newvar1, var2 as newvar2

You can rename the variables as they are input to a MERGE but if you don’t do this and variable overlaying occurs, there are no error or warning messages.
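The overlay can be avoided in a merge with the RENAME= data set option. The sketch below uses two small hypothetical data sets, OLD and NEW, that share a non-BY variable named REGION; all names and values here are made up for illustration.

```sas
/* OLD and NEW are hypothetical data sets sharing the variable REGION. */
DATA OLD;
  INPUT ZIP $ REGION $;
CARDS;
10000 EAST
19000 EAST
;
RUN;

DATA NEW;
  INPUT ZIP $ REGION $;
CARDS;
10000 NORTH
19000 MIDATL
;
RUN;

/* Without RENAME=, REGION from NEW would silently replace REGION */
/* from OLD. Renaming on input keeps both values side by side.    */
DATA BOTH;
  MERGE OLD (RENAME=(REGION=OLD_REGION))
        NEW (RENAME=(REGION=NEW_REGION));
  BY ZIP;
RUN;
```

In PROC SQL the equivalent step would be written in the SELECT list, for example A.REGION AS OLD_REGION, B.REGION AS NEW_REGION.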

Another advantage is that once you become accustomed to using PROC SQL as a novice it will give you the confidence to explore many more of its uses. You can use PROC SQL to perform summaries, sorts and to query external databases. It’s a very powerful procedure.
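As a small taste of those other uses, the sketch below summarizes and sorts in a single query against the STATES data set from example 1A. The column name N_ZIPS is an assumption chosen for illustration.

```sas
/* Count zip codes per state and print the result in state order. */
/* N_ZIPS is a hypothetical column name.                          */
PROC SQL;
  SELECT STATE, COUNT(*) AS N_ZIPS
    FROM STATES
    GROUP BY STATE
    ORDER BY STATE;
QUIT;
```

No PROC SORT or PROC MEANS step is required; the GROUP BY and ORDER BY clauses handle the summary and the ordering within the query.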

Things to be careful with in PROC SQL

If you are used to using MERGE and never had any SQL experience there are a few things you must be aware of so that you fully understand what happens in SQL joins. The entire Cartesian product concept is usually a new one for most veterans of the merge. Match merging in SAS does not work in the same manner.

Match merging is best used in situations where you are matching one observation to many, many observations to one, or one observation to one. If your input data sets contain repeated BY values, i.e., many to many, the MERGE can produce unexpected results. SAS writes a note to the SAS log if that situation exists in your data. This is a helpful message because we sometimes think our data is clean but duplication of BY values can exist in each data set without our knowledge.

PROC SQL, on the other hand, will always build a virtual Cartesian product between the data sets in the join. Unlike the merge, it does not flag duplicate values of the WHERE-clause variables in any way. It’s important to understand the data.
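The difference is easiest to see with a small many-to-many case. The sketch below uses two hypothetical data sets, LEFT_DUP and RIGHT_DUP, that repeat the same ZIP two and three times; the data set names and the values are assumptions made up for this illustration.

```sas
/* LEFT_DUP has 2 rows and RIGHT_DUP has 3 rows for ZIP 10000. */
DATA LEFT_DUP;
  INPUT ZIP $ STATE $;
CARDS;
10000 NY
10000 NY
;
RUN;

DATA RIGHT_DUP;
  INPUT ZIP $ CITY $;
CARDS;
10000 NEWYORK
10000 BROOKLYN
10000 QUEENS
;
RUN;

/* Match merge pairs rows by position within the BY group: */
/* the result has 3 observations.                          */
DATA M2M_MERGE;
  MERGE LEFT_DUP RIGHT_DUP;
  BY ZIP;
RUN;

/* The join pairs every left row with every right row: */
/* the result has 2 x 3 = 6 rows.                      */
PROC SQL;
  CREATE TABLE M2M_JOIN AS
  SELECT A.ZIP, A.STATE, B.CITY
    FROM LEFT_DUP AS A, RIGHT_DUP AS B
    WHERE A.ZIP=B.ZIP;
QUIT;
```

Comparing the row counts of the inputs and the output is a simple way to notice unexpected duplicates after a join.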

Introduction to PROC FORMAT

We can also use PROC FORMAT along with some help from the PUT function to replicate a merge, but first a review of PROC FORMAT.

PROC FORMAT is often used to transform output values of variables into a different form. We use the VALUE statement to create user-defined formats. For example, let’s say we have survey response data consisting of three questions and all of the answers to our questions are coded as 1 for “yes”, 2 for “no” and 3 for “maybe”. Running a frequency on these values would show how many people answered 1, 2 or 3 but it would be much nicer if we could actually see what these values meant. We could code a PROC FORMAT step like the following:

  PROC FORMAT;
    VALUE ANSWER 1='YES'
                 2='NO'
                 3='MAYBE';
  RUN;

This creates a format called ANSWER. You could now run a frequency on the answers to Q1, Q2 and Q3 as follows:

  PROC FREQ DATA=ANSWERS;
    FORMAT Q1 Q2 Q3 ANSWER.;
    TABLES Q1 Q2 Q3;
  RUN;

We used the FORMAT statement here to apply the conversion in the PROC step. The output would show frequencies of “YES”, “NO” and “MAYBE” as opposed to their corresponding values of 1, 2 and 3.

Using PROC FORMAT in this manner is fine if you are using a static definition for your conversion. Imagine now that you have data from another survey but this time the answer codes are totally different where 1 is “Excellent”, 2 is “Good”, 3 is “Fair” and 4 is “Poor”. You would have to re-code your PROC FORMAT statement to reflect the new conversion. Although it is not a lot of work to do here, it can become cumbersome if you have to do it every day. There is a convenient way to lessen the maintenance.

Using the CNTLIN= option

There is an option in PROC FORMAT called CNTLIN= that allows you to create a format from an input control data set rather than a VALUE statement. The data set must contain three specially named variables that are recognized by PROC FORMAT: START, LABEL and FMTNAME. The variable named START would be the actual data value you are converting from, e.g., 1, 2 or 3. The LABEL variable would be the value you are converting to, e.g., YES, NO, MAYBE. The FMTNAME variable is a character string variable that contains the name of the format, e.g., ‘ANSWER’. You would simply read in the data set and appropriately assign the variable names. Then run PROC FORMAT with the CNTLIN option. See the following example:

  DATA ANSFMT;
    INFILE 'external filename';      /* placeholder for the external file */
    INPUT @1 START 1. @3 LABEL $5.;  /* informats need a trailing period  */
    FMTNAME='ANSWER';
  RUN;

  /* The external input data set would look like:
  1 YES
  2 NO
  3 MAYBE
  */

  PROC FORMAT CNTLIN=ANSFMT;
  RUN;

Both methods create the same format called ANSWER. In this example we would not have to go into the SAS code every time we had a new definition of answer codes. It allows the programmer more flexibility in automating their program.

Replicating a MERGE with PROC FORMAT using the PUT function

Sometimes it may become necessary to create an entirely new variable based on the values in your format. It may not be enough to simply mask the real answer value with a temporary value such as showing the word “YES” whenever the number 1 appears. You may need to create a new variable whose permanent value is actually “YES”. This can be done using the PUT function along with a format. To show how the PUT function is used, we’ll stay with our survey example. Our data sets look like this:

Survey Data: ANSWERS

  ID   Q1
  111  1
  222  2
  333  3

Format Data: ANSFMT

  START  LABEL  FMTNAME
  1      YES    ANSWER
  2      NO     ANSWER
  3      MAYBE  ANSWER

We will create a new variable using the PUT function that will be the value of the transformed Q1 variable. The format of the PUT function is: PUT(variable name, format). An example of the program code would look like:

  DATA ANSWERS;
    SET ANSWERS;
    Q1_A = PUT(Q1,ANSWER.);
  RUN;

The output data set would look like this:

  ID   Q1  Q1_A
  111  1   YES
  222  2   NO
  333  3   MAYBE

We have in effect “merged” on the Q1_A variable. An advantage to using this method in place of the merge would be in a situation where we had several variables (Q1, Q2, etc.) that all required the same transformation. You could just apply another PUT function for each variable rather than perform multiple merges and renaming of merge variables.
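The same technique can be applied to the zip code data from example 1A. The sketch below is one possible arrangement, not the author's code: the control data set CITYFMT, the format name CITYF and the output data set CITY_ST4 are assumed names, and the TYPE variable is set to 'C' so that PROC FORMAT builds a character format for the character variable ZIP.

```sas
/* Build a CNTLIN data set from CITYS: ZIP becomes START and CITY  */
/* becomes LABEL. CITYFMT, CITYF and CITY_ST4 are hypothetical.    */
DATA CITYFMT;
  SET CITYS (RENAME=(ZIP=START CITY=LABEL));
  RETAIN FMTNAME 'CITYF' TYPE 'C';   /* TYPE='C': character format */
RUN;

PROC FORMAT CNTLIN=CITYFMT;
RUN;

/* Attach a city to each STATES record without sorting or merging. */
DATA CITY_ST4;
  SET STATES;
  LENGTH CITY $ 15;
  CITY = PUT(ZIP, $CITYF.);
RUN;
```

Every record in STATES is kept, much like IF A in the merge. ZIP 08000 has no entry in the format, so its CITY value comes back as the unformatted string 08000 rather than a blank. The control data set should also contain each START value only once, since PROC FORMAT rejects repeated or overlapping ranges.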

Limitations and things to be careful with using the PUT function

You must be careful that all expected values of your “before” variables have a corresponding value in the format list. If a value in the data exists that has no matching value in the format data set then the new variable will retain the value of the “before” variable as a character string. There are other options in PROC FORMAT such as OTHER and _ERROR_ that can be used to highlight these types of situations. These options will not be discussed here.
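For readers who want a starting point, the sketch below shows one way an OTHER category might be supplied through a CNTLIN data set: an extra row whose HLO variable is set to 'O' marks the label used for all unmatched values. The names CITYFMT2, CITYG and CITY_ST5 and the label UNKNOWN are assumptions for illustration, and the data come from example 1A.

```sas
/* Control data set with an extra OTHER row (HLO='O').             */
/* CITYFMT2, CITYG, CITY_ST5 and the label UNKNOWN are hypothetical.*/
DATA CITYFMT2;
  SET CITYS (RENAME=(ZIP=START CITY=LABEL)) END=LAST;
  RETAIN FMTNAME 'CITYG' TYPE 'C';
  OUTPUT;                      /* one row per zip code           */
  IF LAST THEN DO;
    HLO = 'O';                 /* marks this row as OTHER        */
    LABEL = 'UNKNOWN';         /* label for all unmatched values */
    OUTPUT;
  END;
RUN;

PROC FORMAT CNTLIN=CITYFMT2;
RUN;

DATA CITY_ST5;
  SET STATES;
  LENGTH CITY $ 15;
  CITY = PUT(ZIP, $CITYG.);    /* unmatched ZIPs become UNKNOWN  */
RUN;
```

With this format, ZIP 08000 should receive the value UNKNOWN, which makes non-matches easy to count or filter afterwards.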

This method is best used when you only need to merge on one variable from another data set. If we had a data set containing eight different attributes that we want added to a master file, it would require creating eight CNTLIN data sets and eight formats before executing eight PUT functions. A MERGE or PROC SQL would be better suited in such a case.

Conclusions

As you can see, there are alternatives in SAS to using the traditional MERGE when combining data set information. It is not too difficult to learn a few beginner lines of PROC SQL code to execute what most merges do. The PROC FORMAT / CNTLIN / PUT method is also a quick way to get limited information from other data sets. Hopefully you will try out these new methods and expand your SAS knowledge in the process.

References

SAS Language Reference, Version 6, First Edition.

SAS Procedures Guide, Version 6, Third Edition.

SAS Guide to the SQL Procedure, Version 6, First Edition.

The Practical SQL Handbook, Third Edition; Bowman, Emerson, Darnovsky.

SAS is a registered trademark of the SAS Institute, Inc., Cary, NC USA.

Acknowledgements

Thanks to Perry Watts, Sr. Statistical Programmer, IMS Health and John Gerlach, SAS consultant at IMS Health, for providing various ideas, suggestions and technical assistance.

Author Information

Michael Wieczkowski
IMS HEALTH
660 West Germantown Pike
Plymouth Meeting, PA 19462
Email: [email protected]