The Power of the BY Statement
SVSUG 2009.06.25
Paul Choate, California Developmental Services
(& Toby Dunn, U.S. Army Medical Department Center & School)
BY Statement Syntax and Usage
The BY statement is used in SAS to instruct the DATA step or procedures to process dataset observations in groups, rather than singly. It can be used whenever SAS data is ordered, or can be accessed in order through a SAS dataset index.
In the DATA step this allows observations to be summarized or reorganized according to a group structure. In PROC steps it allows SAS to process and present data in groups.
The basic syntax of the BY statement is the same throughout SAS, with the exception that the GROUPFORMAT option is only available in the DATA step.
BY <DESCENDING> var1 <...<DESCENDING> varn> <NOTSORTED> <GROUPFORMAT>;
BY Statement Syntax and Usage
BY Sex Age Name;
NAME     SEX  AGE  HEIGHT  WEIGHT
Alice    F    13   56.5    84
Barbara  F    13   65.3    98
Carol    F    14   62.8    102.5
Judy     F    14   64.3    90
Jeffrey  M    13   62.5    84
Alfred   M    14   69      112.5
Ronald   M    15   67      133
Philip   M    16   72      150
BY Statement Syntax and UsageBY Statement Syntax and Usage
Sort orderSort order is platform dependent and is based on is platform dependent and is based on the internal ordering of the platform character the internal ordering of the platform character set, called the set, called the collating sequencecollating sequence. .
ASCIIASCII (PC) character set order: (PC) character set order:
..., 1, 2, 3, ... A, B, C, ... a, b, c ... ..., 1, 2, 3, ... A, B, C, ... a, b, c ...
EBCDICEBCDIC (MVS) character set order: (MVS) character set order:
..., a, b, c, ... A, B, C, ... 1, 2, 3, ... ..., a, b, c, ... A, B, C, ... 1, 2, 3, ...
BY Statement Syntax and Usage
BY Sex DESCENDING Age Name;
NAME     SEX  AGE  HEIGHT  WEIGHT
Janet    F    15   62.5    112.5
Carol    F    14   62.8    102.5
Alice    F    13   56.5    84
Barbara  F    13   65.3    98
Philip   M    16   72      150
Alfred   M    14   69      112.5
Henry    M    14   63.5    102.5
BY Statement Syntax and Usage
BY Age NOTSORTED;
NAME     AGE  HEIGHT  WEIGHT
Carol    14   62.8    102.5
Judy     14   64.3    90
Janet    15   62.5    112.5
Ronald   15   67      133
Mary     15   66.5    112
Alice    13   56.5    84
Jeffrey  13   62.5    84
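As a minimal sketch of what NOTSORTED allows, the step below reads the rows from the slide above, which are grouped by Age but not in sorted order, and numbers each group as it is encountered. The dataset names class_grouped and numbered and the variable GroupID are illustrative assumptions, not part of the original slides.

```sas
/* Rows copied from the slide: grouped by Age, but not in ascending order */
data class_grouped;
  input Name $ Age Height Weight;
  datalines;
Carol 14 62.8 102.5
Judy 14 64.3 90
Janet 15 62.5 112.5
Ronald 15 67 133
Mary 15 66.5 112
Alice 13 56.5 84
Jeffrey 13 62.5 84
;
run;

data numbered;
  set class_grouped;
  /* NOTSORTED: a new BY group starts whenever Age changes,
     so no prior PROC SORT is required */
  by Age notsorted;
  /* The sum statement retains GroupID across observations */
  if first.Age then GroupID + 1;
run;
```

Without NOTSORTED, the same DATA step would stop with an error when Age drops from 15 to 13, because the data are not in ascending order.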
BY Statement Syntax and Usage
PROC FORMAT;
  VALUE $Initials 'A'-<'B'='A'
                  'B'-<'C'='B'
                  ...;
RUN;
DATA Class;
  SET Class;
  FORMAT Name $Initials.;
  BY Name GROUPFORMAT;
RUN;
BY Statement Syntax and Usage
GROUPFORMAT (cont.)
NAME     AGE  HEIGHT  WEIGHT
Alice    13   56.5    84
Alfred   14   69      112.5
Judy     14   64.3    90
Janet    15   62.5    112.5
Jeffrey  13   62.5    84
William  15   66.5    112
Introduction to Data Structure
Variables may be divided into two classes:
• primary key variables, whose values may be combined uniquely to identify one observation or event, and
• non-primary keys, whose values cannot be combined to uniquely identify an observation.
The primary and non-primary keys are all related to each other in some fashion known as functional dependencies.
Primary keys must be unique or form unique combinations called composite keys.
Introduction to Data Structure
The most fundamental rule is that no two rows shall have the same values for all primary key variables.
VEHICLETYPE  MODEL  MAKE   YEAR  COLOR
Truck        1500   Chevy  2008  Blue
Truck        1500   Chevy  2008  Blue
should be reduced to:
VEHICLETYPE  MODEL  MAKE   YEAR  COLOR  COUNT
Truck        1500   Chevy  2008  Blue   2
Introduction to Data Structure
Each variable in the dataset should have atomic values.
VEHICLE  MODEL  YEAR  COLOR  NUMSOLD  PACKAGE
Truck    1500   2008  Blue   2        Sports, Standard
Truck    1500   2008  Gold   3        Sports, Sports, Standard
should be restructured as:
VEHICLE  MODEL  YEAR  COLOR  NUMSOLD  PACKAGE
Truck    1500   2008  Blue   1        Sports
Truck    1500   2008  Blue   1        Standard
Truck    1500   2008  Gold   2        Sports
Truck    1500   2008  Gold   1        Standard
This is called First Normal Form with Redundancies.
BY Statement in the Data Step
The BY statement provides two automatic temporary variables for each BY variable: FIRST.variable and LAST.variable.
They indicate whether an observation is:
• the first in a BY group
• the last in a BY group
• neither the first nor the last in a BY group
• both first and last, as is the case when there is only one observation in a BY group.
BY Statement in the Data Step
BY Sex Age;
SEX  AGE  FIRST.SEX  LAST.SEX  FIRST.AGE  LAST.AGE
F    13   1          0         1          0
F    13   0          0         0          1
F    15   0          0         1          0
F    15   0          1         0          1
M    13   1          0         1          1
M    14   0          0         1          0
M    14   0          0         0          1
M    16   0          1         1          1
BY Statement in the Data Step
When a sorted BY variable has unique values, its FIRST.variable and LAST.variable are both set to 1 on every observation.
Here Age is unique within Sex:
BY Sex Age;
SEX  AGE  FIRST.SEX  LAST.SEX  FIRST.AGE  LAST.AGE
F    13   1          0         1          1
F    15   0          1         1          1
M    13   1          0         1          1
M    14   0          0         1          1
M    16   0          1         1          1
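The FIRST. and LAST. flags tabulated on these slides can be inspected directly. The sketch below assumes the SASHELP.CLASS sample dataset that ships with SAS is available (the rows on the slides appear to come from it, but that is an assumption); the output names class and flags and the four copied variables are illustrative.

```sas
/* BY-group processing needs the data in BY-variable order */
proc sort data=sashelp.class out=class;
  by Sex Age;
run;

data flags;
  set class (keep=Name Sex Age);
  by Sex Age;
  /* FIRST. and LAST. variables are temporary and are not written
     to the output dataset, so copy them into ordinary variables */
  FirstSex = first.Sex;
  LastSex  = last.Sex;
  FirstAge = first.Age;
  LastAge  = last.Age;
run;

proc print data=flags noobs;
run;
```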
BY Statement in the Data Step
Examples:
• Unduplication example
• Counting records example
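The examples shown in the talk are not reproduced in this transcript. As a hedged sketch of the two techniques named above, the steps below use FIRST.Age and LAST.Age on the SASHELP.CLASS sample dataset (assumed to be available); the dataset names class, unique_sex_age, and counts and the variable Count are illustrative.

```sas
proc sort data=sashelp.class out=class;
  by Sex Age;
run;

/* Unduplication: keep only the first observation
   of each Sex/Age combination */
data unique_sex_age;
  set class;
  by Sex Age;
  if first.Age;
run;

/* Counting records: one output row per Sex/Age group */
data counts (keep=Sex Age Count);
  set class;
  by Sex Age;
  if first.Age then Count = 0;   /* reset at the start of each group */
  Count + 1;                     /* sum statement retains Count */
  if last.Age then output;       /* write the total at the end of the group */
run;
```

FIRST.Age, rather than FIRST.Sex, is the right flag here because it turns on whenever Age or any BY variable to its left changes.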
Combining Datasets
In the DATA step the BY statement is used for combining data with:
• Interleaving with the SET statement
• Match-merging with the MERGE statement
• Updating with the UPDATE statement
and
• Modifying (beyond scope of presentation)
Interleaving Datasets
When a BY statement is used with a SET statement that specifies two or more datasets, the DATA step reads the files simultaneously, alternating between the files based on the BY variable order. This maintains the sort order of the data from the datasets as they are processed.
For example, suppose there are two datasets, one for males and one for females, and both are sorted on Age. They can be interleaved into a single dataset sorted on Age and Gender.
Interleaving Datasets
Example:
• SET statement interleaving example
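A minimal sketch of the males/females case described above, assuming the SASHELP.CLASS sample dataset is available. The dataset names class, females, males, and interleaved are illustrative, and this is not necessarily the example shown in the talk.

```sas
proc sort data=sashelp.class out=class;
  by Age;
run;

/* Split into two datasets; each stays sorted on Age */
data females males;
  set class;
  if Sex = 'F' then output females;
  else output males;
run;

/* Interleave: within each Age, all rows from FEMALES are read
   before the rows from MALES, so the result is ordered by Age
   and then by dataset order in the SET statement */
data interleaved;
  set females males;
  by Age;
run;
```

Without the BY statement the same SET statement would simply concatenate the two datasets, females first.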
Interleaving Datasets
With interleaving, the sum of a variable for each by-group may be attached back to the original non-aggregated dataset.
This requires at least two passes of the data, but the efficiency and complexity may vary considerably based on the approach.
Interleaving Datasets
Example:
• Howard Schreier look-ahead processing
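One common way to attach a by-group sum to the detail rows is to interleave a dataset with itself, so that each BY group is read twice within a single DATA step. The sketch below illustrates that general pattern under the assumption that SASHELP.CLASS is available; it is not claimed to be the look-ahead example shown in the talk, and the names class, class_totals, in_pass1, and TotalWeight are illustrative.

```sas
proc sort data=sashelp.class out=class;
  by Sex;
run;

data class_totals;
  /* For each Sex, the rows of the first copy are read (pass 1),
     then the same rows of the second copy (pass 2) */
  set class (in=in_pass1)
      class;
  by Sex;
  if first.Sex then TotalWeight = 0;
  if in_pass1 then TotalWeight + Weight;  /* pass 1: accumulate the group sum */
  else output;                            /* pass 2: write detail row with the sum */
run;
```

Each output row carries the complete group total because the whole group has already been summed before its second copy is read.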
Wookie One-Liners
You?! It was your idea for Jar Jar?!
And Lando never suggested a flea bath again.
"I just need one head to finish my C3PO"
All right, all right. I promise: No more Colt-45 commercials!
What do you mean, "We're OUT of shampoo??!!!!"
Match-Merging Datasets
When a BY statement is used with a MERGE statement, the SAS datasets are read simultaneously, merging observations based on matching BY variables.
When merging multiple datasets, usually all but one of the datasets should be unique on the BY variables.
The combined unique observations are merged with each matching observation in the non-unique dataset. The unique observations are duplicated across the non-unique observations.
Match-Merging Datasets
Example:
• MERGE statement
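A hedged sketch of a one-to-many match-merge: a small lookup dataset that is unique on the BY variable is merged onto detail rows that are not. SASHELP.CLASS is assumed to be available; the lookup dataset sex_labels, its variable SexLabel, and the other dataset names are invented for illustration.

```sas
proc sort data=sashelp.class out=class;
  by Sex;
run;

/* Lookup dataset: exactly one row per value of Sex */
data sex_labels;
  length Sex $1 SexLabel $6;
  input Sex $ SexLabel $;
  datalines;
F Female
M Male
;
run;

data class_labeled;
  /* The single matching lookup row is carried across
     every detail row with the same Sex */
  merge class (in=in_class)
        sex_labels;
  by Sex;
  if in_class;   /* keep only rows present in the detail dataset */
run;
```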
Updating Datasets
The UPDATE statement only allows two datasets, a master dataset and a transaction dataset. The master dataset is specified first and the transaction dataset second, followed by a BY statement.
As with MERGE, the two datasets are read simultaneously, updating observations from the master dataset with observations from the transaction dataset based on the lowest level groupings of the BY variables.
When a transaction variable has a missing value, by default UPDATE does not overwrite the value in the master dataset, whereas the MERGE statement does.
Updating Datasets
Examples:
• Updating prices in an inventory
• Flattening a dataset
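A minimal sketch of the price-update case, using invented inventory data; the dataset names master, transactions, and master_updated, the variables Item, Price, and Qty, and all values are hypothetical and not taken from the talk.

```sas
/* Master: one row per Item, sorted by Item */
data master;
  input Item $ Price Qty;
  datalines;
A100 10.00 5
B200 20.00 8
C300 30.00 2
;
run;

/* Transactions: a missing value (.) means "leave the master value alone" */
data transactions;
  input Item $ Price Qty;
  datalines;
A100 12.50 .
C300 . 0
;
run;

data master_updated;
  update master transactions;
  by Item;
run;
```

Item A100 gets the new price and keeps its quantity of 5, B200 is unchanged, and C300 keeps its price of 30.00 while its quantity becomes 0. Replacing UPDATE with MERGE in the last step would instead set A100's quantity and C300's price to missing.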
Do-Loop of Whitlock (DoW)
The SET statement may be wrapped inside a DO UNTIL loop with the BY statement controlling the loop.
DATA ...;
  <Stuff done before break-event>;
  DO <Index Specs> UNTIL <Break-Event>;
    SET ...;
    BY ...;
    <Stuff done for each record>;
  END;
  <Stuff done after break-event...>;
RUN;
Do-Loop of Whitlock (DoW)
The DoW works with the natural execution of the DATA step by isolating what happens between two consecutive break events.
Statements and functions are placed within the loop, and the implicit action of the DATA step resets calculated values to missing after each BY group.
In our example the break events are BY groups, but in other cases a break event could be anything that triggers the DO loop to stop.
Do-Loop of Whitlock (DoW)
Examples:
• Standard DATA step
• Whitlock/Dorfman DoW
• Sequential DoWs
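The skeleton above can be filled in as follows. This is a sketch under the assumption that SASHELP.CLASS is available, computing a per-group count and mean; the names class, sex_summary, NStudents, TotalHeight, and MeanHeight are illustrative, and the talk's own examples are not reproduced here.

```sas
proc sort data=sashelp.class out=class;
  by Sex;
run;

data sex_summary (keep=Sex NStudents MeanHeight);
  /* Each iteration of the implicit DATA step loop consumes one
     whole BY group, so NStudents and TotalHeight start out
     missing for every group without RETAIN or explicit resets */
  do until (last.Sex);
    set class;
    by Sex;
    NStudents   = sum(NStudents, 1);        /* SUM() ignores the initial missing */
    TotalHeight = sum(TotalHeight, Height);
  end;
  MeanHeight = TotalHeight / NStudents;
  /* The implicit OUTPUT at the bottom of the step writes
     one observation per Sex */
run;
```

The equivalent standard DATA step would need a RETAIN (or sum statements), a reset on FIRST.Sex, and an explicit OUTPUT on LAST.Sex.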
The BY Statement in SAS Procedures
Nearly all SAS PROCs that process datasets allow for the BY statement.
The syntax is the same as in the DATA step, except for the GROUPFORMAT option which is only available to the DATA step.
Procedures that produce printed output, such as PROC PRINT, format printed output into BY groups.
Procedures that summarize datasets, like PROC FREQ or PROC SUMMARY, process the data in groups, sometimes as an alternative to other statements such as TABLES or CLASS.
The PRINT Procedure
PROC PRINT writes dataset values in columnar table form with the variable names or labels at the top of each column.
The BY statement, and the related PAGEBY and SUMBY statements, can be used with PROC PRINT.
The PRINT Procedure
Examples:
• BY statement
• BY statement with ID statement
• PAGEBY statement
• SUMBY statement
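The four statements listed above can be combined in one step. The sketch below assumes SASHELP.CLASS is available and is only one possible arrangement, not the talk's own examples; the output dataset name class is illustrative.

```sas
proc sort data=sashelp.class out=class;
  by Sex Age;
run;

proc print data=class;
  by Sex Age;      /* one section of output per Sex/Age group */
  id Sex Age;      /* ID variables matching the BY variables give a compact layout */
  pageby Sex;      /* start a new page whenever Sex changes */
  sum Weight;      /* total the Weight column */
  sumby Sex;       /* print subtotals only when Sex changes, not for every Age */
run;
```

PAGEBY and SUMBY each name a variable that must also appear in the BY statement.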
The FREQ Procedure
The FREQ procedure calculates frequencies and statistics on discrete variables. These can be printed or output.
Levels of a tabulation are requested with a TABLES statement, or for sorted variables with a BY statement.
With a BY statement, PROC FREQ does not show rows or columns for categories of a variable that are absent from a BY group; when the same variable is crossed in the TABLES statement, the row or column is zero-filled.
The BY and TABLES statements produce different statistics for tabulation levels with missing categories.
The FREQ Procedure
Examples:
• PROC FREQ with the TABLES statement
• PROC FREQ with the BY statement
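The two approaches can be compared side by side. This sketch assumes SASHELP.CLASS is available and is not the talk's own example; the output dataset name class is illustrative.

```sas
/* TABLES only: a single Sex-by-Age crosstab; an Age that occurs
   for only one Sex still gets a cell, filled with zero, for the other */
proc freq data=sashelp.class;
  tables Sex*Age / norow nocol nopercent;
run;

/* BY statement: a separate one-way table of Age for each Sex;
   an Age that does not occur within a Sex has no row at all */
proc sort data=sashelp.class out=class;
  by Sex;
run;

proc freq data=class;
  by Sex;
  tables Age;
run;
```

Percentages also differ between the two: in the crosstab they are based on the whole dataset (unless row or column percentages are requested), while in the BY version they are computed within each BY group.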
The SUMMARY or MEANS Procedure
In PROC SUMMARY and PROC MEANS the BY statement is an alternative to the CLASS statement.
All combinations of the CLASS variables are summarized. For three class variables A, B, and C, statistics are calculated for the overall data and all levels of A, B, C, A*B, A*C, B*C, and A*B*C.
Sorted variables may alternatively be specified in a BY statement, but only combinations that include those variables will be summarized.
For example, if A is specified in the BY statement rather than the CLASS statement, then only statistics for A, A*B, A*C, and A*B*C are produced.
The SUMMARY or MEANS Procedure
Examples:
• PROC SUMMARY with the CLASS statement
• PROC SUMMARY with the BY statement
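A hedged sketch of the difference, with two class variables rather than three and assuming SASHELP.CLASS is available; the dataset names class, class_stats, and by_stats and the variable MeanHeight are illustrative.

```sas
/* CLASS only: the output has rows for the overall data,
   each Sex, each Age, and each Sex*Age combination
   (distinguished by the automatic _TYPE_ variable) */
proc summary data=sashelp.class;
  class Sex Age;
  var Height;
  output out=class_stats mean=MeanHeight;
run;

/* Sex moved to BY: the data must be sorted by Sex, and only
   rows for each Sex and each Sex*Age combination are produced */
proc sort data=sashelp.class out=class;
  by Sex;
run;

proc summary data=class;
  by Sex;
  class Age;
  var Height;
  output out=by_stats mean=MeanHeight;
run;
```

In the second step there is no overall row and no Age-only row, because every summary is computed inside a single value of Sex.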
The SQL Procedure
PROC SQL performs actions similar to both the DATA step and summarizing procedures such as SUMMARY, TABULATE, and UNIVARIATE.
PROC SQL has its own syntax, conforming to the SQL programming language.
The BY statement in PROC SQL is replaced by the GROUP BY clause.
In PROC SQL, if the data are not sorted, the procedure sorts them internally as needed by the GROUP BY clause.
The SQL Procedure
Example:
• Aggregating grouped data with PROC SQL GROUP BY statement
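A minimal sketch of grouped aggregation in PROC SQL, assuming SASHELP.CLASS is available; the table name sex_totals and the column aliases are illustrative, and this is not the talk's own example.

```sas
proc sql;
  /* No PROC SORT is needed first: PROC SQL orders the rows
     internally as required by GROUP BY */
  create table sex_totals as
  select Sex,
         count(*)     as NStudents,
         mean(Height) as MeanHeight,
         sum(Weight)  as TotalWeight
  from sashelp.class
  group by Sex;
quit;
```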
Thanks to SVSUG Chair
Andrew Karp
Contact Information
Your comments and questions are valued and encouraged.
Paul Choate, California Developmental Services
Phone: (916) 654-2160
E-mail: [email protected]