© 2006, Cognizant Technology Solutions. All Rights Reserved. The information contained herein is subject to change without notice.
SAS Training 101
Module 1
• Introduction to SAS
• Getting/Extracting Data in/from SAS
• Working with the Data
Module 2
• Introduction to SAS Proc Statements
• Combining and Modifying SAS Datasets
Module 3
• Proc SQL
• Arrays / DO-END
• Retain / First. Last.
Agenda
Introduction to SAS
• Getting Started with SAS environment
• The two parts of a SAS program
• Reading the SAS Log
• SAS Dataset
Getting/Extracting Data in/from SAS
• SAS Data Libraries
• Importing Data
• Exporting Data
Working with the Data
• Data Step OPTIONS
• Using IF-THEN Statements
• Using RETAIN and SUM Statements
• PROC PRINT and PROC CONTENTS
Agenda – Module 1
A programming environment and language for data manipulation and analysis
Data Warehousing - Easily access, manage and analyze data from many sources
Analytical Solutions - From simple to advanced statistics
Business Solutions - Manages and reports on data from many sources
What is SAS ?
Interactive windows provide the interface to SAS
Navigating SAS Windowing Environment
• Editor window: write programs
• Submit (Run): execute the SAS program
• Log window: contains information about the processing of the SAS program, including warning and error messages
• Output window: contains reports generated by SAS procedures and DATA steps
• Explorer window: view SAS datasets
Getting Started With SAS
Select the Explorer tab in the SAS window bar to open the Explorer window
Functionality of the SAS Explorer is similar to explorers for Windows-based systems
Select View > Explorer
Expand and collapse directories on the left. Drill down and open specific files on the right
Right-click on a SAS dataset and select Properties
– Provides general information about the dataset
Double-click on the dataset to open it in the VIEWTABLE window
– Can be used to edit datasets, create datasets and customize the view of a SAS dataset
Exploring SAS Libraries
Select File > Open, or click the Open icon, and select the file D:\Projects\......
Click the Submit icon or select Run > Submit* to submit the program for execution
Enhanced Editor
Access and edit existing SAS programs
Write new SAS programs
Submit SAS programs
Save SAS programs to a file
* Programs can also be executed without opening them in the SAS environment using batch submit
Open a SAS Program
Running a SAS Program
Log and Output windows are open by default. These can also be accessed by selecting Window > Log and Window > Output respectively
Log Window
• An audit trail of the SAS session
• Contains programming statements as submitted
• Contains notes about files read, records read, and program execution and results
• Contains warning and error messages
Output Window
• Accumulates output in the order in which it is generated
• Select Edit > Clear All to clear the contents of the window
LOG and OUTPUT windows
Raw Data or SAS Data Set -> DATA step -> SAS Data Set -> PROC step -> Output
DATA steps are used to CREATE SAS datasets
PROC steps are used to PROCESS SAS datasets
A SAS program is a sequence of steps that the user submits for execution
SAS Statements
• Usually begin with an identifying keyword
• Always end with a semicolon
• Text that begins with /* and ends with */ is treated as a comment
SAS Syntax Rules
• SAS statements can be in upper or lower case
• One or more blanks or special characters can be used to separate words
• They can begin and end in any column
• A single statement can span multiple lines
• Several statements can be on the same line
SAS Programs
DATA steps
• Begin with DATA statements
• Read and modify data
• Create a SAS dataset
PROC steps
• Begin with PROC statements
• Perform a specific analysis or function
• Produce results or reports
PROC steps can also create datasets
A step ends when SAS encounters the start of a new step (a DATA or PROC statement) or a RUN statement
A DATA step executes statement by statement, once for each observation
DATA and PROC steps
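The two kinds of steps can be sketched with a small in-stream dataset. This is a minimal illustration only: the dataset name work.staff and the three records are borrowed from other slides in this deck, and list input is assumed instead of the column input used elsewhere.

```sas
/* DATA step: creates the dataset work.staff from in-stream records */
data work.staff;
   input LastName $ JobTitle $ Salary;
   datalines;
TORRES Pilot 50000
LANGKAMM Mechanic 80000
SMITH Mechanic 40000
;
run;

/* PROC step: processes the dataset and produces a report */
proc print data=work.staff;
run;
```

Each RUN statement closes its step, so the log shows one set of notes for the DATA step and one for the PROC step.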
Syntax errors include
• Misspelled keywords
• Missing or invalid punctuation
• Invalid options
When SAS encounters a syntax error, SAS identifies the error and writes the location and explanation of the error to the SAS log
daat work.staff;                  /* error: misspelled keyword DATA */
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
      JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff        /* error: missing semicolon */
run;
Diagnosing and Correcting Syntax Errors
Debugging a SAS Program
data work.staff;
infile 'raw-data-file;
input LastName $ 1-20 FirstName $ 21-30
      JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class JobTitle;
var Salary;
run;
Submitting a SAS Program That Contains Unbalanced Quotes
Open and submit the code where the closing quote for the INFILE statement is missing
Submit the program and browse the SAS log
There are no notes in the SAS log because all the SAS statements after the INFILE statement have become part of the quoted string
To correct the problem in the Windows environment, click the Break icon
Select Cancel Submitted Statements in the Tasking Manager window and select OK
Canceling Submitted Statements
SAS Data Sets:
The descriptor portion of a SAS dataset is the header; the data portion is a rectangular table of data values
Variable names (first row below) and variable values (remaining rows):
LastName FirstName JobTitle Salary
TORRES JAN Pilot 50000
LANGKAMM SARAH Mechanic 80000
SMITH MICHAEL Mechanic 40000
WAGSCHAL NADJA Pilot 77500
TOERMOEN JOCHEN Pilot 65000
LastName, FirstName and JobTitle hold character values; Salary holds numeric values
Variables (Columns) : Correspond to fields of data, and each data column is named
Observations (Rows) : Correspond to records or data lines
SAS Dataset
Variable Names
• Can be up to 32 characters long
• Can be uppercase, lowercase or mixed-case. Variable names are not case-sensitive.
• Must start with a letter or underscore. Subsequent characters can be letters, underscores or numeric digits (no special characters)
Examples
Valid names:
Data_5
bad
cub2c3
Invalid names:
Data 5
1bad
count # 5
Variable Values
Variable Types
• Character: Contains any value: letters, numbers, special characters, and blanks. Character values are stored with a length of 1 to 32,767 bytes
• Numeric: Stored as floating point numbers in 8 bytes of storage by default
A date is stored as a numeric variable in SAS. Conversely, any numeric variable may be interpreted as a date. Internally, a date value is an integer that represents the number of days since January 1, 1960
SAS allows dates to be read and output in various formats
The TODAY() function returns the current date
A date literal is specified as '<formatted date>'d, e.g. '31DEC1959'd
Commonly used formats are:
Stored Value Format Displayed Value
0 MMDDYY8. 01/01/60
0 MMDDYY10. 01/01/1960
-1 DATE9. 31DEC1959
365 DDMMYY10. 31/12/1960
Variable Names and Values
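The way dates are stored can be checked with a short DATA step that writes to the log. This is a small sketch; the variable names d and t are chosen only for illustration.

```sas
data _null_;
   d = '31DEC1959'd;        /* date literal: one day before January 1, 1960 */
   put d=;                  /* unformatted: writes d=-1 */
   put d= date9.;           /* formatted: writes d=31DEC1959 */
   t = today();             /* current date as a number of days since 01JAN1960 */
   put t= mmddyy10.;
run;
```

The stored value never changes; only the format decides how it is displayed.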
A SAS data library is a collection of SAS files that are recognized as a unit by SAS
SAS data libraries are identified by assigning a library reference name
On invoking SAS, one automatically has access to a temporary and a permanent SAS data library
Work - Temporary library
Sasuser - Permanent library
One can also create and access new permanent libraries
The Work library and its SAS data files are deleted after the SAS session ends
SAS datasets in permanent libraries are saved after the SAS session ends
Libname sample "C:\mysasfiles";
SAS Data Library - Sample (contains the SAS files stored in C:\mysasfiles)
SAS Data Libraries
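The difference between the temporary and a permanent library can be sketched as follows. The library reference sample and the path C:\mysasfiles come from the slide; the dataset names scratch and keepme are hypothetical, and the folder is assumed to exist and be writable.

```sas
/* Assumes the folder C:\mysasfiles exists and is writable */
libname sample "C:\mysasfiles";

/* Temporary: stored in WORK and deleted when the session ends */
data work.scratch;
   x = 1;
run;

/* Permanent: stored as a file in C:\mysasfiles and kept after the session */
data sample.keepme;
   set work.scratch;
run;

/* List the members of the permanent library */
proc datasets library=sample;
quit;
```

A one-level name such as scratch refers to work.scratch, so the WORK prefix is optional.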
data PS_AA_team;
input NAME $ Age prior_work_ex $;
datalines;
Sayaji 40 Y
Vikrant 30 Y
Yashjit 30 Y
Hita . Y
Tuhin 20 N
Sharmila . N
Aditi . N
Shikha . Y
Anirban 30 Y
Lata . Y
Deepak 20 N
Ambrish 20 N
Vaibhav 20 N
;
run;
• The DATALINES (or CARDS) statement marks the start of in-stream data
• The default type of a variable is numeric; a $ after the variable name makes it character
• A missing numeric value needs to be entered as "."
• The default length for character variables is 8
Creating Data
General form of an informat:
$informat-namew.d
$ indicates a character informat
informat-name names the informat
w is an optional field width
. is the required delimiter
d optionally specifies the number of decimal places for numeric informats
Informat statement
7. or 7.0 reads seven columns of numeric data.
7.2 reads seven columns of numeric data and, if the data value contains no decimal point, inserts one before the last two digits.
$5. reads five columns of character data and removes leading blanks.
$CHAR5. reads five columns of character data and preserves leading blanks.
COMMA7. reads seven columns of numeric data and removes selected nonnumeric characters, such as dollar signs and commas.
MMDDYY10. reads dates of the form 01/20/2000
Selected Informats
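The effect of these informats can be tried out with the INPUT function in a DATA step that only writes to the log. This is a minimal sketch; the literal values and the variable names a, b and c are made up for illustration.

```sas
data _null_;
   a = input('1234567', 7.2);            /* no decimal point in the data: 12345.67 */
   b = input('$12,345', comma7.);        /* dollar sign and comma removed: 12345 */
   c = input('01/20/2000', mmddyy10.);   /* read as a SAS date value */
   put a= b= c= date9.;
run;
```

The DATE9. format in the PUT statement applies only to c, so the date is displayed as 20JAN2000 rather than as a day count.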
• List directed input - data must be separated by a delimiter; all variables must be read. In delimited data, the values are separated by a specially designated character called the delimiter. For example, in comma separated values, the comma separates individual data values from each other.
• Column input - data is in fixed columns; one must know where each value starts and ends; selected variables can be read. In fixed format files the data values are placed at pre-specified column positions in the data file.
• Informat (formatted input) - alternative to column input; most flexible; must be used for special data such as dates
• Input data can have variable names as part of the file. If the names of the variables are given in the topmost row of the file, one can use PROC IMPORT
                  Fixed Format                        Delimited
Names Available   PROC IMPORT (Use Wizard)            PROC IMPORT
Raw Data          INFILE / INPUT                      INFILE / INPUT
                  (@ signifies the start of           (DLM option)
                  the data value)
Importing Data
Raw Files
Infile "X:\raw-file"
LRECL = <length-of-observation> MISSOVER;
Input @<start-of-var1> var1 <length-of-var1>.
@<start-of-var2> var2 <length-of-var2>.
.
.
@<start-of-varN> varN $<length-of-varN>.
;
To read a fixed format raw file, one needs to know the exact position where each variable starts and the length of the variable
For every character variable, the $ symbol is used while declaring its length
If no $ symbol is used, the variable is taken as numeric by default
The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing.
FIRSTOBS= tells SAS at which line to begin reading data
OBS= specifies the last line to be read (when reading starts at line 1, this equals the number of records read)
DLM= specifies the delimiter used
Importing Data (Fixed Format / Delimited)
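For the delimited case, the INFILE options above might be combined as in the following sketch. It reads in-stream lines instead of an external file so that it runs as is; the dataset name work.visits and the records are hypothetical.

```sas
data work.visits;
   /* DLM= sets the delimiter, FIRSTOBS=2 skips the header line,
      MISSOVER sets Visits to missing when a line ends early */
   infile datalines dlm=',' firstobs=2 missover;
   input City $ Visits;
   datalines;
City,Visits
Pune,12
Chennai,7
Kolkata
;
run;

proc print data=work.visits;
run;
```

Without MISSOVER, SAS would move to the next line to look for the missing Visits value of the last record.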
data <dataset>;
infile "X:\YYY.txt"
LRECL = 99 MISSOVER;
input @1 DOCID 9.
@10 SPEC $30.
@40 STREET $25.
@65 CITY $20.
@85 STATE $2.
@87 ZIP $3.
@90 PHONE 10. ;
run;
Example: Convert a fixed format file (YYY.txt) to a SAS dataset.
Layout of YYY.txt
Start  End  Length  Type  Variable  Description
1      9    9       Num   DOCID     Doctor ID
10     39   30      Char  SPEC      Speciality
40     64   25      Char  STREET    Address - Street
65     84   20      Char  CITY      Address - City
85     86   2       Char  STATE     Address - State
87     89   3       Char  ZIP       Address - ZIP
90     99   10      Num   PHONE     Telephone Number
General form of the IMPORT procedure
PROC IMPORT OUT=SAS-data-set
DATAFILE='external-file-name'
DBMS=file-type;
GETNAMES=YES;
RUN;
Example Code
PROC IMPORT datafile='D:\fun\Ritesh Training\comp.csv' out=yyy
DBMS=CSV REPLACE;
GETNAMES=YES;
RUN;
PROC IMPORT
PROC IMPORT with a slight change can read a delimited file. General format is:
PROC IMPORT OUT=SAS-data-set
DATAFILE='external-file-name'
DBMS=file-type REPLACE;   /* file-type for delimited files: CSV, TAB or DLM */
GETNAMES=YES;
RUN;
Example: The following code converts a tab-delimited file to a SAS dataset
PROC IMPORT datafile = 'D:\fun\Ritesh Training\Broker comp file.txt' out=xxx
DBMS=TAB REPLACE;
GETNAMES=YES;
run;
Delimited Text Files
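For delimiters other than comma or tab, PROC IMPORT accepts DBMS=DLM together with a DELIMITER statement. The sketch below first writes a small pipe-delimited file so that it is self-contained; the file name demo_pipe.txt, its location in the WORK directory and the records are assumptions made for illustration.

```sas
/* Assumption: write the demo file into the WORK directory so the path exists */
%let demofile = %sysfunc(pathname(work))/demo_pipe.txt;

data _null_;
   file "&demofile";
   put 'Name|Score';
   put 'Asha|90';
   put 'Ravi|85';
run;

/* DBMS=DLM with a DELIMITER statement reads any single-character delimiter */
proc import datafile="&demofile" out=work.demo_pipe dbms=dlm replace;
   delimiter='|';
   getnames=yes;
run;

proc print data=work.demo_pipe;
run;
```

With GETNAMES=YES the first line supplies the variable names Name and Score.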
Select the type of raw file which is to be imported
Browse to the raw file
The Import Wizard is a graphical interface provided by SAS to convert a raw data file to a SAS dataset. It can only convert delimited and Excel files to SAS files.
IMPORT Wizard
Enter the library name and the name under which you want to save the SAS dataset. Press "Finish" to convert the raw file to a SAS dataset
The Import Wizard first generates PROC IMPORT code and then executes it. You can save the code that the wizard generates.
IMPORT Wizard
A SAS dataset can be converted into other file formats by using either "proc export" or the SAS "export wizard"
The following code segment illustrates the use of the export procedure in SAS to output a file in the csv format.
PROC EXPORT DATA= <Name of Dataset> OUTFILE= <Output Filename> DBMS=CSV REPLACE;
RUN;
Note: The output filename should be given in quotes with the full path
Example Code
Exporting Data From SAS
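A concrete call might look like the sketch below. It assumes the sample dataset sashelp.class that ships with SAS, and it writes to the WORK directory only so that the path exists on any system; the file name class.csv is hypothetical.

```sas
/* Assumption: use the WORK directory as a location that always exists */
%let outdir = %sysfunc(pathname(work));

proc export data=sashelp.class
            outfile="&outdir/class.csv"
            dbms=csv replace;
run;
```

REPLACE overwrites the file if it already exists; without it, a second run leaves the existing file untouched.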
Step 1: Click on file and select “Export Data”
Step 2: Select the Data to be exported
SAS “export wizard” allows us to convert a SAS dataset into other file formats without having to write any code.
EXPORT wizard
The SAS “export wizard” also allows us to save the corresponding “proc export” code
Step 3: Select the file format
Step 4: Specify the output filename and its location
Step 5: Enter the filename to save the code for export
EXPORT wizard
• SAS language has 3 types of options:
• System options – they have the most global influence (stay in effect for the duration of your job/session) and affect how SAS operates. They are issued when you invoke SAS or when you use the OPTIONS statement
• Statement options – they appear in individual statements and influence how SAS runs that particular DATA or PROC step. DATA=, for example, is a statement option telling SAS which dataset to use for a procedure
• Data set options – they affect only how SAS reads or writes an individual data set. You can use data set options in DATA or PROC statements. Simply put the option between parentheses directly following the data set name. Examples:
• KEEP = variable list
• DROP = variable list
• RENAME = (oldvar = newvar)
• FIRSTOBS = n
• IN = new_var_name
Using SAS Data Set Options
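Data set options on input and output data sets could be combined as in this sketch, which assumes the sample dataset sashelp.class shipped with SAS; the output name work.class_small and the new variable name AgeYears are hypothetical.

```sas
data work.class_small (keep=Name Age rename=(Age=AgeYears));
   /* FIRSTOBS=2 starts reading at the second observation;
      DROP= removes Height and Weight as the data is read */
   set sashelp.class (firstobs=2 drop=Height Weight);
run;

proc print data=work.class_small;
run;
```

On the output data set, KEEP= is applied before RENAME=, which is why KEEP= still refers to the original name Age.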
• The PUT function is used to convert variables from numeric to character, and the INPUT function is used for the reverse
Character to Numeric: newvar = INPUT (oldvar, informat);
Numeric to Character: newvar = PUT (oldvar, format);
Examples
Character to Numeric: newB = INPUT (VarB, 1.);
Numeric to Character: newD = PUT (VarD, 2.);
PUT/INPUT Functions
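Both conversions can be tried in a DATA step that only writes to the log. The variable names follow the slide; the values '7' and 42 are made up for illustration.

```sas
data _null_;
   VarB = '7';                 /* character value */
   newB = input(VarB, 1.);     /* character to numeric, read with the 1. informat */
   VarD = 42;                  /* numeric value */
   newD = put(VarD, 2.);       /* numeric to character, written with the 2. format */
   put newB= newD=;
run;
```

INPUT takes an informat because it reads a value, while PUT takes a format because it writes one.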
Basic form:
IF condition THEN action;
If model = 'Mustang' Then Make = 'Ford';
• You can use symbolic or mnemonic operators
Symbolic  Mnemonic
=         EQ
^=, ~=    NE
>         GT
<         LT
>=        GE
<=        LE
(In a DATA step, <> is the MAX operator, so it should not be used for "not equal")
• You may also use the "IN" operator to make comparisons
Example:
If Model IN ('Corvette', 'Camaro') Then Make = 'Chevrolet';
Using IF-THEN Statements
• A single IF-THEN statement can have only one action. To execute more than one action, add DO and END
Example,
If Model = 'Mustang' Then DO;
Make = 'Ford';
Size = 'Compact';
End;
• To test more than one condition, combine them with AND / OR
Example,
If Model = 'Mustang' and Year < 1975 Then Status = 'Classic';
Using IF-THEN Statements
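The IF-THEN forms shown so far could be put together in one runnable DATA step, as sketched below. The dataset name work.cars and the three records are hypothetical; the LENGTH statement is added so that the assigned character values are not truncated.

```sas
data work.cars;
   length Model $ 8 Make $ 9 Size $ 7 Status $ 7;
   input Model $ Year;
   /* DO ... END groups several actions under one condition */
   if Model = 'Mustang' then do;
      Make = 'Ford';
      Size = 'Compact';
   end;
   /* IN compares against a list of values */
   if Model in ('Corvette', 'Camaro') then Make = 'Chevrolet';
   /* AND combines two conditions */
   if Model = 'Mustang' and Year < 1975 then Status = 'Classic';
   datalines;
Mustang 1966
Camaro 1969
Mustang 1999
;
run;

proc print data=work.cars;
run;
```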
Basic form:
IF condition THEN action;
ELSE IF condition THEN action;
ELSE action;
• The final ELSE action is executed automatically for all observations that fail to satisfy any of the previous IF conditions
• An ELSE IF statement is simply an IF-THEN statement with an ELSE tacked onto the front
Using IF-THEN-ELSE Statements
Data from a survey of home improvements, containing the owner's name, a description of the work done and the cost of the improvement. Group the cost into High, Medium, Low.
Gregory cabinet facelift          2000
Molly   bathroom addition         11350
Luther  paint exterior            3910
Susan   second floor              75362.9
Code:
Data home_cost;
Infile 'C:\Home_data.dat';
Input Owner $ 1-7 Description $ 9-33 Cost;
Length CostGrp $ 6;   /* long enough for 'medium'; otherwise 'low' would fix the length at 3 */
If Cost < 2000 Then CostGrp = 'low';
Else if Cost < 10000 Then CostGrp = 'medium';
Else CostGrp = 'high';
Run;
Example
Often you want to use some of the observations of the dataset and exclude the rest
Use a subsetting IF statement in a DATA step
• Basic form: IF expression;
• Example: If sex = 'f';
Alternatively, use the DELETE statement
• Example: If sex = 'm' Then delete;
Use IF when it is easier to specify a condition for including observations
Use DELETE when it is easier to specify a condition for excluding observations
Subsetting your data
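Both ways of subsetting can be compared on a small dataset. This is a sketch with made-up records; the dataset names work.people, work.females and work.females2 are hypothetical.

```sas
data work.people;
   input Name $ Sex $;
   datalines;
Asha f
Ravi m
Lata f
;
run;

/* Subsetting IF: keep only the observations where the condition is true */
data work.females;
   set work.people;
   if Sex = 'f';
run;

/* DELETE: drop the observations where the condition is true */
data work.females2;
   set work.people;
   if Sex = 'm' then delete;
run;
```

Here both output datasets contain the same two observations, because every record is either 'f' or 'm'.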
• When reading raw data, SAS sets the value of all variables equal to missing at the start of each iteration of the DATA step.
• With the RETAIN statement, a variable keeps its value from the previous iteration of the DATA step
• Basic form : RETAIN variables;
RETAIN variables initial-value;
• A sum statement also retains values from the previous iteration of the DATA step, but you use it for cases where you simply want to cumulatively add the value of an expression to a variable
• Basic form: variable + expression;
Using RETAIN and SUM statements
Data from baseball games containing the date the game was played, the team played, and the hits and runs for the game
6-19 Columbia Peaches      8  3
6-20 Columbia Peaches      3  4
7-1  Plains Peanuts       10  5
7-2  Plains Peanuts        2  3
7-4  Sacremento           10 10
7-5  Sacremento           12  8
The team wants two additional variables: the cumulative number of runs for the season and the maximum number of runs in a game to date.
Example
Data games;
Infile 'C:\Games.dat';
Input Month 1 Day 3-4 Team $ 6-25 Hits 27-28 Runs 30-31;
RETAIN MaxRuns;
MaxRuns = Max (MaxRuns, Runs);
RunsToDate + Runs;
Run;
Example (Contd)..
Questions ??????
Introduction to SAS Proc Statements» Proc Sort» Proc Means» Proc Freq» Proc Summary» Proc Transpose
Combining and Modifying SAS Datasets» Set statement» Merge statement
Agenda – Module 2
Start with the keyword PROC
E.g.:
• PROC CONTENTS DATA = Sales_force_team;
SAS will use the most recently created dataset if the DATA= option is not specified
BY statement
» Required only for PROC SORT
» Everywhere else it is optional, and SAS performs a separate analysis for each combination of values of the BY variables (the data must already be sorted by the BY variables)
SAS Procedures
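Default sorting is ascending
Form of PROC SORT statement
PROC SORT Data = data-name;
BY variable-1 variable-2 variable-3 … variable-n;
RUN;
NODUPKEY keeps only the first observation for each combination of BY variable values and eliminates the rest
PROC SORT Data = data-name Out = data-name NODUPKEY ;
Sorting in descending order: put DESCENDING before each variable it should apply to
BY variable-1 DESCENDING variable-2 DESCENDING variable-3 ;
(here variable-1 is sorted ascending, variable-2 and variable-3 descending)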
PROC SORT
data marine;
input NAME $ FAMILY $ length ;
datalines;
beluga whale 15
whale shark 40
basking shark 30
gray whale 50
mako shark 12
sperm whale 60
dwarf shark .5
whale shark 40
humpback . 50
blue whale 100
killer whale 30
;
run;
PROC SORT data = marine out = seasort NODUPKEY ;
BY family DESCENDING length;
run;
PROC PRINT data = seasort;
TITLE 'Whales and Sharks';
run;
OUTPUT
Whales and Sharks
Obs  Name      Family  Length
1    humpback  .         50.0
2    whale     shark     40.0
3    basking   shark     30.0
4    mako      shark     12.0
5    dwarf     shark      0.5
6    blue      whale    100.0
7    sperm     whale     60.0
8    gray      whale     50.0
9    killer    whale     30.0
10   beluga    whale     15.0
PROC SORT … Example
Form of PROC MEANS statement
PROC MEANS Data = data-name options;
BY variable-list;
VAR variable-list;
RUN ;
If PROC MEANS is used with no other options, it gives the number of non-missing values, mean, standard deviation, minimum and maximum for all numeric variables
Writing summary statistic into a SAS dataset
PROC MEANS Data = zoo NOPRINT;
VAR lions tigers bears;
OUTPUT OUT = zoosum MEAN ( lions bears ) = Avglionwt Avgbearwt
SUM ( tigers ) = Tottigerwt;
RUN ;
PROC MEANS
data cake;
input LastName $ 1-12 Age 13-14 PresentScore 16-17 TasteScore 19-20 Flavor $ 23-32 Layers 34 ;
datalines;
Orlando     27 93 80  Vanilla    1
Ramey       32 84 72  Rum        2
Goldston    46 68 75  Vanilla    1
Roe         38 79 73  Vanilla    2
Larsen      23 77 84  Chocolate  .
Davis       51 86 91  Spice      3
Strickland  19 82 79  Chocolate  1
Nguyen      57 77 84  Vanilla    .
Hildenbrand 33 81 83  Chocolate  1
Byron       62 72 87  Vanilla    2
Sanders     26 56 79  Chocolate  1
Jaeger      43 66 74             1
Davis       28 69 75  Chocolate  2
Conrad      69 85 94  Vanilla    1
Walters     55 67 72  Chocolate  2
Rossburger  28 78 81  Spice      2
Matthew     42 81 92  Chocolate  2
Becker      36 62 83  Spice      2
Anderson    27 87 85  Chocolate  1
Merritt     62 73 84  Chocolate  1
;
proc means data=cake n mean max min range std fw=8;
var PresentScore TasteScore;
title 'Summary of Presentation and Taste Scores';
run;
Summary of Presentation and Taste Scores
The MEANS Procedure
Variable       N    Mean     Maximum  Minimum  Range    Std Dev
PresentScore   20   76.150   93.000   56.000   37.000   9.376
TasteScore     20   81.350   94.000   72.000   22.000   6.611
OUTPUT
PROC MEANS … Example
Form of PROC FREQ statement
PROC FREQ Data = data-name options;
BY variable-list;
OUTPUT statistic-keyword(s) <OUT=SAS-data-set>;
TABLES request(s) </ option(s)>;
RUN ;
Use this statement   To do this
BY                   Calculate separate frequency or cross-tabulation tables for each BY group
OUTPUT               Create an output data set that contains specified statistics
TABLES               Specify frequency or cross-tabulation tables and request tests and measures of association
PROC FREQ
data color;
input Region Eyes $ Hair $ Count @@;
label eyes='Eye Color' hair='Hair Color' region='Geographic Region';
datalines;
1 blue fair 23 1 blue red 7 1 blue medium 24
1 blue dark 11 1 green fair 19 1 green red 7
1 green medium 18 1 green dark 14 1 brown fair 34
1 brown red 5 1 brown medium 41 1 brown dark 40
1 brown black 3 2 blue fair 46 2 blue red 21
2 blue medium 44 2 blue dark 40 2 blue black 6
2 green fair 50 2 green red 31 2 green medium 37
2 green dark 23 2 brown fair 56 2 brown red 42
2 brown medium 53 2 brown dark 54 2 brown black 13
;
proc freq data=color;
weight count;
tables eyes hair eyes*hair/out=freqcnt outexpect
sparse;
title 'Eye and Hair Color of European Children';
run;
proc print data=freqcnt noobs;
title2 'Output Data Set from PROC FREQ';
run;
The TABLES statement requests three tables:
• Eyes and Hair frequencies
• Eyes by Hair cross-tabulation.
OUT = creates FREQCNT data set that contains cross-tabulation table frequencies.
OUTEXPECT stores expected cell frequencies
SPARSE stores zero cell counts in FREQCNT
PROC FREQ … Example
Form of PROC SUMMARY statement
PROC SUMMARY <option(s)> <statistic-keyword(s)>;
CLASS variable(s) </ option(s)>;
VAR variable(s);
OUTPUT <OUT=SAS-data-set><output-statistic-specification(s)> <id-group-specification(s)> <maximum-id-specification(s)> <minimum-id-specification(s)></ option(s)> ;
RUN;
Use this statement   To do this
BY                   Calculate separate summary statistics for each BY group
OUTPUT               Create an output data set that contains specified statistics
CLASS                Specify the grouping variables
VAR                  List the variables that need to be summarized
PROC SUMMARY
data color;
input Region Eyes $ Hair $ Count @@;
label eyes='Eye Color' hair='Hair Color' region='Geographic Region';
datalines;
1 blue fair 23 1 blue red 7 1 blue medium 24
1 blue dark 11 1 green fair 19 1 green red 7
1 green medium 18 1 green dark 14 1 brown fair 34
1 brown red 5 1 brown medium 41 1 brown dark 40
1 brown black 3 2 blue fair 46 2 blue red 21
2 blue medium 44 2 blue dark 40 2 blue black 6
2 green fair 50 2 green red 31 2 green medium 37
2 green dark 23 2 brown fair 56 2 brown red 42
2 brown medium 53 2 brown dark 54 2 brown black 13
;
proc summary data=color;
class eyes hair;
var count;
Output out = Summary (drop=_freq_) sum=;
run;
PROC SUMMARY … Example
Used to transpose SAS datasets (turning observations into variables or variables into observations)
Basic form
PROC TRANSPOSE DATA = oldname OUT = newname;
BY variable-list;
ID variable;
VAR variable-list;
Use this statement   To do this
BY                   Used if you have any grouping variables that you want to retain as variables. These variables are included in the transposed data set, but are not themselves transposed
ID                   Names the variable whose formatted values will become new variable names. In the absence of an ID statement, the new variables will be named COL1, COL2, and so on
VAR                  Names the variables whose values you want to transpose
Changing observations to variables using PROC TRANSPOSE
data color;
input Region Eyes $ Hair $ Count @@;
label eyes='Eye Color' hair='Hair Color' region='Geographic Region';
datalines;
1 blue fair 23 1 blue red 7 1 blue medium 24
1 blue dark 11 1 green fair 19 1 green red 7
1 green medium 18 1 green dark 14 1 brown fair 34
1 brown red 5 1 brown medium 41 1 brown dark 40
1 brown black 3 2 blue fair 46 2 blue red 21
2 blue medium 44 2 blue dark 40 2 blue black 6
2 green fair 50 2 green red 31 2 green medium 37
2 green dark 23 2 brown fair 56 2 brown red 42
2 brown medium 53 2 brown dark 54 2 brown black 13
;
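proc sort data=color out=color_sorted;   /* BY processing requires sorted input */
by eyes hair;
run;
proc transpose data=color_sorted out = transpose;
by eyes hair;
id Region;
var count;
run;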
PROC TRANSPOSE … Example
• To read a SAS data set - start with DATA statement specifying the name of the new SAS data set. Then follow with the SET statement specifying the name of the old SAS dataset you want to read
DATA new-data-set;
SET data-set;
• To stack data sets (appending) – With two or more datasets (that have all or most of the same variables but different observations), in addition to reading the data, the SET statement concatenates the datasets one on top of the other
DATA new-data-set;
SET data-set-1 data-set-n;
Using SET Statement
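Stacking with SET can be sketched with two small datasets that share the same variables. The dataset names work.north, work.south and work.stacked and their records are hypothetical.

```sas
data work.north;
   input Store $ Sales;
   datalines;
A 100
C 300
;
run;

data work.south;
   input Store $ Sales;
   datalines;
B 200
D 400
;
run;

/* Stacking: all observations of north, followed by all observations of south */
data work.stacked;
   set work.north work.south;
run;

proc print data=work.stacked;
run;
```

The stores appear in the order A, C, B, D, that is, in the order in which the datasets are listed.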
• The datasets you want to stack are already sorted by some important variable
• Simple stacking would result in unsorting
• Option 1 – Do a simple stacking and then use Proc SORT
• Recommended Option – Use a BY statement with your SET statement
DATA new-data-set;
SET data-set-1 data-set-n;
BY variable-list;
• Before you can use the BY statement, the datasets must be sorted by the BY variables
Interleaving data sets using SET Statement
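Interleaving differs from stacking only by the BY statement. In this sketch the hypothetical datasets work.north and work.south are each already sorted by Store, so no PROC SORT is needed beforehand.

```sas
data work.north;
   input Store $ Sales;
   datalines;
A 100
C 300
;
run;

data work.south;
   input Store $ Sales;
   datalines;
B 200
D 400
;
run;

/* Interleaving: the BY statement keeps the combined data in Store order */
data work.interleaved;
   set work.north work.south;
   by Store;
run;

proc print data=work.interleaved;
run;
```

The result is in the order A, B, C, D without a separate sort of the combined data.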
• First sort all datasets by the common variable(s)
• Basic form
DATA new-data-set;
MERGE data-set-1 data-set-n;
BY variable-list;
• If the datasets being merged have variables with same names (besides the BY variables), then the variables from the second dataset will overwrite any variables having the same name in the first data set.
• All observations from both the data sets are included in the final data set, irrespective of whether they had a match or not
One to One Match Merge
• Each observation in dataset 1 matches with more than one observation in dataset 2
• Basic form
DATA new-data-set;
MERGE data-set-1 data-set-n;
BY variable-list;
• The order of the datasets in the MERGE statement does not matter to SAS, i.e., a one to many merge is same as many to one merge
• One to many merge cannot be done without a BY statement. Without any BY variables for matching, SAS simply joins together the first observation from each data set, then the second observation from each data set and so on.
One to Many Match Merge
• Datasets can be merged conditionally, for example to keep the data present in only one file, the data common to all datasets, or the data in one file that is not present in the other
• Basic form
DATA new-data-set;
MERGE data-set-1 (in = a) data-set-2 (in = b);
BY variable-list;
IF condition;
• Various conditions used while merging data sets are:
• IF a or b: Union of two datasets
• IF a and b: Intersection of two datasets
• IF a and not b: Data in the first file not present in the second
Various ways of merging data sets
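The conditions can be tried on two small datasets that are already sorted by the BY variable. This is a sketch: the dataset names work.names, work.scores, work.both and work.names_only and all records are hypothetical.

```sas
data work.names;
   input ID Name $;
   datalines;
1 Asha
2 Ravi
3 Lata
;
run;

data work.scores;
   input ID Score;
   datalines;
2 85
3 90
4 70
;
run;

data work.both work.names_only;
   merge work.names (in=a) work.scores (in=b);
   by ID;
   if a and b then output work.both;                 /* intersection: IDs 2 and 3 */
   else if a and not b then output work.names_only;  /* only in names: ID 1 */
run;
```

ID 4 exists only in work.scores and matches neither condition, so it is written to neither output dataset.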
• Say, you want to compare each observation in a group to the group’s mean
• Summarize your data using PROC MEANS and write the results in a new dataset
• Merge the summarized data back with the original data using a one-to-many match merge
Merging Summary statistics with the original data
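The three steps above might look like the following sketch. The dataset work.sales, its records and the names MeanAmount and DiffFromMean are hypothetical; the data is entered already sorted by Region, so no PROC SORT is needed.

```sas
data work.sales;
   input Region $ Amount;
   datalines;
East 100
East 300
West 200
West 600
;
run;

/* Step 1: one summary observation per Region */
proc means data=work.sales noprint;
   by Region;
   var Amount;
   output out=work.regmean mean(Amount)=MeanAmount;
run;

/* Step 2: one-to-many match merge back onto the original data */
data work.compare;
   merge work.sales work.regmean (keep=Region MeanAmount);
   by Region;
   DiffFromMean = Amount - MeanAmount;
run;

proc print data=work.compare;
run;
```

KEEP= drops the automatic variables _TYPE_ and _FREQ_ that PROC MEANS adds to its output dataset.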
• MERGE cannot be used as there are no common variables.
• You can use two SET statements
DATA new-data-set;
IF _N_ = 1 THEN SET summary-data-set;
SET original-data-set;
• Original-dataset is the data with more than one observation and summary data set is the data with a single observation. SAS reads original data set in a normal SET statement. It also reads the summary data set with the SET statement but only in the first iteration of the data step and then retains the value of variables from summary dataset for all observations in new data set
Combining a grand total with the original data
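A runnable version of this pattern is sketched below. The dataset work.sales, its records and the variable names GrandTotal and PctOfTotal are hypothetical.

```sas
data work.sales;
   input Region $ Amount;
   datalines;
East 100
East 300
West 200
West 600
;
run;

/* Summary dataset with a single observation holding the grand total */
proc means data=work.sales noprint;
   var Amount;
   output out=work.total sum(Amount)=GrandTotal;
run;

data work.share;
   /* Read the single summary observation once; its value is retained */
   if _n_ = 1 then set work.total (keep=GrandTotal);
   set work.sales;
   PctOfTotal = 100 * Amount / GrandTotal;
run;

proc print data=work.share;
run;
```

Variables read with SET are retained automatically, which is why GrandTotal is available on every observation without a RETAIN statement.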
• Can be used while combining two datasets, to track which of the original data sets contributed to each observation
• Unlike most variables, IN= variables are temporary, existing only during the current DATA step
• SAS gives the IN= variables a value of 0 or 1 (1 implying that the dataset did contribute to the current observation and a value of 0 means that it did not)
Tracking and selecting observations with the IN = Option
• To create multiple datasets in a single DATA step, simply put more than one data set name in your DATA statement
• Example
DATA lions tigers bears;
• In the above example, SAS would create 3 identical data sets
• To create different datasets, use the OUTPUT statement
• Basic form
OUTPUT data-set-name;
• Example
IF family = "Ursidae" then OUTPUT bears;
Writing multiple data sets using the OUTPUT statement
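A complete DATA step that routes observations to different datasets could look like this sketch. The dataset names work.bears and work.cats and the three records are hypothetical.

```sas
data work.bears work.cats;
   input Animal $ Family $;
   /* Each observation is written only to the dataset named in OUTPUT */
   if Family = 'Ursidae' then output work.bears;
   else if Family = 'Felidae' then output work.cats;
   datalines;
Grizzly Ursidae
Lion Felidae
Tiger Felidae
;
run;
```

Once an OUTPUT statement appears in a DATA step, SAS no longer writes observations automatically at the end of each iteration.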
• To write several observations for each pass through the DATA step, put an OUTPUT statement in a DO loop or just use several OUTPUT statements
• Example - Say we want to generate data points for plotting the equation y = x²
DATA generate;
   DO x = 1 TO 6;
      y = x ** 2;
      OUTPUT;
   END;
RUN;
• Since the OUTPUT statement is within the DO loop, an observation is created each time through the loop. Without the OUTPUT statement, SAS would have written only one observation at the end of the DATA step
Making several observations from one using the OUTPUT statement
• SAS functions modify or extract parts of the data values in an observation
• To extract a portion of a data value: new_variable = SUBSTR (variable, starting position, length)
• To check the length of values: new_variable = LENGTH (variable)
• To remove blanks within values (by default COMPRESS removes every blank): new_variable = COMPRESS (variable)
• To extract the part of a value before or after a delimiter such as “-”, “(”, “_” etc.: new_variable = SCAN (variable, word number, delimiter)
• To extract the month, year or day part of a date: new_variable = MONTH (variable), YEAR (variable) or DAY (variable)
• When a variable holds both date and time, e.g. “23Apr06 00:00:00”, the date part is extracted using: new_variable = DATEPART (variable)
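The functions listed above can be tried out in a DATA _NULL_ step that writes its results to the log. The variable names and the sample phone number are made up for this sketch.

```sas
data _null_;
   phone   = '(212) 555-0147';
   area    = substr(phone, 2, 3);      /* 3 characters starting at position 2: 212 */
   len     = length(phone);            /* 14 */
   nospace = compress(phone);          /* all blanks removed: (212)555-0147 */
   last4   = scan(phone, 2, '-');      /* second word when split on '-': 0147 */

   stamp   = '23APR2006:00:00:00'dt;   /* datetime constant */
   d       = datepart(stamp);          /* date part of the datetime value */
   m       = month(d);                 /* 4 */
   y       = year(d);                  /* 2006 */

   put area= len= nospace= last4= m= y= d= date9.;
run;
```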
Some useful functions used in SAS
Proc SQL
Arrays / DO-END
Retain / First. Last.
Agenda – Module 3
What can SQL do?
» Selecting
» Ordering/sorting
» Subsetting
» Restructuring
» Creating table/view
» Joining/Merging
» Transforming variables
» Editing
PROC SQL – What?
The Advantage of using SQL
» Combined functionality
» Faster for smaller tables
» SQL code is more portable for non-SAS applications
» Does not require presorting
» Does not require common variable names to join on (the join columns need the same type and length)
PROC SQL – Why?
PROC SQL;
SELECT DISTINCT rating FROM MFE.MOVIES;
QUIT;
The simplest SQL code needs three statements
By default, PROC SQL prints the query result; use the NOPRINT option to suppress printing.
Begin with PROC SQL; end with QUIT; not RUN;
At least one SELECT … FROM statement is needed
DISTINCT is an option that removes duplicate rows
Selecting Data
PROC SQL ;
SELECT *
FROM MFE.MOVIES
ORDER BY category;
QUIT;
Remember that line breaks within a SAS statement have no effect, so we can split the middle statement over three lines
SELECT * means we select all variables from dataset MFE.MOVIES
Put ORDER BY after FROM
We sort the data by variable “category”
Ordering/Sorting Data
PROC SQL;
SELECT title, category
FROM MFE.MOVIES
WHERE category CONTAINS 'Action';
QUIT;
Use comma (,) to separate selected variables
CONTAINS in a WHERE clause works only with character variables
Also try WHERE UPCASE(category) LIKE '%ACTION%';
Use the percent sign (%) as a wildcard character with the LIKE operator.
Sub-Setting Data: Character searching in WHERE
PROC SQL;
SELECT title, category, rating
FROM MFE.MOVIES
WHERE category =* 'Drana';
QUIT;
Always put WHERE after FROM
The sounds-like operator is =*
Searches category for phonetic variations of “drama”, which also catches possible spelling variations
Sub-Setting Data: Phonetic Matching in WHERE
PROC SQL;
CREATE TABLE ACTION AS
SELECT title, category
FROM MFE.MOVIES
WHERE category CONTAINS 'Action';
QUIT;
CREATE TABLE … AS can always be placed in front of a SELECT … FROM statement to build a SAS data set.
In SELECT, the results of a query are converted to an output object (printing). Query results can also be stored as data. The CREATE TABLE statement creates a table with the results of a query. The CREATE VIEW statement stores the query itself as a view. Either way, the data identified in the query can be used in later SQL statements or in other SAS steps.
Produces a new dataset (table) ACTION in the WORK library, with no printing
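The CREATE VIEW variant mentioned above might look like the sketch below. It uses a small stand-in table in the WORK library instead of MFE.MOVIES; the rows and the view name action_v are invented for the example.

```sas
/* Stand-in for MFE.MOVIES with made-up rows */
data movies;
   input title $ category $;
   datalines;
Speed Action
Ronin Action
Titanic Drama
;
run;

proc sql;
   /* Stores the query itself, not its result */
   create view action_v as
      select title, category
      from movies
      where category contains 'Action';

   /* The view can be queried like a table; the stored query runs now */
   select * from action_v;
quit;
```

A table is a snapshot taken when CREATE TABLE runs, while a view re-evaluates its query each time it is used, so it reflects later changes to the underlying table.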
Creating New Data: Create Table
PROC SQL;
SELECT *
FROM MFE.CUSTOMERS, MFE.MOVIES;
QUIT;
Terminology: Join (Merge) datasets (tables)
No prior sorting required – one advantage over DATA MERGE
Use comma (,) to separate two datasets in FROM
Without WHERE, all possible combinations of rows from the two tables are produced, and all columns are included
Turn on the HTML result option for better display: Tools > Options > Preferences… > Results > check Create HTML > OK
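For contrast with the Cartesian join, here is a sketch of the same FROM list restricted by a WHERE clause. The two small tables and their columns (cust_id, fav_category, title, category) are hypothetical stand-ins for MFE.CUSTOMERS and MFE.MOVIES, chosen so that the join columns have different names but the same type and length.

```sas
data customers;
   input cust_id fav_category $;
   datalines;
1 Action
2 Drama
;
run;

data movies;
   input title $ category $;
   datalines;
Speed Action
Ronin Action
Titanic Drama
;
run;

proc sql;
   /* Cartesian join: 2 x 3 = 6 rows */
   select *
   from customers, movies;

   /* Adding WHERE keeps only the matching combinations: 3 rows */
   select c.cust_id, m.title, m.category
   from customers as c, movies as m
   where c.fav_category = m.category;
quit;
```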
Join Tables (Merge datasets): Cartesian Join
PROC SQL;
SELECT *,
COUNT(title) AS notitle,
MAX(year) AS most_recent,
MIN(year) AS earliest,
SUM(length) AS total_length,
NMISS(rating) AS nomissing
FROM MFE.MOVIES
GROUP BY rating;
QUIT;
Simple summarization functions available
All functions can be applied within groups (GROUP BY)
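A sketch of a grouped summary that returns one row per group. It selects the grouping column instead of *, which is a variation on the slide's query rather than a copy of it. The table, its rows, and the column name minutes (standing in for the slide's length) are made up.

```sas
data movies;
   input title $ rating $ year minutes;
   datalines;
Speed R 1994 116
Ronin R 1998 122
Shrek PG 2001 90
;
run;

proc sql;
   /* One output row per distinct value of rating */
   select rating,
          count(title) as n_titles,
          max(year)    as most_recent,
          min(year)    as earliest,
          sum(minutes) as total_minutes
   from movies
   group by rating;
quit;
```

When the SELECT list also contains columns that are neither grouped nor summarized (as SELECT * does), SAS remerges the summary values back onto every detail row instead of collapsing each group to one row.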
Transforming Data: Summarizing Data using SQL functions
You can use arrays to simplify programs that
» perform repetitive calculations
» create many variables with the same attributes
» read data
» rotate SAS data sets by making variables into observations or observations into variables
» compare variables
» perform a table lookup.
Array Processing
An array in SAS provides a means for repetitively processing variables using a do-loop. Arrays are merely a convenient way of grouping variables, and do not persist beyond the data step in which they are used
SAS arrays can be used for simple repetitive tasks, reshaping data sets, and remembering values from observation-to-observation
Arrays can be used to allow some traditional matrix-style programming techniques to be used in the data step
In short, a SAS array
» is a temporary grouping of SAS variables that are arranged in a particular order
» is identified by an array name
» exists only for the duration of the current DATA step
» is not a variable.
Each value in an array is
» called an element
» identified by a subscript that represents the position of the element in the array.
What Is a SAS Array?
ARRAY name <{nelem}> <$> <elements> <(initial-values)>;
Examples:
• array x x1-x3;
• array check{5} _temporary_;
• array miss{4} _temporary_ (9 9 99 9);
• array dept $ dept1-dept4 ('Sales', 'Research', 'Training');
• array value{3}; * generates value1, value2 and value3;
All variables in an array must have the same type (numeric or character)
An array cannot have the same name as a variable in the same DATA step
You must explicitly state the number of elements when using _temporary_; in other cases SAS figures it out from context, generating new variables if necessary.
Array Statement: Syntax
[Diagram: the variables ID, QTR1, QTR2, QTR3 and QTR4; the array CONTRIB groups the four QTR variables]
Array name: CONTRIB
Array references: CONTRIB{1} (first element), CONTRIB{2} (second element), CONTRIB{3} (third element), CONTRIB{4} (fourth element)
What is a SAS Array?
The ARRAY statement defines the elements in an array. These elements will be processed as a group. You refer to elements of the array by the array name and subscript.
ARRAY array-name {subscript} <$> <length> <array-elements> <(initial-value-list)>;
The ARRAY Statement
The ARRAY statement
» must contain all numeric or all character elements
» must be used to define an array before the array name can be referenced
» creates variables if they do not already exist in the PDV
» is a compile-time statement.
The ARRAY Statement
Write an ARRAY statement that defines the four quarterly contribution variables as elements of an array.
array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
[Diagram: the variables ID, QTR1, QTR2, QTR3 and QTR4, with the array CONTRIB spanning the four QTR variables]
First element: QTR1
Second element: QTR2
Third element: QTR3
Fourth element: QTR4
Defining an Array
Variables that are elements of an array need not have similar, related or numbered names.
array Contrib2{4} Q1 Qrtr2 ThrdQ Qtr4;
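[Diagram: the variables ID, Q1, QRTR2, THRDQ and QTR4, with the array CONTRIB2 spanning the four contribution variables]
First element: Q1
Second element: QRTR2
Third element: THRDQ
Fourth element: QTR4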
Defining an Array
Array processing often occurs within DO loops. An iterative DO loop that processes an array has the following form:
To execute the loop as many times as there are elements in the array, specify that the values of index-variable range from 1 to number-of-elements-in-array.
DO index-variable=1 TO number-of-elements-in-array;
   additional SAS statements using array-name{index-variable}…
END;
Processing an Array
array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
do Qtr=1 to 4;
   Contrib{Qtr}=Contrib{Qtr}*1.25;
end;
[Diagram: the array reference CONTRIB{QTR} resolves to a different element for each value of the index variable Qtr]
Value of index variable Qtr    Array reference    Element
1                              CONTRIB{1}         QTR1 (first element)
2                              CONTRIB{2}         QTR2 (second element)
3                              CONTRIB{3}         QTR3 (third element)
4                              CONTRIB{4}         QTR4 (fourth element)
Processing an Array
data charity(drop=Qtr);
   set prog2.donate;
   array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
   do Qtr=1 to 4;
      Contrib{Qtr}=Contrib{Qtr}*1.25;
   end;
run;
When Qtr=1: Contrib{1}=Contrib{1}*1.25; is equivalent to Qtr1=Qtr1*1.25;
When Qtr=2: Contrib{2}=Contrib{2}*1.25; is equivalent to Qtr2=Qtr2*1.25;
When Qtr=3: Contrib{3}=Contrib{3}*1.25; is equivalent to Qtr3=Qtr3*1.25;
When Qtr=4: Contrib{4}=Contrib{4}*1.25; is equivalent to Qtr4=Qtr4*1.25;
Performing Repetitive Calculations
proc print data=charity noobs;
run;
Partial PROC PRINT Output
ID       Qtr1    Qtr2    Qtr3    Qtr4
E00224  15.00   41.25   27.50       .
E00367  43.75   60.00   50.00   37.50
E00441      .   78.75  111.25  112.50
E00587  20.00   23.75   37.50   36.25
E00598   5.00   10.00    7.50    1.25
Performing Repetitive Calculations
Calculate the percentage that each quarter's contribution represents of the employee's total annual contribution. Base the percentage only on the employee's actual contribution and ignore the company contributions.
Partial Listing of prog2.donate
ID Qtr1 Qtr2 Qtr3 Qtr4
E00224   12   33   22    .
E00367   35   48   40   30
Creating Variables with Arrays
data percent(drop=Qtr);
   set prog2.donate;
   Total=sum(of Qtr1-Qtr4);
   array Contrib{4} Qtr1-Qtr4;
   array Percent{4};
   do Qtr=1 to 4;
      Percent{Qtr}=Contrib{Qtr}/Total;
   end;
run;
The second ARRAY statement creates four numeric variables: Percent1, Percent2, Percent3, and Percent4.
Creating Variables with Arrays
proc print data=percent noobs;
   var ID Percent1-Percent4;
   format Percent1-Percent4 percent6.;
run;
Partial PROC PRINT Output
ID      Percent1  Percent2  Percent3  Percent4
E00224     18%       49%       33%        .
E00367     23%       31%       26%       20%
E00441       .       26%       37%       37%
E00587     17%       20%       32%       31%
E00598     21%       42%       32%        5%
Creating Variables with Arrays
Calculate the difference in each employee's actual contribution from one quarter to the next.
Partial Listing of prog2.donate
ID Qtr1 Qtr2 Qtr3 Qtr4
E00224   12   33   22    .
E00367   35   48   40   30
First difference: Qtr2 - Qtr1
Second difference: Qtr3 - Qtr2
Third difference: Qtr4 - Qtr3
Creating Variables with Arrays
data change(drop=i);
   set prog2.donate;
   array Contrib{4} Qtr1-Qtr4;
   array Diff{3};
   do i=1 to 3;
      Diff{i}=Contrib{i+1}-Contrib{i};
   end;
run;
Creating Variables with Arrays
When i=1: Diff{1}=Contrib{2}-Contrib{1}; is equivalent to Diff1=Qtr2-Qtr1;
When i=2: Diff{2}=Contrib{3}-Contrib{2}; is equivalent to Diff2=Qtr3-Qtr2;
When i=3: Diff{3}=Contrib{4}-Contrib{3}; is equivalent to Diff3=Qtr4-Qtr3;
Creating Variables with Arrays
proc print data=change noobs;
   var ID Diff1-Diff3;
run;
Partial PROC PRINT Output
ID      Diff1  Diff2  Diff3
E00224    21    -11      .
E00367    13     -8    -10
E00441     .     26      1
E00587     3     11     -1
E00598     4     -2     -5
Creating Variables with Arrays
Determine the difference between employee contributions and last year's average quarterly goals of $10, $15, $5, and $10 per employee.
data compare(drop=Qtr Goal1-Goal4);
   set prog2.donate;
   array Contrib{4} Qtr1-Qtr4;
   array Diff{4};
   array Goal{4} Goal1-Goal4 (10,15,5,10);
   do Qtr=1 to 4;
      Diff{Qtr}=Contrib{Qtr}-Goal{Qtr};
   end;
run;
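As a variation on the step above, the goals could be held in a _TEMPORARY_ array, which creates no Goal1-Goal4 variables and therefore needs no DROP= for them. This is only a sketch: the donate data set is rebuilt here from the two rows shown in the partial listing so that the example runs on its own.

```sas
/* Two rows copied from the partial listing of prog2.donate */
data donate;
   input ID $ Qtr1-Qtr4;
   datalines;
E00224 12 33 22 .
E00367 35 48 40 30
;
run;

data compare(drop=Qtr);
   set donate;
   array Contrib{4} Qtr1-Qtr4;
   array Diff{4};
   /* Temporary array elements are retained and never written out */
   array Goal{4} _temporary_ (10,15,5,10);
   do Qtr=1 to 4;
      Diff{Qtr}=Contrib{Qtr}-Goal{Qtr};
   end;
run;

proc print data=compare noobs;
   var ID Diff1-Diff4;
run;
```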
Assigning Initial Values
proc print data=compare noobs;
   var ID Diff1 Diff2 Diff3 Diff4;
run;
Partial PROC PRINT Output
ID      Diff1  Diff2  Diff3  Diff4
E00224     2     18     17      .
E00367    25     33     35     20
E00441     .     48     84     80
E00587     6      4     25     19
E00598    -6     -7      1     -9
Assigning Initial Values
_N_ and _ERROR_
» _N_ indicates the number of times SAS has looped through the DATA step (not necessarily equal to the observation number)
» _ERROR_ has a value of 1 if there is a data error for that observation and 0 if there isn't
FIRST.variable and LAST.variable
» FIRST.variable and LAST.variable are available when using a BY statement in a DATA step.
» FIRST.variable has a value of 1 when SAS is processing the observation with the first occurrence of a new value for that variable, and a value of 0 for the other observations.
» Similarly, LAST.variable has a value of 1 for the observation with the last occurrence of a value for that variable, and 0 otherwise.
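Because automatic variables are not written to the output data set, one way to see them is to copy them into ordinary variables. The sketch below does that; the data set visits and the variables score, row, is_first and is_last are invented for the example.

```sas
/* Hypothetical data, already sorted by person */
data visits;
   input person score;
   datalines;
1 5
1 7
2 3
3 4
3 6
3 8
;
run;

data flags;
   set visits;
   by person;                  /* makes first.person and last.person available */
   row      = _n_;             /* iteration counter of the DATA step */
   is_first = first.person;    /* 1 on the first observation of each person */
   is_last  = last.person;     /* 1 on the last observation of each person */
run;

proc print data=flags noobs;
run;
```

Person 2 has a single observation, so both is_first and is_last are 1 on that row.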
Using SAS Automatic Variables
data real_life;
   input person topicA;
   cards;
1 0
1 1
3 -1
1 0
2 0
1 1
2 -1
2 -1
3 0
3 1
4 0
1 1
4 1
4 0
2 -1
4 0
4 0
1 -1
;
run;
The goal is to compare each observation with the previous and the next observation. If they are the same then flag the observation….
Use of Retain, first. and last.
We need to number the observations within each person. We will be using first.person in the process of doing this, so we must first sort the data on person. Then we will create the count variable, which enumerates the observations within each person.
proc sort data=real_life out=sort_real;
   by person;
run;
data count_real;
   set sort_real;
   retain count;
   by person;
   if first.person then count = 0;
   count = count + 1;
run;
proc print data=count_real noobs;
run;
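The same within-group counter can also be written with a sum statement, which retains its variable automatically, so no RETAIN statement is needed. A self-contained sketch with a made-up, already-sorted input (the names sorted and counted are hypothetical):

```sas
/* Hypothetical data, already sorted by person */
data sorted;
   input person topicA;
   datalines;
1 0
1 1
2 -1
2 0
2 1
;
run;

data counted;
   set sorted;
   by person;
   if first.person then count = 0;   /* restart the counter for each person */
   count + 1;                        /* sum statement: count is retained automatically */
run;

proc print data=counted noobs;
run;
```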
….Using first.
data wide_real;
   set count_real;
   array AtopicA(6) topicA_1-topicA_6;
   retain topicA_1-topicA_6;
   by person;
   if first.person then do;
      do i = 1 to 6;
         AtopicA[i] = .;
      end;
   end;
   AtopicA(count) = topicA;       /* looping across values in the variable count */
   if last.person then output;    /* outputs only the last obs per person */
run;
proc print data=wide_real noobs;
   var person topicA_1-topicA_6;
run;
We now convert the data set from long to wide.
Note: We are using first.person and last.person, but we do not need to re-sort the data since it is already sorted on person.
….Use of both first. and last.
Now, let's find the people who have the same value for 3 observations in a row.
data three;
   set wide_real;
   array topic(6) topicA_1-topicA_6;
   do i = 2 to 5;
      if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne .
         & topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1;
   end;
   if flagA=. then flagA=0;
run;
proc print data=three noobs;
   var person topicA_1-topicA_6 flagA;
run;
….Use of both first. and last.
Thank you !