Copyright © 2012, SAS Institute Inc. All rights reserved.
Innovative Techniques: Doing More with Loops and Arrays SAS Talks April 12, 2012
Speakers
Stacy Hobson
Director, Customer Loyalty and Retention SAS Institute
Art Carpenter
Author and Senior Consultant California Occidental Consultants
Innovative Techniques: Doing More with Loops and Arrays
Arthur L. Carpenter California Occidental Consultants
10606 Ketch Circle Anchorage, AK 99515
(907) 865-9167 [email protected] www.caloxy.com
INTRODUCTION
Techniques involving the use of DO loops and arrays provide the programmer with powerful and diverse methods for solving a wide range of problems. This presentation assumes that the audience has at least a passing understanding of the:
• types of DO loops and their basic syntax
• ARRAY statement and its various forms
The primary objective of this presentation is to demonstrate the interaction of DO loops and arrays.
Shorthand Variable Naming (2.6.1)
Numbered range lists: variables with a common prefix and a numeric suffix. This form can create variables.
   array vis {10} visit1 - visit10;
Name prefix lists: the colon operator can be used to select variables by a common prefix.
   array vis {10} visit:;
Shorthand Variable Naming (2.6.1)
Specialized name lists address variables by their type.
_CHARACTER_   All character variables
_NUMERIC_     All numeric variables
_ALL_         All variables on the PDV
Since each of these lists pertains to the current list of variables, they will not create variables. In each case the resulting list of variables will be in the same order as they are on the Program Data Vector.
DO Loop Specifications (3.9.2)
Compound loop specifications
   do count=1 to 3, 5 to 20 by 5, 26, 33; . . . end;
   COUNT: 1, 2, 3, 5, 10, 15, 20, 26, 33
Character index value specifications
   do month = 'Jan', 'Feb', 'Mar'; . . . end;
Special DO Loop Forms (3.9.3)
   do count=1 to 3; . . . end;
COUNT exits the loop with a value of COUNT=4, because the index is incremented once more before the TO limit is checked.
   do count=1 to 3 until(count=3); . . . end;
COUNT exits the loop with a value of COUNT=3, because the UNTIL condition is checked before the index is incremented.
Incremented infinite loop with conditional exit
   do k=1 by 1 until(x=5); . . . end;
Notice there is no TO specification.
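The exit values of these three loop forms can be checked with a short DATA _NULL_ step. This is only a sketch: the variable X and the loop bodies are made up for illustration, and the PUT statements write the final index values to the log.

```sas
data _null_;
   /* Plain iterative loop: the index is incremented past the TO limit. */
   do count = 1 to 3;
   end;
   put 'Plain iterative loop: ' count=;

   /* UNTIL is evaluated at the bottom of the loop, before the increment. */
   do count = 1 to 3 until(count = 3);
   end;
   put 'With UNTIL(count=3): ' count=;

   /* No TO specification: only the UNTIL condition ends the loop. */
   x = 0;
   do k = 1 by 1 until(x = 5);
      x = x + 1;
   end;
   put 'No TO specification: ' k= x=;
run;
```

The log should show COUNT=4 for the first loop, COUNT=3 for the second, and K=5 X=5 for the third.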
ARRAY Statement Forms (3.10.1)
   array list {3} aa bb cc;
   array list {1:3} aa bb cc;
   array list {0:2} aa bb cc;
   array vis {16} visit1-visit16;
   array vis {*} visit1-visit16;
   array visit {16} ;
   array nvar {*} _numeric_;
   array nvar {*} _character_;
   array clist {3} $2 aa bb cc;
   array clist {3} $1 ('a', 'b', 'c');
   array clist {4:6} $1 ('a', 'b', 'c');
• {3}: the array dimension equals the number of elements.
• {3} and {1:3}: these two array dimensions are the same.
• {0:2}: an index range. Here LIST{1} references the variable BB.
• {*}: the dimension is calculated by SAS.
• ('a', 'b', 'c'): the three array elements are initialized with these three character values.
• No variable list means that variables are created: CLIST1, CLIST2, CLIST3.
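Two of these points, the index range and the variables created when no variable list is given, can be seen in a small self-contained step. This is a sketch for illustration only; the values assigned to AA, BB, and CC and the variable SECOND are made up.

```sas
data _null_;
   aa = 1; bb = 2; cc = 3;

   /* Index range 0 to 2: LIST{0} is AA, LIST{1} is BB, LIST{2} is CC. */
   array list {0:2} aa bb cc;
   second = list{1};
   put second=;

   /* No variable list: SAS creates CLIST1, CLIST2, and CLIST3. */
   array clist {3} $1 ('a', 'b', 'c');
   put clist1= clist2= clist3=;
run;
```

The log should show SECOND=2, followed by the three initialized character values.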
Temporary Arrays (3.10.2)
   array visdate {16} _temporary_;
   array list {5} _temporary_ (11,12,13,14,15);
   array list {5} _temporary_ (11:15);
   array list {6} _temporary_ (6*3);
   array list {6} _temporary_ (2*1:3);
• No permanent variables are created.
• The dimension must be specified.
• Initial values are assigned as a list inside of parentheses.
• (6*3) is the same as (3,3,3,3,3,3).
• (2*1:3) is the same as (1,2,3,1,2,3).
TRANSPOSING DATA (2.4)
This task is a simple introduction to the relationship between the DO loop and the array defined by the ARRAY statement.
Most, but not all, SAS procedures prefer to operate against normalized data, which tends to be tall and narrow, and often contains classification variables that are used to identify individual rows. Data in non-normal form tends to have one column for each level of one of the classification variables. Converting between the two forms can be accomplished using PROC TRANSPOSE or from within the DATA step. This is one of the 'classic' uses of arrays in conjunction with DO loops.
Normal Form
Obs  SUBJECT  VISIT  sodium
  1      208      1    13.7
  2      208      2    14.1
  3      208      4    14.1
  4      208      5    14.1
  5      208      6    13.9
  6      208      7    13.9
  7      208      8    14.0
  8      208      9    14.0
  9      208     10    14.0
 10      209      1    14.0
. . . . portions of the table are not shown . . . .
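The text above mentions PROC TRANSPOSE as an alternative to the DATA step. A minimal sketch of that route is shown here, assuming a small sample of the normal-form data; the output data set name LAB_WIDE is made up for illustration, and the input must be sorted by SUBJECT.

```sas
/* A few rows in normal form, taken from the table above. */
data lab_chemistry;
   input subject visit sodium;
   datalines;
208 1 13.7
208 2 14.1
208 4 14.1
209 1 14.0
209 2 14.0
;
run;

/* One row per SUBJECT; ID VISIT with PREFIX= names the new columns. */
proc transpose data=lab_chemistry
               out=lab_wide(drop=_name_)
               prefix=visit;
   by subject;
   id visit;
   var sodium;
run;
```

One difference from the array approach is worth noting: PROC TRANSPOSE only creates columns for visit numbers that actually occur in the data. With this sample there is no VISIT3 column, whereas an array declared as visit1-visit16 always produces all sixteen.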
Transposing Rows to Columns (2.4.2)
Non-normal Form
Obs SUBJECT visit1 visit2 visit3 visit4 visit5 visit6 visit7 visit8 visit9 visit10 visit11 visit12 visit13 visit14 visit15 visit16
  1     208   13.7   14.1      .   14.1   14.1   13.9   13.9     14     14    14.0       .       .       .       .       .       .
  2     209   14.0   14.0      .   13.9   14.2   14.5   13.8     14      .    13.8      14    14.1    14.2    14.1      14    14.1
. . . . portions of the table are not shown . . . .
Transposing Rows to Columns (2.4.2)
Commonly the process of transposing will involve the use of an array and an iterative DO loop.
   data lab_nonnormal(keep=subject visit1-visit16);
      set lab_chemistry(keep=subject visit sodium);
      /* SUBJECT remains as a CLASS variable. */
      by subject;
      retain visit1-visit16 ;
      /* VISITS will become columns. */
      array visits {16} visit1-visit16;
      if first.subject then do i = 1 to 16;
         visits{i} = .;
      end;
      /* Assign the value. */
      visits{visit} = sodium;
      if last.subject then output lab_nonnormal;
   run;
Transposing Columns to Rows (2.4.2)
Assume that the visits are to become the classification variable.
   data lab_normal(keep=subject visit sodium);
      set lab_nonnormal(keep=subject visit:);
      by subject;
      /* VISITS exist as columns. */
      array visits {16} visit1-visit16;
      do visit = 1 to 16;
         /* Use visit number as the array index. */
         sodium = visits{visit};
         /* The OUTPUT statement is inside of the DO loop. */
         output lab_normal;
      end;
   run;
Transposing Columns to Rows (2.4.2)
Obs  SUBJECT  visit  sodium
  1      208      1    13.7
  2      208      2    14.1
  3      208      3       .
  4      208      4    14.1
  5      208      5    14.1
  6      208      6    13.9
  7      208      7    13.9
  8      208      8    14.0
  9      208      9    14.0
 10      208     10    14.0
 11      208     11       .
 12      208     12       .
 13      208     13       .
 14      208     14       .
 15      208     15       .
 16      208     16       .
 17      209      1    14.0
. . . . portions of the table are not shown . . . .
ARRAY FUNCTIONS (3.10.3)
Three functions have been designed to work with arrays.
DIM – returns the dimension of the array
   data newchem(drop=i);
      set advrpt.lab_chemistry (drop=visit labdt);
      /* The number of numeric variables determines the array dimension. */
      array chem {*} _numeric_;
      /* The DIM function returns the dimension. */
      do i=1 to dim(chem);
         chem{i} = chem{i}/100;
      end;
   run;
Although it does in this example, the index does not always vary from 1 to the dimension.
ARRAY FUNCTIONS (3.10.3)
LBOUND and HBOUND – Lower and upper array bounds
   data CloseHT;
      /* Array bounds unknown to the programmer. */
      array heights {&lb:&hb} _temporary_;
      /* The array HEIGHTS is loaded. */
      do until(done);
         set advrpt.demog(keep=subject ht) end=done;
         /* Notice the use of ( ) instead of { }. Do not use ( ). */
         heights(subject)=ht;
      end;
      done=0;
      /* Read the data again. */
      do until(done);
         set advrpt.demog(keep=subject ht) end=done;
         /* Step through the array one element at a time. */
         do Hsubj = lbound(heights) to hbound(heights);
            closeHT = heights{hsubj};
            if (ht-1 le closeht le ht+1)
               & (subject ne hsubj) then output closeHT;
         end;
      end;
      stop;
   run;
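The difference between DIM and the LBOUND/HBOUND pair shows up as soon as an array does not start at 1. The following sketch uses a made-up temporary array named STACK with an index range of 0 to 2; the variable names N, LO, HI, and VALUE are only for illustration.

```sas
data _null_;
   array stack {0:2} _temporary_ (10, 20, 30);

   n  = dim(stack);      /* number of elements */
   lo = lbound(stack);   /* lowest valid index */
   hi = hbound(stack);   /* highest valid index */
   put n= lo= hi=;

   /* Looping from LBOUND to HBOUND visits every element. */
   do i = lbound(stack) to hbound(stack);
      value = stack{i};
      put i= value=;
   end;
run;
```

Here DIM returns 3 while the valid indexes are 0 through 2, so a loop written as DO I = 1 TO DIM(STACK) would skip element 0 and then fail with a subscript out of range at 3.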
WORKING ACROSS OBSERVATIONS (3.1)
Because SAS reads one observation at a time into the PDV, it is difficult to ‘remember’ the values from an earlier observation (look-back) or to anticipate the values of a future observation (look-ahead). Without doing something extra, only the current observation is available for use.
Processing Within Groups (3.1.1)
The problems inherent with single observation processing are especially apparent when we need to work with our data in groups. The BY statement can be used to define groups, but the detection and handling of group boundaries is still an issue. Fortunately there is more than one approach to this type of processing.
BY Group Processing: FIRST. and LAST. (3.1.1)
Count clinics and patient visits within each region.
   data counter(keep=region clincnt patcnt);
      set regions(keep=region clinnum);
      by region clinnum;
      /* Initialize the counters. */
      if first.region then do;
         clincnt=0;
         patcnt=0;
      end;
      /* Count items of interest (increment counters). */
      if first.clinnum then clincnt + 1;
      patcnt+1;
      /* Write totals. */
      if last.region then output;
   run;
Because FIRST.CLINNUM takes the value 1 or 0, the clinic counter can also be written without the IF:
   clincnt + first.clinnum;
Transposing to Temporary Arrays (3.1.2)
Used to handle more than one observation at a time.
   data labvisits(keep=subject count meanlength);
      set advrpt.lab_chemistry;
      by subject;
      /* Temporary array to hold values of interest. */
      array Vdate {16} _temporary_;
      retain totaldays count 0;
      /* Initialize the counters. */
      if first.subject then do;
         totaldays=0;
         count = 0;
         do i = 1 to 16;
            vdate{i}=.;
         end;
      end;
      /* Load the array. */
      vdate{visit} = labdt;
      if last.subject then do;
         /* Process across values. */
         do i = 1 to 15;
            between = vdate{i+1}-vdate{i};
            if between ne . then do;
               totaldays = totaldays+between;
               count = count+1;
            end;
         end;
         /* Write the mean for this subject. */
         meanlength = totaldays/count;
         output;
      end;
   run;
Transposing to Arrays (3.1.2)
The loop used to clear an array of values
   do i = 1 to 16; vdate{i}=.; end;
can be replaced with a single call:
   call missing(of vdate{*});
The MISSING routine assigns missing values to its arguments.
Building a FIFO Stack (3.1.7)
When processing across a series of observations for the calculation of statistics, such as running averages, a stack can be helpful. A stack is a collection of values that have automatic entrance and exit rules. Values tend to rotate through a stack. Stacks come in two basic flavors: First-In-First-Out (FIFO) and Last-In-First-Out (LIFO). In a FIFO stack the oldest value in the stack is removed to make room for the newest value.
Building a FIFO Stack (3.1.7)
A three-day moving average of potassium levels is to be calculated for each subject.
   data Average(keep=subject visit labdt potassium Avg3day);
      set labdates;
      by subject;
      * dimension of array is number of
      * items to be averaged;
      retain visitcnt .;
      /* Index the array starting at 0. */
      array stack {0:2} _temporary_;
      /* Clear the stack for each subject. */
      if first.subject then do;
         call missing(of stack{*});
         visitcnt=0;
      end;
      visitcnt+1;
      /* The MOD function determines the array INDEX. */
      /* The divisor, 3, is the number of items in the stack. */
      index = mod(visitcnt,3);
      stack{index} = potassium;
      avg3day = mean(of stack{*});
   run;
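To see how the MOD index rotates through the stack, the same logic can be run against a few made-up values. This is a sketch only: the data set DEMO, its potassium values, and the output names AVG and AVG3 are invented for illustration.

```sas
/* Hypothetical input: four visits for subject 1, two for subject 2. */
data demo;
   input subject potassium;
   datalines;
1 4.0
1 4.3
1 4.6
1 4.9
2 3.8
2 4.2
;
run;

data avg;
   set demo;
   by subject;
   array stack {0:2} _temporary_;
   if first.subject then do;
      call missing(of stack{*});
      visitcnt = 0;
   end;
   visitcnt + 1;
   /* VISITCNT 1, 2, 3, 4 maps to INDEX 1, 2, 0, 1. */
   index = mod(visitcnt, 3);
   stack{index} = potassium;
   /* MEAN ignores elements that are still missing. */
   avg3 = mean(of stack{*});
run;

proc print data=avg;
run;
```

On the fourth visit of subject 1 the index returns to 1, so the oldest value (4.0) is overwritten by the newest (4.9). Because MEAN skips missing elements, the first two rows of each subject are averages of one and two values rather than three.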
Using SET Statement Options (3.8)
Although a majority of DATA steps use the SET statement, few programmers take advantage of its full potential. The SET statement has options that can be used to control how the data are to be read.
END=       used to detect the last observation from the incoming data set(s)
KEY=       specifies an index to be used when reading
INDSNAME=  used to identify the current data source
NOBS=      number of observations
OPEN=      determines when to open a data set
POINT=     designates the next observation to read
UNIQUE     used with KEY= to read from the top of the index
Using POINT= and NOBS= (3.8.1)
The SET statement by default reads one observation after another, first observation to last. The POINT= option makes it possible to perform a non-sequential read. The POINT= option identifies a temporary variable that indicates the number of the next observation to read. The NOBS= option also identifies a temporary variable, which after DATA step compilation will hold the number of observations on the incoming data set.
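A minimal sketch of the two options working together reads every third observation from a small made-up data set. The names DEMO, EVERY_THIRD, OBSNUM, and TOTAL are assumptions for illustration.

```sas
/* Hypothetical input with ten observations. */
data demo;
   do id = 1 to 10;
      output;
   end;
run;

data every_third;
   /* TOTAL is set at compile time, so it can be used before SET executes. */
   do obsnum = 1 to total by 3;
      set demo point=obsnum nobs=total;
      output;
   end;
   /* POINT= never reaches end of file, so STOP ends the step. */
   stop;
run;
```

The result holds the observations with ID 1, 4, 7, and 10. Neither OBSNUM nor TOTAL is written to the output data set, since both are temporary variables.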
Using POINT= and NOBS= (3.8.1)
Randomly read a subset of observations from &DSN.
   %macro rand_wo(dsn=,pcnt=0);
   /* Assumes &dsn has no more than 10000 observations */
   /* (the array dimension) and that PCNT is greater than 0. */
   data rand_wo(drop=cnt totl);
      /* Total observations to read. */
      totl = ceil(&pcnt*obscnt);
      array obsno {10000} _temporary_;
      do until(cnt = totl);
         /* Randomly select the next observation to read. */
         point = ceil(ranuni(0)*obscnt);
         if obsno{point} ne 1 then do;
            /* The value of POINT is the next observation to read. */
            set &dsn point=point nobs=obscnt;
            output;
            /* Flag this observation, it has now been read. */
            obsno{point}=1;
            cnt+1;
         end;
      end;
      stop;
   run;
   %mend rand_wo;
   %rand_wo(dsn=advrpt.demog,pcnt=.3)
Using the DOW Loop (3.9.1)
The DOW loop, which is also known as the DO-W loop, was named for Ian Whitlock, who popularized the technique and first demonstrated its efficiencies.
The DATA step has an implied loop. The loop is executed once for each incoming observation.
   data implied;
      set big;
      output implied;
   run;
The implicit loop is replaced with an explicit one (here a DO UNTIL). Terminate the step with a STOP.
   data dowloop;
      do until(eof);
         set big end=eof;
         output dowloop;
      end;
      stop;
   run;
Using the DOW Loop (3.9.1)
Merge a single observation summary data set onto the analysis data. Calculate percent change.
   proc summary data=advrpt.demog;
      var wt;
      output out=means mean=/autoname;
   run;
   data Diff1;
      /* Use the IF to control reading the summary data set. */
      if _n_=1 then set means(keep=wt_mean);
      set advrpt.demog(keep=lname fname wt);
      diff = (wt-wt_mean)/wt_mean;
   run;
Using the DOW Loop (3.9.1)
Merge a single observation summary data set onto the analysis data. Calculate percent change, using a DOW loop to merge.
   data Diff2;
      /* The IF is not needed. */
      set means(keep=wt_mean);
      do until(eof);
         set advrpt.demog(keep=lname fname wt) end=eof;
         diff = (wt-wt_mean)/wt_mean;
         output diff2;
      end;
      /* Terminate the DATA step. */
      stop;
   run;
Key Indexing – A Simple Hash (6.7.2)
Use an array to hold values. Neither data set needs to be sorted.
   data clinnames(keep=subject lname fname clinnum clinname);
      array chkname {999999} $35 _temporary_;
      /* Read and store the names, indexed by clinic number. */
      do until(allnames);
         set advrpt.clinicnames end=allnames;
         chkname{input(clinnum,6.)}=clinname;
      end;
      /* Read a number and retrieve its corresponding name from the array. */
      do until(alldemog);
         set advrpt.demog(keep=subject lname fname clinnum)
             end=alldemog;
         clinname = chkname{input(clinnum,6.)};
         output clinnames;
      end;
      stop;
   run;
For comparison, the MERGE that this step replaces, which requires both data sets to be sorted by CLINNUM:
   data clinnames;
      merge demog clinicnames;
      by clinnum;
   run;
Use a Hash Object (6.7.2)
Hash objects also avoid the need to sort.
   data hashnames(keep=subject clinnum clinname lname fname);
      if 0 then set advrpt.clinicnames;
      /* Load the names in the LOOKUP hash object, indexed by clinic number. */
      declare hash lookup(dataset: 'advrpt.clinicnames', hashexp: 8);
      lookup.defineKey('clinnum');
      lookup.defineData('clinname');
      lookup.defineDone();
      * Read the primary data;
      do until(done);
         set advrpt.demog(keep=subject clinnum lname fname) end=done;
         /* For each number retrieve the name. */
         if lookup.find() = 0 then output hashnames;
      end;
      stop;
   run;
For comparison, the MERGE that this step replaces:
   data hashnames;
      merge demog clinicnames;
      by clinnum;
   run;
Many of the examples in this talk are based on material found in the new SAS Press book:
Carpenter's Guide to Innovative SAS® Techniques by Art Carpenter and are used with permission of the author.