Getting the most from Hash objects
Techniques covered are:
• SQL join
• Data step merge using BASE engine
• Data step merge using the SPDE engine
• Index Key lookup
• Hash object lookup
• Simple join, iteration and sorting with hash
Test setup
• Only local SAS datasets are used.
• Datasets (including SPDE) reside in the local work library directory.
• Storing data locally removes network and shared-storage I/O delays.
• The test machine has a fast solid-state disk, which further reduces I/O delay.
• The same join logic is implemented across all the techniques.
PROC SQL
• Requires no sorting prior to joins.
• Multi-way joins can be performed.
• SQL uses internal utility tables for joins.
Rows in pred_cost: 13,160,905
Rows in speciality: 16,170,805
Rows in acute_pred (inner join): 12,001,616
Real time: 0:20:28.79 | User CPU time: 0:14:15.31
Memory: 254,512.00k
proc sql;
  create table acute_pred as
  select a.*, b.nwau_sas
  from pred_cost as a
       inner join speciality as b
       on a.yearid = b.yearid
          and a.facility_identifier = b.facility_identifier;
quit;
Data step merge
• Requires all input datasets to be sorted by the key variables first.
Rows in pred_cost: 13,160,905
Rows in speciality: 16,170,805
Rows in acute_pred (inner join): 12,001,616
Real time: 0:30:28.79 | User CPU time: 0:21:15.31 (avg, sorting and merging combined)
Memory (combined avg): 162,393.59k
proc sort data=pred_cost out=srt_pred_cost;
  by yearid facility_identifier;
run;

proc sort data=speciality out=srt_spec (keep=yearid facility_identifier nwau_sas);
  by yearid facility_identifier;
run;

data acute_pred;
  merge srt_pred_cost(in=a) srt_spec(in=b);
  by yearid facility_identifier;
  if a and b;
run;
SPDE merge
• SPDE stands for Scalable Performance Data Engine.
• SPDE combines software capabilities (threaded, partitioned I/O) with the underlying hardware.
Rows in pred_cost: 13,160,905
Rows in speciality: 16,170,805
Rows in acute_pred (inner join): 12,001,616
Real time: 0:16:15.31 | User CPU time: 0:12:28.79
Memory: 1,261,040.70k
libname workspde spde "%sysfunc(pathname(work))" temp=yes;

proc copy in=work out=workspde;
  select pred_cost speciality;
run;

data workspde.acute_pred;
  merge workspde.pred_cost(in=a)
        workspde.speciality(in=b keep=nwau_sas yearid facility_identifier);
  by yearid facility_identifier;
  if a and b;
run;
Index key lookup
• The main dataset is not sorted.
• Multiple SET statements are used, with the KEY= option on the lookup dataset.
• _IORC_ is the automatic variable that reports the result of each keyed lookup and must be checked.
Rows in pred_cost: 13,160,905
Rows in speciality: 16,170,805
Rows in acute_pred (inner join): 12,001,616
Real time: 0:12:15.31 | User CPU time: 0:06:28.79
Memory: 142,279.71k
proc datasets lib=work nolist;
  modify speciality;
  index create myindex=(yearid facility_identifier);
run;
quit;

data acute_pred;
  set pred_cost;
  set speciality key=myindex;
  if _iorc_ ne 0 then do;
    _error_ = 0;  /* suppress log errors for unmatched keys */
    delete;
  end;
run;
HASH Objects
• One of the fastest approaches for lookup activities.
• In-memory computation makes this technique the fastest.
• Dataset size is a key factor for memory consumption.
• Pre-sorting is not required.
• A hash object needs to be declared and instantiated.
• Lookups are performed with the find() method.
data acute_pred;
  length nwau_sas 8;
  if _n_ = 1 then do;
    declare hash hn16(dataset:'speciality', duplicate:'e');
    hn16.definekey('yearid', 'facility_identifier');
    hn16.definedata('nwau_sas');
    hn16.definedone();
    call missing(nwau_sas);
  end;
  set pred_cost;
  rc = hn16.find(key:yearid, key:facility_identifier);
  if rc = 0;
run;
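Because dataset size drives the hash object's memory footprint, it can be worth estimating that footprint before committing to this technique. A minimal sketch using the hash object's item_size and num_items attributes (the estimate covers item storage only and ignores per-table overhead):

```
data _null_;
  /* load the lookup table once, then report an approximate footprint */
  declare hash hn16(dataset:'speciality');
  hn16.definekey('yearid', 'facility_identifier');
  hn16.definedata('nwau_sas');
  hn16.definedone();
  bytes = hn16.item_size * hn16.num_items;
  put 'Approximate hash memory: ' bytes comma20. ' bytes';
  stop;
run;
```

If the estimate approaches the available MEMSIZE, one of the disk-based techniques may be the safer choice.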
HASH Objects
• Simple inner joins and left joins can be achieved by checking the find() return code (e.g. if rc=0 keeps only matched rows).
Rows in pred_cost: 13,160,905
Rows in speciality: 16,170,805
Rows in acute_pred (inner join): 12,001,616
Real time: 0:08:28.79 | User CPU time: 0:03:15.31
Memory: 1,465,437.28k
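The left-join variant differs only in how the return code is handled: instead of subsetting on rc=0, unmatched rows are kept with the lookup value set to missing. A minimal sketch, reusing the datasets from the inner-join example:

```
data acute_pred_left;
  length nwau_sas 8;
  if _n_ = 1 then do;
    declare hash hn16(dataset:'speciality');
    hn16.definekey('yearid', 'facility_identifier');
    hn16.definedata('nwau_sas');
    hn16.definedone();
    call missing(nwau_sas);
  end;
  set pred_cost;
  rc = hn16.find();
  if rc ne 0 then call missing(nwau_sas);  /* no match: keep the row, blank the lookup value */
  drop rc;
run;
```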
HASH Iterators
• Hash iterator objects need to be declared and instantiated.
• A hash object is assigned to the hash iterator at instantiation.
data top bottom;
  length nwau_sas 8;
  if _n_ = 1 then do;
    declare hash hn16(dataset:'speciality', ordered:'descending');
    hn16.definekey('yearid', 'facility_identifier');
    hn16.definedata(all:'y');
    hn16.definedone();
    declare hiter iter_ex('hn16');
  end;
  /* hn16.output(dataset:'sort_dec'); */
HASH Iterators
• The first() and next() methods iterate top to bottom; last() and prev() iterate bottom to top.
• Can be used to get the top and bottom records.
  rc = iter_ex.first();
  do i = 1 to 10;
    output top;
    rc = iter_ex.next();
  end;
  rc = iter_ex.last();
  do i = 1 to 10;
    output bottom;
    rc = iter_ex.prev();
  end;
run;
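The commented-out output() call above points at another use of ordered hashes: writing the hash back to a dataset yields a sorted copy without a proc sort step. A minimal sketch (the output dataset name is illustrative; multidata:'y' preserves duplicate keys):

```
data _null_;
  /* load the table into a hash ordered by descending key */
  declare hash h(dataset:'speciality', ordered:'descending', multidata:'y');
  h.definekey('yearid', 'facility_identifier');
  h.definedata(all:'y');
  h.definedone();
  h.output(dataset:'speciality_sorted');  /* rows emitted in descending key order */
  stop;
run;
```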
Avoid or use?
• Where should you apply these techniques?
Type | Proc SQL | Datastep Merge | Index lookup (key=) | SPDE Merge | Hash lookup | Hash Iterators
Small to medium sized tables (< 10,000 rows) ✓ ✓
Huge tables (> 10 million rows) ✓ ✓ ✓ ✓ ✓
Unsorted data (> 10 million rows) ✓ ✓ ✓ ✓ ✓
Datasets with multiple indexes ✓ ✓
Less memory-hungry techniques ✓ ✓
Getting the most from Hash objects
Bharath Gowda, SAS analyst
Independent Contractor
0468304568 | [email protected]