
Getting the Most from Hash Objects

Bharath Gowda


Techniques covered are:

• SQL join

• Data step merge using BASE engine

• Data step merge using the SPDE engine

• Index Key lookup

• Hash object lookup

• Simple join, iteration and sorting with hash

Test setup

• Only local SAS datasets are used.

• Datasets (including SPDE) reside in the local work library directory.

• Storing data locally removes network delay and keeps I/O delay low.

• The environment has a fast solid-state drive, which further reduces I/O delay.

• Same join logic is implemented over all the techniques.
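
The slides report real time, user CPU time and memory for each technique but do not say how the figures were captured. One common way, assumed here, is the FULLSTIMER system option, which makes SAS write these statistics to the log after every step. The tables toy_pred and toy_spec and their values are hypothetical stand-ins for pred_cost and speciality.

```sas
/* Assumption: the timings come from FULLSTIMER; the slides do not confirm this */
options fullstimer;

/* Hypothetical toy data with the same key variables as the slides */
data toy_pred;
  input yearid facility_identifier $ cost;
  datalines;
2017 A01 100
2017 B02 250
2018 A01 120
;
run;

data toy_spec;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 A01 1.7
;
run;

/* Any step run now gets real time, CPU time and memory lines in the log */
proc sql;
  create table toy_join as
  select a.*, b.nwau_sas
  from toy_pred as a
       inner join toy_spec as b
       on  a.yearid = b.yearid
       and a.facility_identifier = b.facility_identifier;
quit;
```

Running each technique on the same inputs in a fresh session keeps the log figures comparable.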

PROC SQL

• No sorting is required before the join.

• Multi-way joins can be performed.

• SQL uses internal utility tables for joins.

Rows in pred_cost: 13,160,905

Rows in speciality: 16,170,805

Rows in acute_pred (inner join): 12,001,616

Real time: 0:20:28.79

User CPU time: 0:14:15.31

Memory: 254512.00k

proc sql;
  create table acute_pred as
  select a.*, b.nwau_sas
  from pred_cost as a
       inner join speciality as b
       on  a.yearid = b.yearid
       and a.facility_identifier = b.facility_identifier;
quit;

Data step merge

• All input datasets must first be sorted by the key variables.




Rows in acute_pred (inner join): 12,001,616

Real time (avg, combined sorting and merging): 0:30:28.79

User CPU time (avg, combined sorting and merging): 0:21:15.31

Memory (combined avg): 162393.59k

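/* Sort both inputs by the join keys */
proc sort data=pred_cost out=srt_pred_cost;
  by yearid facility_identifier;
run;

proc sort data=speciality out=srt_spec(keep=yearid facility_identifier nwau_sas);
  by yearid facility_identifier;
run;

/* Match-merge on the keys; keep only rows present in both inputs */
data acute_pred;
  merge srt_pred_cost(in=a) srt_spec(in=b);
  by yearid facility_identifier;
  if a and b;
run;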

SPDE merge

• SPDE stands for Scalable Performance Data Engine.

• SPDE combines software and hardware capabilities: it partitions data across files and reads them with multiple threads.



Rows in acute_pred (simple join): 12,001,616

Real time: 0:16:15.31

User CPU time: 0:12:28.79

Memory: 1261040.70k

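/* SPDE library located in the WORK directory */
libname workspde spde "%sysfunc(pathname(work))" temp=yes;

/* Copy the two input tables into the SPDE library */
proc copy in=work out=workspde;
  select pred_cost speciality;
run;

/* SPDE can sort implicitly for BY processing, so no PROC SORT is used here */
data workspde.acute_pred;
  merge workspde.pred_cost(in=a)
        workspde.speciality(in=b keep=nwau_sas yearid facility_identifier);
  by yearid facility_identifier;
  if a and b;
run;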

Index key lookup

• The main dataset does not need to be sorted.

• Multiple SET statements are used, the second with the KEY= option.

• _IORC_ is an automatic variable that must be checked after every keyed read.




Rows in acute_pred (inner join): 12,001,616

Real time: 0:12:15.31

User CPU time: 0:06:28.79

Memory: 142279.71k

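/* Build a composite index on the lookup table */
proc datasets lib=work nolist;
  modify speciality;
  index create myindex=(yearid facility_identifier);
run;
quit;

data acute_pred;
  set pred_cost;
  /* /unique restarts the index search for every row, so repeated keys in pred_cost still match */
  set speciality key=myindex/unique;
  if _iorc_ ne 0 then do;
    _error_ = 0;   /* a failed lookup sets _ERROR_; reset it to keep the log clean */
    delete;        /* inner join: drop rows without a match */
  end;
run;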
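
Because _IORC_ carries the outcome of each keyed read, the same pattern can be turned into a left join. The sketch below is one way to do that, not the method from the slides: it uses the %SYSRC autocall macro to name the return codes, and the tables toy_pred and toy_spec with their values are hypothetical.

```sas
/* Hypothetical driver table; the first two rows repeat the same key */
data toy_pred;
  input yearid facility_identifier $ cost;
  datalines;
2017 A01 100
2017 A01 110
2018 Z99 300
;
run;

/* Hypothetical lookup table, indexed as it is created */
data toy_spec(index=(myindex=(yearid facility_identifier)));
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 B02 2.0
;
run;

data toy_left;
  set toy_pred;
  set toy_spec key=myindex/unique;
  select (_iorc_);
    when (%sysrc(_sok)) ;              /* match found: nwau_sas was read */
    when (%sysrc(_dsenom)) do;         /* no match: keep the row */
      _error_ = 0;                     /* clear the error flag set by the failed read */
      call missing(nwau_sas);          /* do not carry over the previous match */
    end;
    otherwise do;                      /* any other code is unexpected */
      put 'ERROR: unexpected ' _iorc_=;
      stop;
    end;
  end;
run;
```

Resetting nwau_sas matters because variables read by SET are retained, so an unmatched row would otherwise inherit the value from the last successful lookup.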

HASH Objects

• One of the fastest approaches for lookups.

• The lookup table is held in memory, which is what makes this technique fast.

• The size of the lookup dataset drives memory consumption.

• Pre-sorting is not required.

• The hash object must be declared and instantiated before use.

• Lookups are performed with the find() method.

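data acute_pred;
  length nwau_sas 8;
  if _n_ = 1 then do;
    /* Load speciality into memory once; duplicate:'e' reports an error on duplicate keys */
    declare hash hn16(dataset:'speciality', duplicate:'e');
    hn16.definekey('yearid', 'facility_identifier');
    hn16.definedata('nwau_sas');
    hn16.definedone();
  end;
  set pred_cost;
  /* find() returns 0 when the key exists and copies nwau_sas into the current row */
  rc = hn16.find(key:yearid, key:facility_identifier);
  if rc = 0;
run;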
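
Since the size of the lookup table drives memory use, it can help to estimate the footprint before loading a large table. The sketch below is an illustration under assumptions, not part of the slides: it reads the NUM_ITEMS and ITEM_SIZE attributes of a hash object loaded from a small hypothetical table (toy_spec) and multiplies them. The result excludes the hash object's own overhead, so treat it as a rough lower bound.

```sas
/* Hypothetical lookup table */
data toy_spec;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 B02 2.0
2016 C03 0.7
;
run;

data _null_;
  if 0 then set toy_spec;            /* defines the variables at compile time */
  declare hash h(dataset:'toy_spec');
  h.definekey('yearid', 'facility_identifier');
  h.definedata('nwau_sas');
  h.definedone();

  n_items        = h.num_items;      /* number of items loaded */
  bytes_per_item = h.item_size;      /* size of one item in bytes */
  approx_kb      = n_items * bytes_per_item / 1024;
  put n_items= bytes_per_item= approx_kb=;
  stop;                              /* no SET executes, so end the step explicitly */
run;
```

Scaling bytes_per_item by the row count of the full table gives a first estimate of whether the lookup fits in the memory available to the session.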

HASH Objects

• Simple inner joins and left joins can be achieved by checking the return code of find().




Rows in acute_pred (inner join): 12,001,616

Real time: 0:08:28.79

User CPU time: 0:03:15.31

Memory: 1465437.28k

if rc=0;
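
The subsetting statement above gives an inner join. For a left join, every driver row is kept and the looked-up value is set to missing when find() fails. The sketch below shows this under assumptions: the tables toy_pred and toy_spec and their values are hypothetical, and the hash object name h is illustrative.

```sas
/* Hypothetical driver and lookup tables */
data toy_pred;
  input yearid facility_identifier $ cost;
  datalines;
2017 A01 100
2018 Z99 300
;
run;

data toy_spec;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 B02 2.0
;
run;

data toy_left;
  if _n_ = 1 then do;
    if 0 then set toy_spec(keep=nwau_sas);   /* defines nwau_sas at compile time */
    declare hash h(dataset:'toy_spec');
    h.definekey('yearid', 'facility_identifier');
    h.definedata('nwau_sas');
    h.definedone();
  end;
  set toy_pred;
  rc = h.find();                             /* uses the key values of the current row */
  if rc ne 0 then call missing(nwau_sas);    /* left join: keep the row, blank the value */
  drop rc;
run;
```

Replacing the call missing line with a subsetting `if rc = 0;` turns the same step back into an inner join.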

• A hash iterator object must be declared and instantiated before use.

• Each hash iterator is attached to an existing hash object when it is declared.

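data top bottom;
  if 0 then set speciality;   /* defines all speciality variables at compile time */
  if _n_ = 1 then do;
    /* ordered:'descending' keeps the items in descending key order */
    declare hash hn16(dataset:'speciality', ordered:'descending');
    hn16.definekey('yearid', 'facility_identifier');
    hn16.definedata(all:'Y');
    hn16.definedone();
    declare hiter iter_ex('hn16');
  end;
  /* Optional: hn16.output(dataset:'sort_dec'); writes the items, in key order, to a dataset */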

HASH Iterators


• The first() and next() methods iterate from top to bottom; the last() and prev() methods iterate from bottom to top.

• Can be used to get the top and bottom records.

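    declare hiter iter_ex('hn16');
  end;

  /* First 10 items; with ordered:'descending' these are the highest keys */
  rc = iter_ex.first();
  do i = 1 to 10 while (rc = 0);
    output top;
    rc = iter_ex.next();
  end;

  /* Last 10 items, read from the bottom upwards */
  rc = iter_ex.last();
  do i = 1 to 10 while (rc = 0);
    output bottom;
    rc = iter_ex.prev();
  end;

  stop;        /* the step reads no input rows, so end it explicitly */
  drop i rc;
run;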
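
The commented-out output() call on the earlier slide points at the third use listed at the start: sorting with a hash. A minimal sketch follows, under assumptions: the table toy_spec and its values are hypothetical, and multidata:'Y' is added here (it is not in the slides) so that rows sharing a key are kept rather than dropped on load.

```sas
/* Hypothetical unsorted input */
data toy_spec;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 B02 2.0
2016 C03 0.7
;
run;

data _null_;
  if 0 then set toy_spec;            /* defines the variables at compile time */
  declare hash h(dataset:'toy_spec', ordered:'descending', multidata:'Y');
  h.definekey('yearid', 'facility_identifier');
  h.definedata(all:'Y');
  h.definedone();
  h.output(dataset:'toy_sorted');    /* writes the items in descending key order */
  stop;                              /* no SET executes, so end the step explicitly */
run;
```

The whole table has to fit in memory for this to work, so for very large inputs PROC SORT remains the safer choice.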

Avoid or use?

• Where should you apply these techniques?

Type | Proc SQL | Data step merge | Index lookup (key=) | SPDE merge | Hash lookup | Hash iterators

Small to medium sized tables (< 10,000 rows) ✓ ✓

Huge table (> 10 million rows) ✓ ✓ ✓ ✓ ✓

Unsorted data(> 10 million rows) ✓ ✓ ✓ ✓ ✓

Datasets with multiple indexes ✓ ✓

Less memory hungry techniques ✓ ✓


Bharath Gowda, SAS analyst

Independent Contractor

0468304568
