14
Getting the Most from Hash Objects Bharath Gowda

Getting the Most from Hash Objects Bharath Gowda Groups...Getting the most from Hash objects Techniques covered are: •SQL join •Data step merge using BASE engine •Data step merge

Embed Size (px)

Citation preview

Getting the Most from Hash Objects

Bharath Gowda

Getting the most from Hash objects

Techniques covered are:

• SQL join

• Data step merge using BASE engine

• Data step merge using SPDE merge

• Index Key lookup

• Hash object lookup

• Simple join, iteration and sorting with hash

Test setup

• Only using local SAS datasets.

• Datasets (including SPDE) reside in the local work library directory.

• Storing data locally nullifies the network delays and I/O delays.

• The environment has a super fast solid state hard disk which decreases I/O delay.

• Same join logic is implemented over all the techniques.

PROC SQL

• Requires no sorting prior to joins

• Multi way joins can be performed.

• SQL uses internal utility tables for joins

Rows in pred_cost 13,160,905

Rows in speciality 16,170,805

Rows in acute_pred(inner join)

12,001,616

real time user cpu time

0:20:28.790:14:15.31

Memory 254512.00k

Proc sql;

create table acute_pred as

select a.*,b.nwau_sas from

pred_cost as a inner join speciality

as b

on a.yearid = b.yearid and

a.facility_identifier=

b.facility_identifier

;

quit;

Data step merge

• It requires a sort first by the key variables in all the input datasets.

Rows in pred_cost 13,160,905

Rows in speciality 16,170,805

Rows in acute_pred(inner join)

12,001,616

real time user cpu time (avg

combined sorting and merging)

0:30:28.790:21:15.31

Memory(combined avg) 162393.59k

Proc sort data=pred_cost out= srt_pred_cost;

by yearid facility_identifier;

Run;

Proc sort speciality out= srt_spec (keep= yearid

facility_identifier nwau_sas);

by yearid facility_identifier;

Run;

Data acute_pred;

Merge srt_pred_cost(in=a) srt_spec(in=b);

by yearid facility_identifier;

If a and b;

Run;

SPDE merge

• SPDE stands for Scalable Performance data engine

• SPDE combines software and hardware capabilities.

Rows in pred_cost 13,160,905

Rows in speciality 16,170,805

Rows in acute_pred(simple join)

12,001,616

real time user cpu time

0:16:15.310:12:28.79

Memory 1261040.70k

libname workspde spde "%sysfunc(pathname(work))" temp=yes;

proc copy in=work out= workspde;

select pred_cost specialty;

run;

data workspde.acute_pred;

merge workspde.Hiepred_acutecost_1718(in=a) workspde.nwau16

(in=b keep= nwau_sas yearid facility_identifier);

by yearid facility_identifier;

If a and b;

run;

Index key lookup

• Main dataset is not sorted.

• Multiple set statements with key= option .

• _IORC_ is one of the key automatic variable which needs attention.

Rows in pred_cost 13,160,905

Rows in speciality 16,170,805

Rows in acute_pred(inner join)

12,001,616

real time user cpu time

0:12:15.310:06:28.79

Memory 142279.71k

Proc datasets lib=work nolist;

Modify speciality;

Index create myindex=(yearid facility_identifier);

run;

Data acute_pred;

set pred_cost;

Set speciality key=myindex ;

If _IORC_ =0 ;

run;

HASH Objects

• One of the fastest approaches for the look up activities.

• In memory computation makes this technique the fastest.

• Dataset size is a key factor for memory consumption.

• Pre sorting is not required.

• Hash object needs to be defined and instantiated.

• Lookups are performed with the find() function.

Data acute_pred;

length nwau_sas 8.;

If _n_=1 then do;

declare hash hn16(dataset: ‘speciality’,duplicate:’e’);

hn16.definekey(‘yearid’,’facility_identifier’);

hn16.definedata(‘nwau_sas’);

hn16.definedone();

end;

set pred_cost;

Rc=hn16.find(key :yearid, key:facility_identifier);

If rc=0;

Run;

HASH Objects

• Simple inner joins and left joins can be achieved by keeping a check on the return code variable.

Rows in pred_cost 13,160,905

Rows in speciality 16,170,805

Rows in acute_pred(inner join)

12,001,616

real time user cpu time

0:08:28.790:03:15.31

Memory 1465437.28k

if rc=0;

• Hash iterator objects need to be defined and instantiated.

• Hash objects are assigned to hash iterator objects.

Data top bottom;

length nwau_sas 8.;

If _n_=1 then do;

declare hash hn16(dataset:’speciality’, ordered:’descending’);

hn16.definekey(‘yearid’,’facility_identifier’);

hn16.definedata(all:’Y’);

hn16.definedone();

Declare hiter iter_ex(‘hn16’);

End;

/*hn16.output(dataset:’sort_dec’)*/

HASH Iterators

HASH Iterators

• First() and next() functions to iterate top to bottom and last() and prev() functions to iterate bottom to top

• Can be used to get the top and bottom records.

Declare hiter iter_ex(‘hn16’);

End;

Iter_ex.First();

do I = 1 to 10;

output top;

Iter_ex.Next();

end;

Iter_ex.last();

do I = 1 to 10;

output bottom;

Iter_ex.prev();

end;

Run;

Avoid or use?

• Where should you apply these techniques?

Type Proc SQLDatastep

Merge

Index lookup (key=)

SPDE Merge

Hash lookup

Hash Iterators

Small to medium sized tables (< 10,000 rows) ✓ ✓

Huge table (> 10 million rows) ✓ ✓ ✓ ✓ ✓

Unsorted data(> 10 million rows) ✓ ✓ ✓ ✓ ✓

Datasets with multiple indexes ✓ ✓

Less memory hungry techniques ✓ ✓

Getting the most from Hash objects

Bharath Gowda, SAS analyst

Independent Contractor

0468304568 | [email protected]