Generation Data Groups and their Data Set Metadata

(AKA Dropping Bread Crumbs)

Jess Watkins, Senior Consultant, Scottsdale, Arizona

ABSTRACT

Have you ever wanted to store a short tag and a longer comment with each unique version of a Generation Data Group (GDG)? Have you ever wanted to see a chronological list of all data sets within a GDG showing the complete name with “#nnn”, the creation date and the current relative position number? These information items can be added to, or are already available in, the data set metadata contained in DICTIONARY.TABLES. The presenter will show how GDGs are initiated and documented with descriptive metadata. Then a macro will be shown to access the metadata information. Finally, the presenter will discuss changes in SAS 9.4 that allow further Extended Data Set Attributes.

Generation Data Group – background

Beginning with Version 7, SAS has offered Generation Data Group (GDG) support to provide automatic “archiving” of prior versions of a data set. The programmer initiates the GDG and specifies the number of versions to be saved with the GENMAX= option. SAS then automatically keeps track of the relative age of each archived version and drops the oldest when the genmax limit is exceeded. This is accomplished by renaming the most current “base” version with four additional characters: “#nnn”, where nnn is an ascending sequential integer from 1 to 999. Without explaining how the count wraps after 999, we can simply observe that the oldest archived version is the one with the lowest number. As an example, assume a GDG with data set name “cars” and genmax=4 contains the following generation data sets:

cars#004
cars#005
cars#006
cars

From this list of data sets we know the current “base” version is named “cars”. The oldest archived version was renamed “cars#004”. The latest archived version was renamed “cars#006” when SAS wrote another data set “cars”, replacing it as the base version. The next time SAS creates a data set named “cars”, the existing base version will be archived and given the revised name “cars#007”. Furthermore, because the GDG then exceeds the value of genmax, the oldest version, identified by the lowest sequential number (“cars#004”), will be dropped.
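The behavior described above can be tried with a minimal sketch such as the following. It is an illustration only: the libref MYDEMO is an assumption and is pointed at the WORK directory so the code runs in any session, and SASHELP.CARS simply serves as convenient input.

```sas
/* Assumption: MYDEMO is a hypothetical libref; it is pointed at the */
/* WORK directory here only so that the sketch runs in any session.  */
libname mydemo "%sysfunc(pathname(work))";

/* First write: declare the GDG by specifying GENMAX= */
data mydemo.cars (genmax=4);
    set sashelp.cars;
run;

/* A later replacement: the previous base version is archived and   */
/* the newly written data set becomes the base version "cars".      */
data mydemo.cars;
    set sashelp.cars;
    where origin = "Asia";
run;

/* Inspect the archived version that is one prior to the base */
proc contents data=mydemo.cars (gennum=-1);
run;
```

Repeating the second DATA step more than three additional times would exceed genmax=4, at which point the oldest archived version is dropped.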

Why set up a GDG?

For projects that require the retention of complete prior data set versions, a Generation Data Group provides an easy way to do so without requiring the user to rename backup copies to be archived. Among many project design strategies, two are very common reasons to set up a GDG:

Audit requirements to provide a backward trace and an explanation of how the underlying data evolved.

Disaster backup, wherein the user needs to restore the data to its status of weeks ago and start over.

The GDG creates each archive copy automatically; however, the archived copies receive only a sequence number, not a descriptive name. This makes the task of keeping track of backup versions difficult. The purpose of this paper is to explain an easy solution to this problem.

Accessing dataset versions within a GDG

SAS procedures, PROC SQL and DATA steps can access the base version in the same way they did before the data set was declared a GDG:

data=cars (specifying input to most procedures)
set cars; (specifying input to a DATA step)
from cars (specifying the data set to be used in an SQL query)

To access one of the archived prior versions, a data set option (GENNUM=n) is specified with the data set name. This is called the generation number. If the value is negative, it is a relative generation number, that is, relative to the base data set. For example, from a SAS procedure using a “DATA=” option:

data = cars (gennum = -1) will give access to cars#006, one prior to the base.
data = cars (gennum = -2) will give access to cars#005, two prior to the base.
data = cars (gennum = -3) will give access to cars#004 (the oldest), three prior to the base.
data = cars (gennum = -4) causes a program exception at run time and aborts the step, because that generation does not exist in our example due to the GENMAX option.

Instead of a relative generation number, an actual generation number can be specified. This is accomplished by specifying a positive number for GENNUM:

data = cars (gennum = 5) will give access to cars#005, as long as it is still available.

Even though SAS will translate an actual generation number as in the example above, the programmer may not hard-code “data = cars#005”.
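As a short illustration of the two forms of GENNUM=, the sketch below builds a small GDG with one archived version and then reads that version both ways. The libref MYDEMO and its location are assumptions made so the code is self-contained.

```sas
/* Assumption: hypothetical libref MYDEMO, pointed at the WORK directory */
libname mydemo "%sysfunc(pathname(work))";

/* Build a GDG and replace the base once so one archived version exists */
data mydemo.cars (genmax=4);
    set sashelp.cars;
run;

data mydemo.cars;
    set sashelp.cars;
run;

/* Relative reference: one version prior to the base */
proc print data=mydemo.cars (gennum=-1 obs=5);
run;

/* GENNUM= is an ordinary data set option, so it also works in PROC SQL */
proc sql;
    select count(*) as n_obs
        from mydemo.cars (gennum=-1);
quit;
```

A positive value such as (gennum=5) is used in exactly the same positions when the actual generation number is known and that generation still exists.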

Descriptive information about a dataset – its METADATA

SAS stores a wealth of information about each dataset at the time it is created and/or further modified. Some of the information is strictly descriptive; other information is used by SAS to make future dataset management decisions. The value of “genmax” is one such metadata value. (In addition to metadata information about the dataset, there is also metadata information for each variable. This is outside the scope of this paper.)

Accessing the metadata information can be accomplished in several ways. The two most popular methods are PROC CONTENTS and a PROC SQL query of DICTIONARY.TABLES.

PROC CONTENTS is very easy to run for each individual generation data set in the GDG. All that is required is the GENNUM= value when requesting information about a version other than the current base. However, gathering metadata information into macro variables for use by an application program is a difficult programming task. Furthermore, assembling a descriptive recap of all generation data sets in the GDG requires as many PROC CONTENTS steps as the value of genmax.

A much easier method is to use PROC SQL to gather information from DICTIONARY.TABLES. Its contents reflect the current state of the session whenever it is queried, and it spans every dataset in every library defined in the SAS session. Printing all of it (for example with PROC PRINT of the corresponding view SASHELP.VTABLE, since DICTIONARY.TABLES itself is available only within PROC SQL) will produce a very long report of everything about everything. At the data set level (SAS 9.3) there are 41 individual metadata values for each data set.

To limit the scope, a PROC SQL query specifies the exact data set and lists only the metadata items desired. The library name and data set name are specified with a WHERE clause. The library specification is simply where libname = “MYDEMO”. (Note: the libname value must be all caps.) The dataset name is referred to as memname = “MYDATA”. To access an archived version of a GDG one might try adding (gennum= -2), but that will not work. Within this PROC SQL query context, only the full dataset name with “#nnn” will access a version prior to the base.
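A minimal sketch of such a query follows. The libref MYDEMO and member MYDATA are the placeholder names used in the text; the setup step exists only so that the query returns a row when run on its own.

```sas
/* Assumption: hypothetical libref and data set, created here for the demo */
libname mydemo "%sysfunc(pathname(work))";

data mydemo.mydata (genmax=4);
    set sashelp.class;
run;

/* Query the metadata of the base version only.                 */
/* LIBNAME and MEMNAME values are stored in upper case.         */
proc sql;
    select libname, memname, crdate, nobs, filesize
        from dictionary.tables
        where libname = "MYDEMO"
          and memname = "MYDATA";
quit;
```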

Descriptive recap of all data sets in a GDG

The above conundrum is almost a “chicken and egg” question. One could open the server files view to look up the list of generation #nnn values, make the mental conversion from relative to actual, and enter the full data set name. There is a much easier way that solves two requirements at once: not only is the metadata information on a specific dataset desired, but a recap of all data set members of the GDG is frequently useful too. This paper will illustrate how to solve the second requirement and then use the list of the entire GDG to access a relative version’s specific metadata information. The PROC SQL query is expanded to create a table with one observation for each dataset in the GDG. This is accomplished by expanding the WHERE clause to:

where (memname = “MYDATA” or memname like “MYDATA#%”)

The LIKE condition with the wild card “%” selects all the data sets with a “#” following the GDG base name. The list of metadata values to populate our table is added with the following PROC SQL SELECT:

select libname, memname, crdate, nobs, filesize, maxgen, gen, memlabel from …

Most of these metadata values are identifiable from their names. The data set name is called “memname”. “crdate” is a SAS datetime value showing when the data set was created. “nobs” is the number of observations and “filesize” is the size of the file. The “genmax” value is stored as “maxgen”; “gen” is a sequential integer assigned to every generation when it is first written as the new base version; and “memlabel” is a character value of up to 256 positions, which is empty until a label is assigned.
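Put together, the expanded WHERE clause and SELECT list might look like the sketch below. The output table name WORK.GDG_LIST is an assumption, and the query presumes a libref MYDEMO containing a GDG named MYDATA; without one it simply produces an empty table.

```sas
/* Assumptions: libref MYDEMO and GDG MYDATA exist; WORK.GDG_LIST is */
/* a hypothetical name for the recap table.                          */
proc sql;
    create table work.gdg_list as
    select libname, memname, crdate, nobs, filesize,
           maxgen, gen, memlabel
        from dictionary.tables
        where libname = 'MYDEMO'
          and (memname = 'MYDATA' or memname like 'MYDATA#%')
        order by gen desc;   /* most recent generation first */
quit;

proc print data=work.gdg_list noobs;
run;
```

Single quotes are used around the LIKE pattern so that the “%” is never examined by the macro processor.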

The first step is to translate the value of the metadata variable “gen” into a relative generation number. Because the highest value of “gen” identifies the most recent base version, the relative generation number is derived with “gen - max(gen) as relative” placed into the select list. The base version thus receives 0 and each older version a successively more negative value. At this point many programmers would jump to building the full data set member name with #nnn appended; however, that is not necessary. (Note also the off-by-one difference between the data set #nnn value and the value of gen.) Instead, a second PROC SQL step queries the table just built, and the specific relative version is retrieved with one more qualification: “where relative = &rel.”, where &rel. is a macro variable containing the text value of the relative designation, typically supplied via a user prompt.
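The two-step approach can be sketched as follows. This is an illustration under stated assumptions rather than the paper's own listing: MYDEMO, MYDATA and WORK.GDG_LIST are placeholder names, and &rel is set with %LET here instead of coming from a prompt.

```sas
/* Hypothetical: normally &rel would be supplied by a user prompt */
%let rel = -2;

proc sql;
    /* Step 1: one row per generation, with its relative number.   */
    /* PROC SQL remerges max(gen) onto every row (a NOTE in the    */
    /* log reports the remerge), so base = 0, prior versions < 0.  */
    create table work.gdg_list as
    select memname, crdate, gen,
           gen - max(gen) as relative
        from dictionary.tables
        where libname = 'MYDEMO'
          and (memname = 'MYDATA' or memname like 'MYDATA#%')
        order by relative desc;

    /* Step 2: pick the requested relative version from the table */
    select memname, crdate
        from work.gdg_list
        where relative = &rel.;
quit;
```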

Adding descriptive information to each dataset version

Returning to the question of how to embed tags and comments for each version and avoid the need for separate reference notes, the metadata variable “memlabel” can be used. This metadata variable is referred to as LABEL= in base SAS code. It can be written at the time of data set creation in a DATA step, PROC SQL and some other procedures. It can also be established or overwritten after data set creation with PROC DATASETS MODIFY. In each case the syntax is simply LABEL=. In SAS 9.3 only one such “safe to use” metadata character string is available for programmer assignment.

In this example the 256-position character string is divided into two descriptive elements, a short tag and a longer comment. A unique character such as the tilde “~” is chosen as the delimiting separator. The result is an easy way of placing several pieces of descriptive information into the data set’s metadata. Unfortunately, changing just one of the tag/comment elements without re-specifying the other may present problems. (A discussion of new features in SAS 9.4 at the end of this paper will show exciting solutions to this and many other challenges.)
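A brief sketch of both ways of writing the label is shown below. The libref, data set name, and the tag and comment texts are all invented for illustration; only the LABEL= option and the tilde convention come from the text above.

```sas
/* Assumption: hypothetical libref MYDEMO, pointed at the WORK directory */
libname mydemo "%sysfunc(pathname(work))";

/* Write "tag~comment" into the label when the data set is created */
data mydemo.mydata (genmax=4
                    label="Q3 load~Imported from the September worksheet");
    set sashelp.class;
run;

/* Overwrite the label of the existing base version afterwards */
proc datasets lib=mydemo nolist;
    modify mydata (label="Q3 final~Corrected after review");
quit;

/* The label is now visible as MEMLABEL in DICTIONARY.TABLES */
proc sql;
    select memname, memlabel
        from dictionary.tables
        where libname = 'MYDEMO' and memname = 'MYDATA';
quit;
```

Because the whole string is replaced each time, both elements have to be supplied together, which is the limitation noted above.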

Putting it all together in a macro

The macro about to be described must satisfy the following. Given a library name and GDG name, generate a table of all GDG data set versions, then:

1. Derive and add the relative GDG version number to each
2. Arrange the table in order from most current to oldest
3. Optionally print the table
4. Return the metadata values of a selected version, specified by either a relative number or an actual number; if the number is omitted, the values of the current GDG base version are returned

The macro has four positional parameters:

Libname (required)
Dataset name (required)
Relative version (optional)
Print (optional)

The macro retrieves results and places them into optional, user-specified global macro variables. (If a variable is not defined as global, it will be defined as a local macro variable and discarded upon exit.)

The statements to establish global macro variables prior to the macro program usage are:

%let gdg_name=;
%let gdg_time=;
%let gdg_date=;
%let gdg_tdate=;
%let gdg_tag=;
%let gdg_cmnt=;

The macro is made up of two distinct steps:

Building and printing the GDG table

Retrieving the metadata information for the relative version

Building and printing the GDG table

In addition to declaring local macro variables to hold the four passed values, two other local macro variables are declared. “&matchto” will be used to build the full data set name when an absolute generation number is supplied. “&memlbl” will be used to temporarily hold the value of memlabel for parsing. Note the %UPCASE function applied to both the library name and the data set name.
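The following is a hedged sketch of what this first step could look like; it is not the paper's original listing. The macro name GDG_LIST, the table WORK.GDG_LIST and the convention that the fourth parameter is the word PRINT are assumptions made for illustration.

```sas
/* Hypothetical sketch: the macro name, table name and the PRINT */
/* keyword convention are assumptions, not the author's code.    */
%macro gdg_list(lib, dsn, rel, print);
    %local matchto memlbl;          /* used by the retrieval step */
    %let lib = %upcase(&lib);       /* dictionary values are upper case */
    %let dsn = %upcase(&dsn);

    proc sql noprint;
        create table work.gdg_list as
        select libname, memname, crdate, nobs, filesize, maxgen, gen,
               gen - max(gen) as relative,
               memlabel
            from dictionary.tables
            where libname = "&lib"
              and (memname = "&dsn" or memname like "&dsn.#%")
            order by relative desc;  /* base (0) first, oldest last */
    quit;

    %if %upcase(&print) = PRINT %then %do;
        proc print data=work.gdg_list noobs;
            var memname relative crdate nobs filesize memlabel;
        run;
    %end;
%mend gdg_list;
```

Under these assumptions a call such as %gdg_list(mydemo, mydata, , print) would build and print the recap, with the third positional parameter left empty.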

Retrieving the metadata information

The macro’s WHERE clause makes its selection using either the negative relative version number or an exact match on the positive generation number. Note that the sequence of the WHERE clause evaluation is critical. First, a negative or “0” relative value may produce a match. If not, then and only then is the name comparison evaluated, and it is satisfied only by an exact match on the data set name. Note this works without having to translate the value to an integer.

The PROC SQL SELECT statement places the metadata values into the pre-specified global macro variables. The metadata string stored in memlabel is moved to a temporary local variable named &memlbl. Macro programming is then executed to parse the string value into tag and comment. Note the code will work when only a tag is specified or when only a comment is specified (provided there is a tilde character in position 1).
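A possible shape for this retrieval step is sketched below. It is an assumption-laden illustration, not the author's macro: it presumes a recap table named WORK.GDG_LIST with the columns MEMNAME, CRDATE, RELATIVE and MEMLABEL, builds &matchto with a Z3. format, and uses %QSCAN with the M modifier as one way to honor a leading tilde.

```sas
/* Hypothetical sketch: macro name, table name and formats are assumptions */
%macro gdg_get(dsn, rel);
    %local matchto memlbl;

    /* Guard: the recap table from the first step must exist */
    %if not %sysfunc(exist(work.gdg_list)) %then %do;
        %put NOTE: WORK.GDG_LIST not found, nothing retrieved.;
        %return;
    %end;

    %let dsn = %upcase(&dsn);
    %if %length(&rel) = 0 %then %let rel = 0;   /* omitted = base version */

    /* Full member name, used only when &rel is a positive number */
    %let matchto = &dsn.#%sysfunc(putn(&rel, z3.));

    proc sql noprint;
        select memname,
               put(timepart(crdate), time8.),
               put(datepart(crdate), date9.),
               put(crdate, datetime20.),
               memlabel
            into :gdg_name trimmed, :gdg_time trimmed, :gdg_date trimmed,
                 :gdg_tdate trimmed, :memlbl trimmed
            from work.gdg_list
            where relative = &rel or memname = "&matchto";
    quit;

    /* Split "tag~comment"; the M modifier keeps an empty first word, */
    /* so a label starting with "~" yields an empty tag.              */
    %let gdg_tag  = %qscan(%superq(memlbl), 1, ~, m);
    %let gdg_cmnt = %qscan(%superq(memlbl), 2, ~, m);
%mend gdg_get;
```

A relative value is never positive, so a positive &rel can only be satisfied by the name comparison, while a negative or zero &rel produces a name that matches no member. The target variables are updated globally only if they were created beforehand, for example with the %LET statements listed earlier.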

Examples of usage in a SAS project

In the first example the user is prompted to enter the information to be placed into the tag and comment of a new base generation data set built from importing an Excel worksheet.

A display of the GDG data sets is requested before and after to verify successful completion. Note the value zero is optional, however, the print option must be in the fourth position.

The second example is that of a report application that prompts the user to enter a relative generation number followed by a PROC TABULATE report.

The metadata information is embedded into the report’s output in various locations, such as the TITLE and FOOTNOTE statements, the BOX= option and the KEYLABEL statement.
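As a rough illustration of that idea, the sketch below places macro variable values into each of those locations. The values assigned with %LET are invented stand-ins for what the retrieval macro would return, and SASHELP.CARS replaces the project data so the step runs anywhere.

```sas
/* Hypothetical values: in practice these come from the retrieval macro */
%let gdg_name  = CARS#005;
%let gdg_tdate = 14SEP2013:09:30:00;
%let gdg_tag   = Q3 load;
%let gdg_cmnt  = Imported from the September worksheet;

title1 "Vehicle counts by origin and type";
title2 "Source: &gdg_name, created &gdg_tdate";
footnote "&gdg_tag: &gdg_cmnt";

/* SASHELP.CARS stands in for the selected generation data set */
proc tabulate data=sashelp.cars;
    class origin type;
    table origin, type / box="&gdg_tag";
    keylabel n = "Count (&gdg_name)";
run;

title;
footnote;
```

In a real application the DATA= value would carry the selected generation, for example (gennum=&RelGen).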

The above example is executed a second time with &RelGen changed from “-2” to “-3” and the “Print” option requested:

Could the values of metadata information be used for program decisions within a process? The developer is only limited by his/her imagination.

SAS 9.4 enhancements to programmer specified metadata

With SAS 9.4, programmer-defined metadata values make it possible to add more description and documentation. Referred to as User Defined Extended Attributes, any number can be defined at either the dataset or the variable level. In comparison to using the memlabel metadata element, the advantages are quite numerous:

Unlimited number instead of just memlabel

32K limit on size rather than only 256 character memlabel

Optionally define as a SAS 8 byte numeric value rather than limited to character

Attach the SAS base code that created the data set to its metadata

Retrieve source code text and conditionally execute it in a macro

Beyond text, why not graphics, sound, even videos? Maybe in a future release.

The simple “how to” for the User Defined Attributes (dataset level; see the references for variable level):

Add or overwrite attributes on an existing dataset with PROC DATASETS:

PROC DATASETS lib=mylib nolist;
modify mydata;
XATTR ADD DS tag = "this is a short tag";
XATTR ADD DS cmnt = "this is a much longer comment";
quit;

The programmer will probably create a one-line macro call for this, such as %addextattr (libname, dsname, atname, atvalue);

Retrieving Extended Attribute values will be much more difficult:

ODS OUTPUT ExtendedAttributesDS = work.extattr;
PROC CONTENTS data = ;

To apply the above to GDGs, a relative reference can be used as follows:

PROC DATASETS lib=mylib nolist;
modify mydata (gennum=-2);
XATTR ADD DS tag = "this is a short tag";
XATTR ADD DS cmnt = "this is a much longer comment";
quit;

ODS OUTPUT ExtendedAttributesDS = work.extattr;
PROC CONTENTS data = mylib.mydata (gennum=-2);
run;

Conclusion

Programmer-defined and user-supplied descriptive metadata can be easily added to generation data sets within a GDG. Selection and retrieval of that information is offered in a short macro program that will optionally produce a recap of the metadata information in each GDG dataset. Prior to SAS 9.4, such descriptive information was limited to the 256-character variable named “memlabel”, recorded with LABEL=. With SAS 9.4 the user can now call a macro wrapping PROC DATASETS to add or replace any number of User Defined Extended Attributes of almost unlimited length. Retrieval of such information for review purposes is easily performed with PROC CONTENTS. However, the retrieval and capture of Extended Attributes for use within a program will be somewhat difficult until someone writes a user-defined FUNCTION (not a macro) to return them.

REFERENCES

SAS Data Files: Understanding Generation Data Sets. SAS 9.3 Language Reference: Concepts.
support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000934566.htm

Kirk Paul Lafler: Exploring SAS Generation Data Sets.
www2.sas.com/proceedings/sugi31/253-31.pdf

Lisa Eckler: Generation Why: How Generation Data Sets Can Help. SAS Global Forum 051-2012.
support.sas.com/resources/papers/proceedings12/051-2012.pdf

Elena Muriel & Paul Simkin: Metadata for SAS 9 Programmers. SAS Global Forum 134-2008.
www2.sas.com/proceedings/forum2008/134-2008.pdf

Peter Eberhardt & Ilene Brill: An Introduction to SAS Dictionary Tables. SUGI 31 259-31.
www2.sas.com/proceedings/sugi31/259-31.pdf

Diane Olson: Developer Reveals: Extended Data Set Attributes. SAS Global Forum 135-2013.
support.sas.com/resources/papers/proceedings13/135-2013.pdf

Chris Hemedinger: How to Store Data About Your Data in Your Data.
blogs.sas.com/content/sasdummy/2013/10/17/extended-attributes-sas-94

CONTACT INFORMATION

Wayne (Jess) Watkins
Sr. Consultant
Scottsdale, Arizona
tele (480) 206 3501
email [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.