24
© Copyright IBM Corporation 2012 Trademarks Get started with the IBM InfoSphere DataStage and QualityStage Operations Console Database, Part 1: An introduction Page 1 of 24 Get started with the IBM InfoSphere DataStage and QualityStage Operations Console Database, Part 1: An introduction A deep dive into the key relationships of the schema that defines the Operations Database Len Greenwood ([email protected]) DataStage Core Architect IBM Arron Harden ([email protected]) Senior Software Engineer IBM Geoff McClean ([email protected]) Senior Software Developer IBM Sumit Kumar ([email protected]) Senior Software Developer IBM Skill Level: Intermediate Date: 17 May 2012 This article is a deep dive into the schema of the IBM® InfoSphere® DataStage® and QualityStage® Operations Database, and the tables and columns that make up its key relationships. Specimen SQL queries are included to demonstrate how data can be read from these tables to answer specific operational questions. You can adapt these to build, for example, custom reports based on the operational data collected at your particular DataStage and QualityStage installation. View more content in this series Introduction The DataStage and QualityStage Operations Database (DSODB) is designed as a relational schema whose tables contain information about job runs and system resources used on a system that has a DataStage engine installed. This information

Schema Thatdefines the Operations Database

Embed Size (px)

Citation preview

Page 1: Schema Thatdefines the Operations Database

© Copyright IBM Corporation 2012 TrademarksGet started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 1 of 24

Get started with the IBM InfoSphere DataStageand QualityStage Operations ConsoleDatabase, Part 1: An introductionA deep dive into the key relationships of the schema thatdefines the Operations Database

Len Greenwood ([email protected])DataStage Core ArchitectIBM

Arron Harden ([email protected])Senior Software EngineerIBM

Geoff McClean ([email protected])Senior Software DeveloperIBM

Sumit Kumar ([email protected])Senior Software DeveloperIBM

Skill Level: Intermediate

Date: 17 May 2012

This article is a deep dive into the schema of the IBM® InfoSphere® DataStage®and QualityStage® Operations Database, and the tables and columns that makeup its key relationships. Specimen SQL queries are included to demonstrate howdata can be read from these tables to answer specific operational questions. Youcan adapt these to build, for example, custom reports based on the operationaldata collected at your particular DataStage and QualityStage installation.

View more content in this series

IntroductionThe DataStage and QualityStage Operations Database (DSODB) is designed asa relational schema whose tables contain information about job runs and systemresources used on a system that has a DataStage engine installed. This information

Page 2: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 2 of 24

is used to drive the Operations Console, but can also be queried directly given someknowledge of the key columns of the tables and the relationships among them.

This article describes the main tables and views of the DSODB, including all columnsused as primary or natural keys, and all foreign key relationships defined. It givesexamples of SQL SELECT statements, which can be used to answer specificquestions based on the information held by DSODB.

Online reference material for DSODB 8.7, including descriptions of all current tablesand columns, can be found in the schema document reference in the Resourcessection.

Table diagrams

Some pictures are included to illustrate the relationships between various tables.Each box represents a table or view. Relevant column names are shown, with iconsagainst each column to show their use as follows (Figure 1).

Figure 1. Key to types of column

Note that only the columns of each table that are mentioned in the text of this articleare shown. For a complete list of table columns, consult the schema documentreference in the Resources section.

The arrows between tables show the direction of relationships established by foreignkeys and SQL views, as shown next (Figure 2).

Page 3: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 3 of 24

Figure 2. Key to relationships

The HOST tableThe HOST table serves to partition the whole schema so information collected frommultiple engines can be stored in a single database schema if required (Figure 3).

Figure 3. HOST table keys

Each row in the HOST table represents a system on which all or part of anInfoSphere Information Server engine is installed. The HOSTID column is a surrogateprimary key; The HostName column is the system name as derived from thehostname operating system command or the HA_ALIAS environment variable, if that isset in the environment of the EngMonApp process when it starts.

For systems acting as a conductor node and on which an instance of theEngMonApp process is running, the InstallationDir column is set to a path name.This is the home directory for the Information Server engine installation, such as /opt/IBM/InformationServer/Server/DSEngine. The key of the HOSTS table is thecombination of HostName and InstallationDir, which allows for more than one engineto be installed on the same host system.

Page 4: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 4 of 24

The CreatedTimestamp column is the UTC time at which EngMonApp inserted therow. Each time a ResMonApp instance starts, it looks to see if there is a row in thetable with the values of HostName and InstallationDir it requires, and if so, it updatesthe MonStartTimestamp column; otherwise, it creates a new row. ResMonApp alsocreates new HOST entries, as required.

Remote nodes have InstallationDir set to a single hyphen character, '-'. Theseare systems whose HostNames have been defined in the DSODBConfig.cfg filevia the ResourceNode property. These rows have a CreatedTimestamp, but noMonStartTimestamp entry.

Cascading delete constraints are set so that deleting a row in HOST removes allrows that point back to it. Therefore, records of any jobs that ran on that system andall resource usage records are deleted at the same time.

Tables related to job runsThis section describes the tables used to record job run information in the defaultconfiguration. Note that all such rows ultimately relate to an entry in the HOSTS table(Figure 4). To completely identify a run via a SQL SELECT, so it is necessary to joinon the HOSTID column if there are runs from more than one engine recorded in thedatabase. (The SQL examples below assume that is the case.)

Figure 4. Key columns and relationships between tables related to JOBRUN

The JOBEXEC tableEvery time a new instance of a DataStage and QualityStage job run starts, it createsan initial event describing which version of a job executable has started, when, andwhat its invocation ID string is in the case of a multi-instanceable job. The JOBEXEC

Page 5: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 5 of 24

table holds details about each unique job executable for which a run has been seen.Note that this might not be the same list as would be found by querying the MetadataRepository; a job therein might never have been run, or it might no longer exist if ithas been deleted from the repository since it last ran.

Rows have a surrogate key JOBID used to relate all runs and other details of thisparticular version of the job. Job names are unique within a DataStage project,and project names are unique for a given host. Whenever a job is compiled, thejob's executable has potentially been changed, so one can distinguish betweenversions of a job's executable by including the compilation time. Hence, the uniqueprimary key for the table is a combination of the ProjectName, JobName, andCompilationTimestamp columns, combined with the HOSTID as a foreign key to theHOST table to identify the system where the project is located.

These rows are only created once. Each run of a job either finds an existing row putthere by a previous run of the same version of the job or creates a new one.

Listing 1. SQL Example: Querying the JOBEXEC table-- List names and locations of all job sequences that have-- ever been run on host H.SELECT X.ProjectName, X.FolderPath, X.JobName, X.CompilationTimestampFROM DSODB.JOBEXEC AS XJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = "H" AND X.JobType = "SEQ"ORDER BY X.ProjectName, X.FolderPath, X.JobName, X.CompilationTimestamp

Cascading delete constraints are set so that deleting an entry in JOBEXEC removesall records to do with any runs of that version of the job.

Listing 2. SQL Example: Deleting from the JOBEXEC table-- Delete all monitored information for all jobs that have-- been run from project P on host H.DELETEFROM DSODB.JOBEXECWHERE JOBID IN ( SELECT X.JOBID FROM DSODB.JOBEXEC AS X JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID WHERE H.HostName = "H" AND X.ProjectName = "P" )

The JOBRUN tableOne row is inserted in this table for each monitored run of a particular version ofa job. It is updated as the job run progresses. The RUNID column is a surrogate

Page 6: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 6 of 24

primary key, used to relate run details from other tables (such as job parameters andlog messages). The JOBID column is a foreign key to the row in the JOBEXEC tablethat contains the description of the exact version of the job executable that is running— which host is started on, in which project, the job's name, and its compilationtime. The other columns that form the unique primary key are InvocationId andCreationTimestamp. The former is only relevant for jobs with the multi-instanceableproperty set; this allows more than one instance of a job to be running at once,as long as each instance has a different invocation ID, which is an arbitrary stringassigned by the user at runtime. Note that if a run does not have an invocationID string, the value is set to a single hyphen. The CreationTimestamp is the UTCtime when the monitoring system created the first run event. Note that this mightbe different from the RunStartTimestamp, in the case of a run that gets queued.The RunStartTimestamp is the UTC time that the engine really launched the run, asopposed to when it was first submitted for running.

If a run is started from a job sequence, or via job control from another job, theCONTROLLING_RUNID column contains a foreign key to another entry in JOBRUN,which is the run that started this one. Therefore, all runs started by a particularinstance of a sequence can be related by their common CONTROLLING_RUNID,and these potentially form a tree-structured set of relationships if job sequences arenested.

As a run progresses, it generates other events when its status changes or when itproduces new values for some of its monitored properties. These cause its row inJOBRUN to be updated, and the LastUpdateTimestamp column records the UTCtime that last happened.

A note about run statuses: there are two columns in the JobRun table that holdstatus values: RunMajor Status and RunMinorStatus. The major status can be usedto quickly filter those runs that have finished from those that are starting up or stillrunning. The minor status gives the specific status for the run (see the schemadocument reference in the Resources section for all possible values).

Listing 3. SQL Example: Querying the JOBRUN table - 1

-- List start and finish times, names, and statuses of all jobs that have-- been run on host H, started after time YYYY-MM-DD HH:MM:SS,-- and have finished.SELECT X.ProjectName, X.JobName, R.RunStartTimestamp, R.RunEndTimestamp, R.RunMinorStatusFROM DSODB.JOBRUN AS RJOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBIDJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = "H" AND R.RunMajorStatus = "FIN" AND R.RunStartTimestamp >= "YYYY-MM-DD HH:MM:SS"ORDER BY R.RunStartTimestamp

Page 7: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 7 of 24

Listing 4. SQL Example: Querying the JOBRUN table - 2

-- List names of all jobs on host H that were running at time T and sort them-- in descending order of total CPU usage for the whole run.SELECT X.ProjectName, X.FolderPath, X.JobName, R.RunStartTimestamp, R.RunEndTimestamp, R.TotalCPUFROM DSODB.JOBRUN AS RJOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBIDJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = "H" AND R.RunStartTimestamp <= "YYYY-MM-DD HH:MM:SS" AND ( R.RunEndTimestamp >= "YYYY-MM-DD HH:MM:SS" OR R.RunEndTimestamp IS NULL )ORDER BY R.TotalCPU DESC

Cascading delete constraints are set so that deleting an entry in JOBRUN deletes allrelated details of the run (parameters, logs, metrics, etc.). If the run started other runs(if it was a job sequence, for example), all those runs are also deleted, along withtheir details.

Listing 5. SQL Example: Deleting from the JOBRUN table

-- Delete all job runs from project P on host H-- that finished normally.DELETEFROM DSODB.JOBRUNWHERE RUNID IN ( SELECT R.RUNID FROM DSODB.JOBRUN AS R JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID WHERE H.HostName = "H" AND X.ProjectName = "P" AND R.RunMinorStatus = '"FOK" )

The JOBRUNPARAMS table and JOBRUNPARAMSVIEW view

There is one row in the JOBRUNPARAMS table to describe the values of the jobparameters that have been set for a particular run. These are the values you wouldsee in a message at the beginning of the run's log in the Director client. The RUNIDcolumn of this table is a foreign key to the JOBRUN table for the run to which theseparameters relate.

The ParamList column contains the information as an XML string; you caneasily extract the names and values for the parameters by querying through theJOBRUNPARAMSVIEW view. This expands the XML into separate rows for eachindividual ParamName and ParamValue, with RUNID as the same foreign key.

Page 8: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 8 of 24

Listing 6. SQL Example: Querying the JOBRUNPARAMSVIEW view-- List names and start times of all job runs on host H-- that contained a parameter named P which had the value V at run time.SELECT X.ProjectName, X.JobName, R.RunStartTimestampFROM DSODB.JOBRUN AS RJOIN DSODB.JOBRUNPARAMSVIEW AS P ON P.RUNID = R.RUNIDJOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBIDJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = "H" AND P.ParamName = "P" AND P.ParamValue = "V"ORDER BY X.ProjectName, X.JobName, R.RunStartTimestamp

Constraints are set so that deleting an entry in JOBRUN deletes any correspondingentry in JOBRUNPARAMS.

The JOBRUNLOG table

As a job runs, certain of its log messages are captured and stored in theJOBRUNLOG table, where each row's RUNID column is a foreign key back to aRUNID in the JOBRUN table. Exactly which messages are captured depends onhow the DSODBConfig.cfg file is configured, but in general this is a small subsetof the overall jog logs, and contains only the job's initial and final messages, fatalmessages, and the first N warnings.

The EventId column is an integer that can be used to get messages in the orderemitted by the run, since the LogTimestamp column is the UTC time to the nearestsecond and might not be unique.

Listing 7. SQL Example: Querying the JOBRUNLOG table-- List job details and message info for all runs on host H that have-- emitted fatal messages since time YYYY-MM-DD HH:MM:SS.SELECT X.ProjectName, X.JobName, R.RunStartTimestamp, L.LogTimestamp, L.MessageId, L.MessageTextFROM DSODB.JOBRUN AS RJOIN DSODB.JOBRUNLOG AS L ON R.RUNID = L.RUNIDJOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBIDJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = "H" AND L.LogType = "FAT" AND R.RunStartTimestamp > "YYYY-MM-DD HH:MM:SS"ORDER BY R.RunStartTimestamp, R.RUNID, L.EventID

Constraints are set so that deleting an entry in JOBRUN deletes all correspondingentries in JOBRUNLOG.

The PARALLELCONFIG table and PARALLELCONFIGNODES view

Parallel job runs always specify a path to a file containing configuration informationfor the parallel engine (via the APT_CONFIG_FILE environment variable). The path

Page 9: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 9 of 24

is recorded in the ConfigFileName column of the JOBRUN table. However, thatpath might point to a temporary file, or the contents of the file my have changedsince the job ran. So the monitoring system also reads and parses the file andstores information about how many nodes the run was configured for in thePARALLELCONFIG table.

Since many runs can use the same configuration file, or at least files whoseparallelization instructions are equivalent, the schema saves space by making allruns with equivalent configurations point to the same row. So the CONFIGID columnin this table is a surrogate key and the CONFIGID column on the JOBRUN table is aforeign key to it. The PARALLELCONFIG table also has HOSTID as a foreign key, soconfigurations are partitioned by host that the run started on.

The list of physical and logical nodes in the configuration is held as an XML column.Use the PARALLELCONFIGNODES view to query the physical name of each nodein the configuration and the number of logical nodes assigned to each. (Note that theNodeListHash column is used simply to find whether an existing row with the sameconfiguration exists, since the XML column cannot be used directly in a lookup.)

Listing 8. SQL Example: Querying the PARALLELCONFIGNODES view-- Get the names of all job runs started on host H that used a node named N-- and ran between times YYYY-MM-D1 HH:MM:SS and YYYY-MM-D2 HH:MM:SS.SELECT R.RunStartTimestamp, X.ProjectName, X.JobNameFROM DSODB.JOBRUN AS RJOIN DSODB.PARALLELCONFIGNODES AS N ON R.CONFIGID = N.CONFIGIDJOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBIDJOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTIDWHERE H.HostName = 'H' AND N.PhysicalName = "N" AND R.RunStartTimestamp >= "YYYY-MM-D1 HH:MM:SS" AND R.RunEndTimestamp <= "YYYY-MM-D2 HH:MM:SS"ORDER BY R.RunStartTimestamp

Constraints are set so that deleting a row in HOST deletes all PARALLELCONFIGentries generated by runs on that host (and all the runs are deleted at thesame time). However, deleting entries in JOBRUN does not delete entries inPARALLELCONFIG; they might be linked to by a later run of another job on that host.

Listing 9. SQL Example: Deleting from the PARALLELCONFIG table-- Delete all rows in PARALLELCONFIG that are not referenced-- by any row in JOBRUN.DELETEFROM DSODB.PARALLELCONFIGWHERE NOT EXISTS ( SELECT * FROM DSODB.JOBRUN AS R JOIN DSODB.PARALLELCONFIG AS C ON R.CONFIGID = C.CONFIGID )

Page 10: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 10 of 24

Tables related to job run usage

The JOBRUNUSAGE table is only populated if the DSODBConfig.cfg file has theproperty JobRunUsage=1, which is the default setting (Figure 5). This table is usedby the Operations Console UI to generate a graph showing the progress of a job runover time.

Figure 5. Key columns and relationships between tables related toJOBRUNUSAGE

The JOBRUNUSAGE table and JOBRUNTOTALROWSUSAGE view

During the course of a server or parallel job run, the number of rows read and writtenby its source and target stages (if any) is collected on a timed basis. So a snapshotof the volume of data processed by the run is built up at intervals, and can be usedto graph its progress over time. Each row in JOBRUNUSAGE contains an XMLstructure containing a set of snapshots taken at increasing elapsed times since thestart of the job. The set is identified by the StartTimestamp column, which givesthe UTC time at which the set of snapshots starts, and the RUNID column, whichis a foreign key to the JOBRUN table to say which run this relates to. These rowsshould be queried through the JOBRUNTOTALROWSUSAGE view. This expands thesnapshot sets into individual snapshots that give the elapsed time since the start ofthe run, and the values of TotalRowsConsumed and TotalRowsProduced at that pointin the run.

Page 11: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 11 of 24

Listing 10. SQL Example: Querying the JOBRUNTOTALROWSUSAGE view

-- List the total rows consumed and produced, for each recorded interval -- by ascending time since the start of the run, for the run of -- job J in project P of host H that started at time YYYY-MM-DD HH:MM:SS. SELECT X.ProjectName, X.JobName, U.RunElapsedSecs, U.TotalRowsConsumed, U.TotalRowsProduced FROM DSODB.JOBRUN AS R JOIN DSODB.JOBRUNTOTALROWSUSAGE AS U ON R.RUNID = U.RUNID JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID WHERE H.HostName = "H" AND X.JobName = "J" AND X.ProjectName = '"P" AND R.RunStartTimeStamp = "YYYY-MM-DD HH:MM:SS" ORDER BY U.RunElapsedSecs

Tables related to stage and link monitoring

These tables are only populated if the DSODBConfig.cfg file has the propertyMonitorLinks=1 (by default, this is set to 0). At the end of each run of a parallel orserver job, the system collects information about the overall CPU used by eachstage, and the total number of rows processed on each link that is a source or targetof the job (Figure 6). This information is aggregated into the JOBRUN table, and therow count aggregates are also added periodically to the JOBRUNUSAGE table, forwhich see the previous section.

Page 12: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 12 of 24

Figure 6. Key columns and relationships between tables related toJOBRUNSTAGE and JOBRUNLINK

The JOBSTAGE tableA row appears in this table for each stage to be reported on for a job run. It is createdif it does not exist and relates back to the JOBEXEC entry for the current run viaits JOBID foreign key. The entry contains static information about the stage, suchas name, description and stage type. Note that since stage names are only uniquewithin the top level of the job, or a container if they are at a lower level, there is aContainerPath column that might need to be used to fully identify the stage as well asthe StageName column. Each row is given a surrogate key so it can be joined to itslinks.

The JOBLINK tableThis contains rows for each link determined to be a source or target link for an overalljob that has been run. It uses foreign key FROMSTAGEID and TOSTAGEID topoint back to the row in JOBSTAGE table that describes the stage it connects to

Page 13: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 13 of 24

in its role as a source or target. This depends on the setting of the IsSource andIsTarget columns; in general, only one is set and only one of the foreign keys iscorrespondingly related.

The LinkName is unique, in combination with FROMSTAGEID/TOSTAGEID, since astage cannot have two links with the same name.

The JOBRUNSTAGE table

Every run that is monitored in this mode generates one row per stage, to record,among other things, the number of CPU milliseconds that can be ascribed to thestage. A parallel job might be running each stage in a partitioned mode, whichinvolves more than one process. Therefore, the NumInstances column might begreater than 1, and the InstanceCPUList might be a comma-separated string ofnumbers rather than a single value. But it is easier to select the TotalCPU column,which gives the overall value by summing entries in the list, if necessary.

Listing 11. SQL Example: Querying the JOBRUNSTAGE table

-- List the stages of jobs run on host H in descending order of CPU used. SELECT X.ProjectName, X.JobName, R.RunStartTimestamp, JS.ContainerPath, JS.StageName, RS.TotalCPU FROM DSODB.JOBRUN AS R JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID JOIN DSODB.JOBSTAGE AS JS ON X.JOBID = JS.JOBID JOIN DSODB.JOBRUNSTAGE AS RS ON JS.STAGEID = RS.STAGEID WHERE H.HostName = "H" ORDER BY RS.TotalCPU DESC

The JOBRUNLINK table

Every run of a job that has source or target links also generates one row persuch link. This records the number of rows that passed down the link (these rowscontribute to the "consumed" or "produced" total row counts if the link is marked as asource or target, respectively). As for stages, parallelization is recorded via a comma-separated list of row counts per partition, where relevant, and the TotalRows columnis always the sum over all partitions.

Page 14: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 14 of 24

Listing 12. SQL Example: Querying the JOBRUNLINK table -- List the links, and the stages to which they are attached, of jobs -- run on host H in descending order of number of rows produced. SELECT X.ProjectName, X.JobName, R.RunStartTimestamp, JS.ContainerPath, JS.StageName, JL.LinkName, RL.TotalRows FROM DSODB.JOBRUN AS R JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID JOIN DSODB.JOBSTAGE AS JS ON X.JOBID = JS.JOBID JOIN DSODB.JOBRUNSTAGE AS RS ON JS.STAGEID = RS.STAGEID JOIN DSODB.JOBLINK AS JL ON (JS.STAGEID = JL.FROMSTAGEID OR JS.STAGEID = JL.TOSTAGEID) JOIN DSODB.JOBRUNLINK AS RL ON R.RUNID = RL.RUNID WHERE H.HostName = "H" AND JL.IsTarget = 1 ORDER BY RL.TotalRows DESC

The DATALOCATOR tableA link can also record "locator" information. This involves providing eight stringsthat identify the location of an external data resource. So for a sequential file stage,these strings should identify a file-system path name; for a database table, thisshould be a database type, name, schema, and table name. Exact details dependon the stage type. For each unique set of identifiers, there is a separate row in theDATALOCATOR table, with a surrogate key in the LOCATORID column. Each linkthat has such information attempts to see if a row with the same identifiers has beencreated and creates it if not. The LOCATORID column in the JOBRUNLINK table isthen a foreign key to that row.

Table 1. The DATALOCATOR columns are used as followsColumn name Usage

ComputerName Name of system that hosts the resource

SoftwareProductName Name of product that manages the resource

DataStoreSubClass Overall type of the data store

DataStoreName Overall name of database or path to top level of resource

DataSchemaSubClass Type of next level of resource if applicable

DataSchemaName Name of next level of resource if applicable

DataCollectionSubClass Type of lowest level of resource

DataCollectionName Name of lowest level of resource, such as table or file

Listing 13. SQL Example: Querying the DATALOCATOR table -- List the links of all runs that have written to a database named D, -- what the table names were and how many rows were written. SELECT X.ProjectName, X.JobName, R.RunStartTimestamp, JS.ContainerPath, JS.StageName, JL.LinkName, DL.DataCollectionName AS TableName, RL.TotalRows FROM DSODB.JOBRUN AS R JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID

Page 15: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 15 of 24

JOIN DSODB.JOBSTAGE AS JS ON X.JOBID = JS.JOBID JOIN DSODB.JOBRUNSTAGE AS RS ON JS.STAGEID = RS.STAGEID JOIN DSODB.JOBLINK AS JL ON (JS.STAGEID = JL.FROMSTAGEID OR JS.STAGEID = JL.TOSTAGEID) JOIN DSODB.JOBRUNLINK AS RL ON R.RUNID = RL.RUNID JOIN DSODB.DATALOCATOR AS DL ON RL.LOCATORID = DL.LOCATORID WHERE H.HostName = "H" AND JL.IsTarget = 1 AND DL.DataStoreName = "D" ORDER BY TableName, RL.TotalRows DESC, R.RunStartTimestamp

Table for host system descriptionsNote that rows are only recorded in the HOSTDETAIL table (Figure 7) when theengine is configured to run resource monitoring; that is, ResourceMonitor is not set to0 in DSODBConfig.cfg.

Figure 7. Key columns and relationships between the HOST and HOSTDETAILtables

The HOSTDETAIL tableEvery row in the HOST table should have at least one row from the HOSTDETAILtable related to it. These rows contain information about the operating systemproperties that were in force at some time when an instance of the ResMonAppprocess started. The CreatedTimestamp column is the UTC time when the particularset of properties was first inserted. The LastCheckedTimestamp is the UTC timewhen an instance of ResMonApp last determined that the properties of that particularsystem had not changed since the previous check (i.e., none of the values in any ofthe other columns needed to be changed). If any column would require a differentvalue, a new row is inserted, rather than the exiting row being updated. The new rowhas its CreatedTimestamp and LastCheckedTimestamp set to "now." The rows forma timeline that can be used to track changes to the operating system properties of aparticular node.

The entries in HOSTDETAIL are related to the HOST table by two foreign keys.HOSTID is the primary key of the HOST row that identifies which host name thisset of detail records describes. HEAD_HOSTID is the primary key of the HOST rowthat identifies the host on which the instance of ResMonApp that inserted the detailrecord is running. In the case of a conductor node, HOSTID and HEAD_HOSTID arethe same; for a remote node, they are different. In the latter case, the detail recordis recording the properties of a remote node as seen from one particular conductor

Page 16: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 16 of 24

node. Note that the unique primary key for this table is the combination of HOSTID,HEAD_HOSTID, and CreatedTimestamp, since any change to columns other thanLastCheckedTimestamp results in a new row being inserted, rather than this rowbeing updated.

Listing 14. SQL Example: Querying the HOSTDETAIL table 1

-- List operating system name and version for host H as they were at time -- YYYY-MM-DD HH:MM:SS (this means finding the row whose CreatedTimestamp -- is nearest to that time). SELECT Max(HD.CreatedTimestamp) AS "Last Changed", HD.PlatformName, HD.PlatformVersion FROM DSODB.HOSTDETAIL AS HD JOIN DSODB.HOST AS H ON HD.HEAD_HOSTID = H.HOSTID WHERE H.HostName = "H" AND HD.CreatedTimestamp <= "YYYY-MM-DD HH:MM:SS" GROUP BY HD.PlatformName, HD.PlatformVersion

Listing 15. SQL Example: Querying the HOSTDETAIL table 2

-- List most recent operating system details for each node of an engine -- monitored from host H. SELECT H2.HostName AS "Node", Max(HD.CreatedTimestamp) AS "Last Changed", HD.PlatformName, HD.PlatformVersion FROM DSODB.HOSTDETAIL AS HD JOIN DSODB.HOST AS H ON HD.HEAD_HOSTID = H.HOSTID JOIN DSODB.HOST AS H2 ON HD.HOSTID = H2.HOSTID WHERE H.HostName = "H" GROUP BY H2.HostName, HD.PlatformName, HD.PlatformVersion

Tables related to system resource usage

These tables hold information about how much CPU, memory, or other systemresource is in use at particular times (Figure 8). Note that rows are only recordedin the RESOURCESNAP and RESOURCEUSAGE tables when the engine isconfigured to run resource monitoring (i.e., ResourceMonitor is not set to 0 inDSODBConfig.cfg).

Page 17: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 17 of 24

Figure 8. RESOURCESNAP and RESOURCEUSAGE tables; key columns andrelated views

The RESOURCESNAP tableThis table only contains one row per host that is being monitored from a particularengine. Its HOSTID column is a foreign key to an entry in the HOST table toidentify which system's resources are described. The HEAD_HOSTID columnin RESOURCESNAP is also defined as a foreign key that identifies whichengine system inserted the row. In the case of a conductor node, HOSTID, andHEAD_HOSTID are the same; for a remote node, they are different.

On each engine (corresponding to a HEAD_HOSTID value), for each system itis monitoring including itself (the HOSTID values), a row is updated at intervalscontaining all the system resource usage information being collected for that system.The LastUpdateTimestamp column tells you when the last update took place.

The content of RESOURCESNAP should always be queried through either of theviews RESOURCESNAPSYSTEM or RESOURCESNAPDISKS.

For the full set of columns available via these views, please see the schemadocument reference in the Resources section.

The RESOURCESNAPSYSTEM viewThis view presents several columns that hold various values for CPU, free memory,process counts, and paging. The names of the columns are used to group themcounters. So CPUPctxxxx refers to percentage of CPU used; MemFreeKBxxxxrefers to free memory in kilobytes; ProcNumxxxx refers to number of processes;PageNumxxxx refers to number of paging events.

Page 18: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 18 of 24

For example, CPUPctUser is the percentage of the CPU assigned to user processesin the interval leading up to LastUpdateTimestamp; MemFreeKBPhysical is thenumber of kilobytes of physical memory that were free in that period.

The RESOURCESNAPDISKS view

This view expands a repeating field in RESOURCESNAP, which is controlled bythe number of disk paths, if any, that have been entered in the DSODBConfig.cfgconfiguration file (properties ResourceLocalFS and ResourceRemoteFS, which canbe repeated). The DiskPathMonitored column identifies each file system path, andthe DiskTotalKB and DiskFreeKB columns give the total number of kilobytes andnumber currently unused for the disk mounted on that path.

The RESOURCEUSAGE table

This contains counters derived from those in the RESOURCESNAP table,aggregated over intervals and arranged as a timeline. As an entry inRESOURCESNAP is updated, its values are used to calculate the latest maximum,minimum, and average numbers over the most recent time period, then at intervalsa row is inserted to describe the behavior over that period. Using the defaultconfiguration settings the RESOURCESNAP table is updated every 10 seconds,and every 60 seconds, a new row is inserted in RESOURCEUSAGE containing themaximum, minimum, and averages of the last six updates for that combination ofHOSTID and HEAD_HOSTID. The StartTimestamp column of this table can be usedto sort the entries into time order.

The RESOURCEUSAGESYSTEM view

For each column of RESOURCESNAP that holds system counters, there are threecolumns, with the suffixes Avg, Max, and Min. So CPUPctUserAvg is the averagevalue derived form the CPUPctUser column, and MemFreeKBPhysicalMin is theminimum value of the MemFreeKBPhysical column over the period as recordedbetween StartTimestamp and EndTimestamp.

Listing 16. SQL Example: Querying the RESOURCEUSAGESYSTEM view -- List average user CPU percentage usage for engine system H -- between times YYYY-MM-D1 HH:MM:SS and YYYY-MM-D2 HH:MM:SS. SELECT RS.StartTimestamp, RS.CPUPctUserAvg FROM DSODB.RESOURCEUSAGESYSTEM AS RS JOIN DSODB.HOST AS H ON RS.HEAD_HOSTID = H.HOSTID WHERE H.HostName= "H" AND RS.StartTimeStamp >= "YYYY-MM-D1 HH:MM:SS" AND RS.EndTimeStamp <= "YYYY-MM-D2 HH:MM:SS"

The RESOURCEUSAGEDISKS view

This allows you to query the average, minimum, and maximum number of freekilobytes for each monitored disk path over each monitored period.

Page 19: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 19 of 24

Listing 17. SQL Example: Querying the RESOURCEUSAGEDISKS view

-- List any periods where the disk free KB for any paths -- for host N as monitored from host H fell below 1 MB. SELECT RD.DiskPathMonitored, RD.StartTimestamp, RD.EndTimestamp, RD.DiskFreeKBMin FROM DSODB.RESOURCEUSAGEDISKS AS RD JOIN DSODB.HOST AS H ON RD.HEAD_HOSTID = H.HOSTID JOIN DSODB.HOST AS H2 ON RD.HOSTID = H2.HOSTID WHERE H.HostName = "H" AND H2.HostName = "N" AND RD.DiskFreeKBMin < 1024

Using the lookup tables to expand codesThere are a number of tables with columns that hold enumerated code values,designed as three-character strings that are all uppercase ASCII and vaguelymnemonic (Figure 9). These can be expanded into a more readable form if requiredby using the MASTERREF table and the views built from it.

Figure 9. MASTERREF table; key columns and related views

The MASTERREF table

This contains a row for each distinct value of each enumerated type. Thecombination of the Enumeration and Code columns forms the primary key that canbe used by the following views to look up a specific code value. The Name andDescription columns then give you correspondingly longer strings that can be used inreports instead of the code value itself.

The xxxxREF Views

Each view corresponds to a table and column that contains a code, as shown inTable 2.

Page 20: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 20 of 24

Table 2. Reference views and the columns they describe

View name Table Column

RUNMAJORSTATUSREF JOBRUN RunMajorStatus

RUNMINORSTATUSREF JOBRUN RunMinorStatus

RUNTYPEREF JOBRUN RunType

JOBTYPEREF JOBEXEC JobType

LOGTYPEREF JOBRUNLOG LogType

STAGESTATUSREF JOBRUNSTAGE StageStatus

LINKTYPEREF JOBLINK LinkType

For example, if a row of the JOBRUN table contains "FIN" in the RunMajorStatuscolumn, this can be looked up via the RUNMAJORSTATUSREF view as Name ="Finished" and Description = "Run has finished". A RunMinorStatus of "FWF"can be looked up via RUNMINORSTATUSREF as Name = "Finished aborted" andDescription = "Run has finished and logged at least one fatal message".

Listing 18. SQL Example: Querying the JOBRUN table using lookups againstreference views -- List the job names , job types, run types, and minor status names -- of all runs on host H that have finished. SELECT X.ProjectName, X.JobName, JT.JobTypeName, RT.RunTypeName, R.RunStartTimestamp, R.RunEndTimestamp, SR.MinorStatusName FROM DSODB.JOBRUN AS R JOIN DSODB.JOBEXEC AS X ON R.JOBID = X.JOBID JOIN DSODB.HOST AS H ON X.HOSTID = H.HOSTID JOIN DSODB.JOBTYPEREF AS JT ON X.JobType = JT.JobTypeCode JOIN DSODB.RUNTYPEREF AS RT ON R.RunType = RT.RunTypeCode JOIN DSODB.RUNMINORSTATUSREF AS SR ON R.RunMinorStatus = SR.MinorStatusCode WHERE H.HostName = "H" AND R.RunMajorStatus = "FIN" ORDER BY R.RunStartTimestamp

ConclusionWe have examined how the tables that make up the Operations Database schemacontain rows that describe job runs, host systems, and system resource usage.We have described the identifying key fields for the main tables, some of the otherimportant fields, and how foreign key fields are used to join other tables to them.We have given several examples of real-world questions that can be answered byappropriate use of these fields.

AcknowledgementsThanks go to the members of the DataStage Operations Console developmentteam who ensured that the schema got ported to the various flavors of database

Page 21: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 21 of 24

supported: DB2® LUW 9.5 and 9.7, Oracle 10g and 11gR2, Microsoft® SQL Server2005 and 2008, on the various Windows® and UNIX® platforms.

Thanks also to the QA team who patiently checked that data could indeed beinserted and retrieved correctly from all those variants.

Page 22: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 22 of 24

ResourcesLearn

• The operations database schema is documented on the IBM InfoCenter.• IBM InfoSphere Information Server 8.7 What's New whitepaper, containing

details of the Operations Console.• IBM DB2 9.7 for Linux, UNIX, and Windows Information Center and SQL

reference.• Learn more about Information Management at the developerWorks Information

Management zone. Find technical documentation, how-to articles, education,downloads, product information, and more.

• Stay current with developerWorks technical events and webcasts.• Follow developerWorks on Twitter.

Get products and technologies

• Build your next development project with IBM trial software, available fordownload directly from developerWorks.

• Now you can use DB2 for free. Download DB2 Express-C, a no-charge versionof DB2 Express Edition for the community that offers the same core datafeatures as DB2 Express Edition and provides a solid base to build and deployapplications.

Discuss

• Check out the developerWorks blogs and get involved in the developerWorkscommunity.

Page 23: Schema Thatdefines the Operations Database

ibm.com/developerWorks/ developerWorks®

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 23 of 24

About the authors

Len Greenwood

Len Greenwood was a member of the small development team thatproduced the first version of DataStage in 1996, prior to it beingacquired from Ascential Software by IBM in 2005. It now forms amainstay of the IBM InfoSphere Information Server suite. He hasworked in the related areas of data and metadata integration for thepast 15 years and is currently the main product architect for the corecomponents of the DataStage and QualityStage development andproduction tools. He recently designed the database schema thatunderlies the Information Server Operations Console, used to monitoractivity at the DataStage engine level.

Arron Harden

Arron Harden is a senior software engineer for IBM InfoSphereDataStage and QualityStage. Staying with the DataStage product afterseveral mergers and acquisitions, he has worked on DataStage formore than 12 years, joining IBM through the acquisition of AscentialSoftware Inc in 2005. Having spent a year working in Boston, he iscurrently based in the United Kingdom, working at the IBM MiltonKeynes office. In his most recent role, he was the lead developer forthe web application component of the DataStage and QualityStageOperations Console, written using the Dojo toolkit.

Geoff McClean

Geoff McClean was on the original DataStage development teamat its inception and is currently a senior software developer for corecomponents of the IBM InfoSphere DataStage and QualityStagedevelopment and production tools, part of the IBM InfoSphereInformation Server suite. He oversaw the implementation of thedatabase management, event handling, and resource tracking servicesof the IBM InfoSphere DataStage and QualityStage Operations Console.

Page 24: Schema Thatdefines the Operations Database

developerWorks® ibm.com/developerWorks/

Get started with the IBM InfoSphere DataStage andQualityStage Operations Console Database, Part 1: Anintroduction

Page 24 of 24

Sumit Kumar

Sumit Kumar has 13 years of industry experience, including thefinancial, banking, telecom, supply-chain management, insurance, andhealthcare domains. He has worked with IBM InfoSphere InformationServer from early 2010, including the Operations Console feature ofInformation Server, from the beginning of its design and developmentphase as a key contributor. He was involved in the implementation ofmost of the public APIs and service layers of the Operations Console.

© Copyright IBM Corporation 2012(www.ibm.com/legal/copytrade.shtml)Trademarks(www.ibm.com/developerworks/ibm/trademarks/)