Autsys Diagnostic

  • 8/6/2019 Autsys Diagnostic

    Troubleshooting Tips for Unicenter AutoSys 4.5.x

    This document provides tips and references for troubleshooting your Unicenter

    AutoSys 4.5.x implementation.

    Additional Resources

    The Unicenter AutoSys User Guide provides helpful troubleshooting
    information, particularly in Chapter 14 "Troubleshooting," Appendix B
    "Troubleshooting CCI," and Appendix C "General Debugging." You should also
    consult the following links on the CA Support Online website:

    Techdoc Index for AutoSys
    http://supportconnectw.ca.com/public/autosys/infodocs/autosys-menu.asp

    Notable troubleshooting topics include the following:

    TEC424427: I am getting the following message at my command prompt:
    Unrecognized job type

    TEC411871: AutoSys 4.5.1 will not execute commands with a recent 12.5.1
    Sybase Client installed on Remote Agents. When running command
    (autorep -G all) on a Remote Agent you get following message; This
    application failed to start because libsybdb.dll not found.

    TEC411389: Error message received when trying to save a new job:
    "Unexpected Response from Listener with ATSYS 4.5/ UWCC

    TEC406557: Trying to start the eventor processor by running
    $AUTOSYS/bin/eventor. The event_demon fails to start and no real
    explanation or error is displayed as to why.

    TEC405710: eTrust Access Control Locks out the Telnet in Unicenter
    AutoSys 4.5

    TEC428525: Running "autoping" from the event processor machine to a newly
    installed Windows remote agent fails stating "read failed win socket
    error =10054". Attempts to telnet from the EP machine to the Remote Agent
    machine also fail.

    Implementation CD AutoSys section
    http://supportconnectw.ca.com/premium/unicenter/implementationcd/AutoSys/Autosys_Frame.htm

    Note that, although this link contains older information, the basic AutoSys

    tips still apply.

    Troubleshooting Steps

    In troubleshooting your Unicenter AutoSys implementation (or any product
    implementation for that matter), take care to ensure that your approach is
    consistent, repeatable, inquisitive, and well documented. In general, you
    will need to:

    1. Define the problem
       Clearly state what happened that should not have happened, or what did
       not happen that should have happened. Be sure to note the scope of the
       problem, including the date/time of occurrence and the affected
       jobs/machines/users/network, as well as any preceding jobs or recent
       activity on the machines involved in the transaction.

    2. Identify versions/patch level
       Do this for all affected components and note any security/firewall
       settings that may be in effect.

    3. Confirm communication
       Verify that the affected machines can talk to one another to determine
       whether the fault lies with a network/firewall/permissions error.

    4. Execute the job manually
       Verify that the job syntax is correct. This helps determine whether the
       fault lies with the scheduling system or with the job itself.

    5. Check logs and system date/clock
       Check to see what happened on the affected machine(s) and ensure that
       the system date/clock are correct, especially when using date/time
       related job parameters such as start_times and RunWindow.

    Use the autosyslog command to view either the event processor log file or
    the Remote Agent log file for a specified job. Both the Remote Agent and
    the event processor write diagnostic messages to their respective logs, as
    part of their normal operations and in response to detected error
    conditions.

    The syntax for this command is:

    autosyslog [-e | -J job_name] [-p]

    The event processor logs all events it processes and provides a detailed
    trace of its activities. The Remote Agent's log shows the specified job's
    most recent run. Although the Remote Agent's log file is deleted by default
    after a successful job run, it is not deleted at job completion if the job
    ended with a FAILURE status.

    The event processor log also contains a timestamped history of each event
    that occurs. Viewing this log is an alternative to monitoring all jobs and
    all events.

    For more information on autosyslog, consult the Unicenter AutoSys

    Reference Guide for Windows and Unix.

    6. Document the solution
       Once the solution has been applied, take steps to prevent a repeat.
       Typically, this involves education, development/documentation of
       standardized processes and conventions (e.g., naming conventions), or
       application of necessary patches (and establishing a process whereby
       this is not allowed to lapse).
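
Step 3 (confirm communication) can be partly scripted. The following is a minimal Python sketch, not an AutoSys utility: it only tests whether a TCP connection can be opened from the current machine to a port on another machine. The function name can_connect is hypothetical, and the host and port you pass in are assumptions that must come from your own Remote Agent configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A False result points toward the network, firewall, or name resolution rather than the job itself. A True result shows only that the port is reachable, not that the agent behind it is healthy.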

    Prevention

    Unicenter AutoSys provides a highly flexible, very customizable scheduling
    tool for managing your workload environment; however, you should be very
    aware of the implications and restrictions inherent in any of the
    scheduling options before employing them. For example, if a predecessor job
    regularly exceeds its term_run_time, it can cause a ripple effect felt
    through the full chain of its dependent jobs.

    Ensure that you (and anyone else who will be scheduling jobs through

    AutoSys) understand the architecture (e.g., components and their

    relationships, firewall requirements, job submission authorizations, etc.) and

    follow agreed upon standards for defining and submitting jobs including file

    naming conventions and calendars.

    Note: Naming conventions for jobs, calendars, variables and views should be

    clearly established as part of the initial architecture and consistently enforced

    throughout the implementation.

    Job Related Errors

    In some cases, a job's failure to execute has more to do with the job
    itself than with the scheduling system. Therefore, one of your first
    troubleshooting steps should be to verify the validity of the job,
    including its syntax and access to required resources.

    If the job failed because the command being executed by the job returned an
    error, run autorep -J jobname -d and investigate why the job abended:

    In the example above, the command executed by the Job returned an exit
    code of 1 upon completion (see Pri/Xit column). Notice that AutoSys
    attempted to run the Job twice (as seen in the Ntry column, which notes the
    number of restart attempts). At first, the job failed because the Remote
    Agent was not running (Connect to socket FAILED). However, that was
    corrected and AutoSys resubmitted the Job, which failed again for the same
    reason.

    Make sure that the correct syntax is provided to enable the command,

    executable, UNIX shell script, application, or batch file (and its parameters) to

    run on the Remote Agent Client (when all necessary conditions are met). Keep

    the following in mind when using the command attribute in Job definitions:

    You cannot redirect standard input, output, and error files in the command
    attribute. Use other job attributes, such as std_in_file for standard
    input, to provide the necessary functionality.

    Environment variables for the command are defined by a default profile or
    the profile specified in the job definition.

    Although system environment variables are automatically set in the
    command's environment, user environment variables are not. You must define
    all other required environment variables in the job's profile.

    If a command works properly when issued at the command prompt, but fails to
    run properly when specified as a command attribute, the necessary
    user-defined environment variables and the variables defined in the job
    profile are probably different. If this is the case, compare the variables
    to verify that all required user environment variables are defined in the
    job profile. Information on how to do this can be found in the User Guides.
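
The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the procedure from the User Guides: the function name compare_environments is hypothetical, and it assumes you have already collected both sets of variables into dictionaries (for example, dict(os.environ) for the interactive session, and the job profile's variables by whatever means your installation provides).

```python
from typing import Dict, List, Tuple


def compare_environments(
    shell_env: Dict[str, str], profile_env: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, different) variable names, each sorted alphabetically.

    missing:   set in the interactive shell but absent from the job profile.
    different: present in both, but with different values.
    """
    missing = sorted(name for name in shell_env if name not in profile_env)
    different = sorted(
        name
        for name in shell_env
        if name in profile_env and shell_env[name] != profile_env[name]
    )
    return missing, different
```

Names are compared exactly as given. On Windows, where variable names are not case-sensitive, you may want to upper-case the keys of both dictionaries before calling the function.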

    When specifying drive letters in job definitions, you must either escape
    the colon character with a backslash or enclose the path in quotation
    marks. For example, C\:\tmp or "C:\tmp" is valid; C:\tmp is not.

    Job Runs on Command Prompt but not through AutoSys

    If a command runs manually at a Windows command prompt but fails with
    "job returned = -1" when run via AutoSys, check whether the system PATH
    variable contains spaces in the path to the command. Place the command's
    bin location at the beginning of the PATH variable in the Administrator
    GUI "System" Environment Variables, either surrounded by double quotes or
    using "~1" in place of the portion of the path that contains spaces. This
    allows the command to be found.

    Password and Permission Errors (Windows Only)

    Jobs can also fail because the job owner's ID and/or password has not been
    defined to AutoSys security, or because the owner does not have permission
    to start a Job on a Client.

    When an Agent runs a job on a computer, it logs on as the user who owns the
    job. To enable the Agent to do this, the Scheduler passes both the job
    information and an encrypted version of the job owner's password from the
    database to the Agent. You must ensure that the password you provide is
    valid.

    The EDIT Superuser can use autosys_secure interactively or from the

    command line to enter these IDs and/or passwords. After the EDIT Superuser

    enters the IDs and passwords, any user who knows an existing user ID and

    password can change the password or delete that user ID and password.

    In the following example the job could not run because user Autosys or its

    password had not been defined to the AutoSys security.

    To remedy this, first log on as the EDIT Superuser and run autosys_secure:

    Select option 5: Manage AutoSys User@Host users. Then, select 1:

    Create AutoSys User@Host or Domain Password.

    autosys_secure will prompt for credentials. Enter the correct user name,

    host or domain, and password:

    autosys_secure can also be executed fully at the command prompt without

    requiring interaction.

    Scheduling Problems in the Job Definition

    If you include scheduling options, such as max_run_alarm, term_run_time or

    run_window, it is critical that you understand how these parameters work and

    how long the job typically takes to run, particularly when there are many

    dependencies spanning multiple platforms and machines.

    If a job's starting conditions have not been met, run the AutoSys command
    job_depends -J jobname -d to see why it could not start at its start time:

    For example:

    In the example above, the job's starting conditions had not been met
    because it can only run if its predecessor returns a 0 (exitcode=0).
    However, since the predecessor job was still running (and, therefore, had
    not yet returned a 0) when the job's date condition was met, it could not
    start. To avoid this type of problem, make sure that the job's start_times
    attribute is set appropriately.

    In the following example, the output of the job_depends -J jobname -d
    command shows that the job's starting conditions have not been met because
    it can only run if its predecessor runs successfully. Since its predecessor
    failed, it cannot be started.

    Maximum and Minimum Run Time Errors

    If the job failed because it exceeded its maximum run time (specified
    through term_run_time), the job is taking longer than the specified time to
    finish, which might indicate that the job is stuck in a loop or is waiting
    for additional data. Therefore, you should:

    Make sure that the job is not stuck in a loop or waiting for data that has
    never arrived.

    Also, make sure that the maximum run time threshold is adequate.

    Note: Keep in mind that if you used the max_run_alarm attribute, exceeding
    the limit will send an alarm; it will not cause the job to terminate.

    Conversely, a job might also fail to meet its minimum run time, finishing
    sooner than expected, which could also indicate that it is not running
    properly. In this case you should:

    Make sure that the job is getting all the data it needs to run properly.

    Verify that the minimum run time threshold is adequate.

    Missed Run Window

    The run_window attribute controls only when the job starts, not when it
    stops. If a job definition contains the run_window attribute, once the job
    becomes eligible to run (based on its starting conditions), Unicenter
    AutoSys JM verifies whether the specified run window includes the current
    time. If it does, the job starts. If it does not, the product determines
    whether to run the job based on the end of the previous run window and the
    beginning of the next run window.
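
The first check described above, whether the run window includes the current time, can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the function name in_run_window is hypothetical, the end of the window is treated as exclusive, and a window whose end is earlier than its start is assumed to cross midnight. The follow-up decision based on the previous and next windows is not modeled.

```python
from datetime import time


def in_run_window(now: time, start: time, end: time) -> bool:
    """Return True if now falls inside the window running from start up to end."""
    if start <= end:
        return start <= now < end
    # The window crosses midnight, for example 23:00 to 02:00.
    return now >= start or now < end
```

For example, in_run_window(time(23, 30), time(23, 0), time(2, 0)) is True, while the same window evaluated at noon is False.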

    To see what happened, execute the following command:

    autorep -J jobname -d

    For example:

    The run_window attribute is not, in itself, a starting condition; it is an
    additional control over when a job may start after its starting conditions
    are satisfied. This attribute is especially useful, for example, when you
    do not know when a monitored file may arrive and there are specific times
    when a job dependent on the monitored file should not run.

    Therefore, make sure that the job's condition (starting conditions),
    date_conditions (date/time conditions), and run_window attributes are all
    set appropriately (for example, a run window cannot span more than 24
    hours). Then, if the job is on hold, make sure to run
    sendevent -E JOB_OFF_HOLD -J jobname before the end of the run window.

    You should also consider the availability of resources required by the job. For

    example, notice that the Job below is queued and that it has a short run

    window.

    This job may not start before the end of the run window because its load

    (job_load attribute) added to the load of the running job may exceed the

    max_load attribute of the machine they run on. In fact, that is exactly what

    happened in the example above:

    Here you can see that the job did not run because there were not enough

    resources available before the end of its run window.
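
The load comparison behind this behavior can be written as a one-line check. The Python sketch below is illustrative only: the function name can_start is hypothetical, and it assumes that a job may start whenever its job_load plus the load already running does not exceed the machine's max_load.

```python
def can_start(job_load: int, running_load: int, max_load: int) -> bool:
    """Return True if adding job_load to the load already running stays within max_load."""
    return running_load + job_load <= max_load
```

With a max_load of 100 and 50 units already running, a job with a job_load of 60 would stay queued, while one with a job_load of 30 could start.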

    Retries Limit

    When a job exceeds the maximum number of retries specified by n_retrys in
    the job definition or Max Restart Trys in the instance configuration, it
    exits with a failure status. The n_retrys attribute applies to application
    failures, for example, where AutoSys is unable to find a file or command,
    or where permissions were not properly set. It does not apply to system or
    network failures, such as when a computer is unavailable or a socket
    connection has timed out. Job restarts after system or network failures are
    controlled by the Max Restart Trys setting on the Unicenter AutoSys JM
    Administrator Scheduler window.

    The delay between restarts is determined by the Restart Constant and
    Restart Factor settings on the Unicenter AutoSys JM Administrator
    Scheduler window, and is capped at the maximum specified by the Max
    Restart Wait setting.

    The following formula is used to calculate wait time:

    Wait Time = Restart Constant + (Number of Tries * Restart Factor)

    If Wait Time > Max Restart Wait, then Wait Time = Max Restart Wait.
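
A minimal Python sketch of the wait-time calculation follows. The function and parameter names are hypothetical. The parameter tries stands for the try count that is multiplied by Restart Factor; treating it as the number of restarts attempted so far is an assumption, and all values are assumed to share the same time unit.

```python
def restart_wait_time(
    tries: int, restart_constant: int, restart_factor: int, max_restart_wait: int
) -> int:
    """Return the delay before the next restart, capped at max_restart_wait."""
    wait = restart_constant + tries * restart_factor
    return min(wait, max_restart_wait)
```

With the illustrative values restart_constant=60, restart_factor=5, and max_restart_wait=300, the first restart waits 65, the third waits 75, and any sufficiently late restart waits the capped 300.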

    If necessary, define the number of times to attempt to restart the job
    after it exits with a failure status. The n_retrys value can be set to any
    integer from 0 to 20 (default: 0, meaning the job will not restart). For
    example:

    n_retrys: 3

    specifies that the job will automatically restart up to three times after
    an application failure. This means that the job would start as scheduled,
    and if it fails, it would restart up to three times for a total of four
    attempts.

    Job Date/Time Conditions Not Met

    Make sure the job is scheduled according to its date/time condition. These
    conditions are specified by the days_of_week, start_times, start_mins, and
    run_calendar attributes. Attempting to start the Job via
    sendevent -E STARTJOB -J jobname -T "MM/DD/YYYY HH:MM" will result in a
    date/time condition failure. The Job report will show:

    In the example above, you will see that the job is scheduled to run on
    08/07/2007 at 21:46 (Job definition). However, it was manually scheduled to
    run on the present date at 22:55. The Event State (ES) is Processed (PD),
    but the Job Status (ST) is Inactive (IN).

    Term Run Time Limit Exceeded

    A Job may terminate either because it exceeded its term_run_time attribute,
    which designates the maximum run time (in minutes) that the job should
    require to finish normally, or because it was killed with a command such as
    sendevent -E KILLJOB -J jobname.

    When a job runs longer than the time specified by the term_run_time
    attribute, it is terminated by Unicenter AutoSys JM.

    Note: Under Windows, processes launched by user applications or batch

    (*.bat) files are not terminated. Unicenter AutoSys JM only terminates the

    CMD.EXE process that it used to launch the job. Otherwise, Unicenter AutoSys

    JM kills the process specified in the command definition. In UNIX, all child

    processes associated with the command process are killed.

    Define the maximum number of minutes the job should ever require to finish
    normally, if necessary. term_run_time can be set to any integer (default:
    0, meaning the job is allowed to run forever). For example:

    term_run_time: 15

    specifies that the job will terminate if it runs for more than 15 minutes.
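
The difference between term_run_time (terminates the job) and max_run_alarm (only raises an alarm) can be summarized in a short Python sketch. It is an illustration, not AutoSys code: the function name runtime_action and its return strings are hypothetical, and treating a max_run_alarm of 0 as "not set" is an assumption made by analogy with the term_run_time default.

```python
def runtime_action(
    elapsed_minutes: float, term_run_time: int = 0, max_run_alarm: int = 0
) -> str:
    """Classify a running job as 'terminate', 'alarm', or 'ok' from its elapsed minutes."""
    # A value of 0 means the limit is not set.
    if term_run_time > 0 and elapsed_minutes > term_run_time:
        return "terminate"
    if max_run_alarm > 0 and elapsed_minutes > max_run_alarm:
        return "alarm"
    return "ok"
```

With term_run_time set to 15, a job that has run for 16 minutes is classified as "terminate"; with neither limit set, the result is always "ok".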

    Unable to Contact Machine

    In some cases, a job does not execute because network problems, such as
    name resolution errors or firewall configurations, prevent the AutoSys
    Scheduler from reaching the Job Management Agent in the first place. Use
    diagnostic tools, such as tracert and pathping, to help determine problems
    such as broken links.

    Tools and Verification Checks

    CCI is used to facilitate communications between components and this is

    particularly critical when cross platform scheduling is in effect. To verify that

    the necessary CCI components are running, execute the following command:

    ccicntrl status

    Here you can see an example of the results:

    Depending upon the exact configuration of the machines in your environment,

    NR-Client may be running instead of NR-Server. Usually NR-Server is installed

    and remote is used for persistent connections. Therefore, at least two CCI

    components must be running: Transport and NR-Server.

    You should also make sure CCI is sending and receiving by using CCIR and

    CCIS utilities. For example:

    If the required CCI components are running and there are no network related

    issues, verify that the Event Management components are running by

    executing the following commands:

    unifstat -c evtd

    unifstat -c evtr

    For example:

    In the example above, the Event Management components, which are essential
    to remediation, have in fact stopped. Restart Event Management by running
    the following command:

    unicntrl start opr

    Diagnostic tools such as tracert and pathping can help determine problems

    such as broken links.
