
Page 1:

© 2007 IBM Corporation

IBM Global Engineering Solutions

IBM Blue Gene/P

Environmentals

Page 2:

IBM Blue Gene/P System Administration

Environmentals
Diagnostics
Service actions
RAS

Page 3:

Flowchart of hardware failure and actions

Application failure → run diagnostics to specify the failing parts (Blue Gene administrators' work) → CE call → service action → parts replacement (IBM CEs' work) → end of service action → return to normal operation.

Page 4:

© 2007 IBM Corporation

IBM Global Engineering Solutions

IBM Blue Gene/P

Diagnostics

Page 5:

What is “Diagnostics” for Blue Gene/P?

A set of hardware diagnostic test programs used to locate hardware failures. Diagnostics are included for:

Memory subsystem
Compute logic
Instruction units
Floating point units
Torus and collective networks
Internal and external chip communication
Global interrupts

Page 6:

Running diagnostics

Launched from the Navigator or from a shell. Diagnostics test cases are designed to test all aspects of the hardware.

Test cases are grouped by run time or by a specific type of hardware into test buckets.

4 test buckets for the Navigator (small, medium, large, complete)

12 test buckets for the command line (small, medium, large, complete, servicecard, nodecard, linkcard, memory, ionode, multinode, power, gi*)  * global interrupts

Diagnostics run time can be controlled by choosing the appropriate test bucket.

Details for each test case can be found in the Redbook "Blue Gene/P System Administration", System Diagnostics section.
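For example, a hedged sketch of running the memory bucket from a shell (the rundiags.py options are covered in detail later; the midplane location is illustrative):

/bgsys/drivers/ppcfloor/bin/rundiags.py --midplanes R00-M0 --buckets memory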

Page 7:

Diagnostics from the Navigator

Diagnostics home page

In the Blue Gene Navigator, the Diagnostics link consists of three tabs:

Summary tab
- Default page for diagnostics
- Summary of the diagnostics results
- Shows status of current diagnostics runs
- "Configure New Diagnostics Run" button to launch diagnostics

Locations tab
- Provides a view of all the hardware that has had diagnostics run on it
- Can be filtered by "Filter Options":
  - Location
  - Hardware status
  - Executed time
  - Hardware replacement status

Completed Runs tab
- History of all completed diagnostics runs
- Can be filtered by "Filter Options":
  - Diagnostics status
  - Run ended time
  - Target midplanes, racks, blocks

Page 8:

Submitting a diagnostics run via Navigator

Click the "Configure New Diagnostics Run" button.

1. Select the midplanes to test. Tests are run separately on midplane blocks. Several blocks are run simultaneously depending on the size of the service node and the number of Ethernet I/O channels involved (i.e. the number of rows).

2. Select either a predefined test bucket or individually select tests to run.

3. Select the Run Options:

Pset ratio override - use this option to specify a custom pSet ratio for the run. This is useful if, for instance, not all I/O nodes are cabled.

Stop on first error encountered - by default the diagnostics do not stop when a failure is detected. This option makes the diagnostics stop once the first failure is found.

Save all output from diagnostics - by default diagnostics do not keep logs for a successful diagnostics test. This option prevents the harness from deleting logs.


4. Click the "Start Diagnostics Run" button to start.

Page 9:

Viewing results via Navigator

1. Click on Completed Runs

2. Each run is displayed on a line.

3. The Log directory hyperlink opens the main diags.log file.

4. The End time hyperlink goes to the summary for that run.


Page 10:

Viewing results via Navigator cont.

1. Each line represents the results for the given midplane location.

2. The log directory link goes to the main diags.log file. It is the same link as in the run summary page.

3. The location link goes to a summary for that midplane location.

4. There is a button to automatically collect the diagnostics logs into an archive for sending to IBM Support.


Page 11:

Viewing results via Navigator cont.

1. Each line represents results for a particular test on this midplane location.

2. Result is the worst status of all the hardware tested in this midplane. This can be Passed, Marginal, Failed, or Unknown.

3. Run result is the status of the run. This can be Uninitialized, Running, Canceled, or Completed.

4. Each line has a set of counts for hardware status as determined by the test.

5. The Summary of failed tests hyperlink lists all test case failures at once.

• Details are listed by location and include a brief analysis of why the result was determined.

• A link to the log for the test is included if applicable.

• A link to the RAS events that occurred between the test’s start and end times for the given location is also included. Most diagnostics are RAS-based, so this list should give good insight into the specifics of the failure. Keep in mind that the harness can deduce a hardware failure without RAS, so RAS events may not always be available.

Page 12:

Running diagnostics via the command line

The diagnostics script to run is /bgsys/drivers/ppcfloor/bin/rundiags.py. The --help option can be used to display the various command line options.

clappi@dd2sys1fen1:/bgsys/drivers/ppcfloor/bin> ./rundiags.py --help
Blue Gene Diagnostics Version 1.6 Build 3 running on Linux 2.6.16.21-0.8-ppc64 ppc64, dd2sys1fen1:9.5.45.45
Blue Gene Diagnostics initializing...

Usage:
rundiags.py --midplanes midplane_list [OPTIONS ... ]
rundiags.py --racks rack_list [OPTIONS ... ]
rundiags.py --midplanes midplane_list --racks rack_list [OPTIONS ... ]
rundiags.py --blocks block_list [OPTIONS ... ]

--midplanes x,y,z... - a list of midplanes on which to run diagnostics (eg. R00-M0,R00-M1,R01-M0)
--racks x,y,z... - a list of racks on which to run diagnostics (eg. R00,R01,R10)
--blocks x,y,z... - a list of blocks on which to run diagnostics

Note: the list of midplanes, racks, or blocks is comma-separated and must not contain any whitespace. Either --midplanes, --racks, or --blocks must be specified. Note: --midplanes and --racks can both be specified together in the same command, but not with --blocks. The --blocks switch must be specified in the absence of the other two.

OPTIONS (default values are displayed inside the [])
--tests x,y,z - run only the specified tests. Additive with --buckets. [Run all tests]
--buckets x,y,z - run only the tests in the specified buckets. Additive with --tests. [Run all tests]
--stoponerror - stop on the first error encountered. [false]
--sn x - name of the service node. [localhost]
--csport x - control system port. [32031]
--csuser x - control system user. [current user ==> clappi]
--mcserverport x - mcServer port. [1206]
--dbport x - DB2 port. [50001]
--dbname x - database name. [bgdb0]
--dbschema x - database schema. [bgpsysdb]
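For example, a hedged sketch of a combined run over two midplanes and a rack, using two of the buckets listed earlier and stopping at the first failure (the locations are illustrative):

./rundiags.py --midplanes R00-M0,R00-M1 --racks R01 --buckets small,memory --stoponerror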

Page 13:

Diagnostics output

The diagnostics will save all output into the diagnostics log directory for the run.

Each run gets a log directory under /bgsys/logs/BGP/diags whose name is based on the run ID. This is where the main diags.log file is stored, e.g. /bgsys/logs/BGP/diags/071002_133909_28920.

There is also a runscript_xxxxx.ksh file stored in the log directory. Running this script re-runs this particular diagnostics run, including all options.

Each test per block gets a subdirectory for its logs. The name of the subdirectory is based on the block ID, test case name, and test start time. For example, bpclbist__DIAGS_R11-M1_144119362.
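For example, a minimal sketch for finding the newest run's log directory and viewing its main log, assuming the default log path above:

ls -t /bgsys/logs/BGP/diags | head -1
less /bgsys/logs/BGP/diags/071002_133909_28920/diags.log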

Page 14:

Diags.log

The diags.log file contains all harness output, including test result summaries, harness error dumps, and exception traces.

Example test summary:

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] svccardenv summary:

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] =================================================

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Run ID: 709281438572281

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Block ID: _DIAGS_R11-M1

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Start time: Sep 28 14:38:59

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] End time: Sep 28 14:39:20

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Testcase: svccardenv

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Passed: 0

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Marginal: 1

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] R11-M1-N01 (SN 42R7201YL10K708607R): Power module R11-M1-N01-P3 continuously failed to share current for the 1.8V domain while the total 1.8V current draw for the node card was above 30.0A. It has failed to share current properly at least 2 times during this diagnostic.

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Failed: 1

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] R11-M1-S (SN 42R7504YL10K711001P): R11-M1-S-P5 indicated driver faults. Driver fault byte is 0xFF. Faults: 0x01 -> forced fault, 0x08 -> over current, 0x10 -> over voltage, 0x40 -> incompatible V_SEL, 0x80 -> general fault.

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] 1 x (INFO:DIAGS:DIAG_COMMON:DIAG_8000) Diagnostic test Service_Card_Environmental passed.

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Unknown: 0

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Hardware status: failed

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Internal test failure: false

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] =================================================

Each line in the log has a timestamp and severity level, as with all other BG/P logs. Following these, the square brackets contain the name of the thread logging the message and the harness verbosity level for the message.
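As a quick hedged sketch, the per-test summaries can be pulled out of a run's diags.log with grep; filtering out the (I) informational lines to spot problems assumes the other severity markers follow the same single-letter convention:

grep "summary:" diags.log
grep -v " (I) " diags.log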

Page 15:

© 2007 IBM Corporation

IBM Global Engineering Solutions

IBM Blue Gene/P

Service Actions

Page 16:

What is a Service Action?

There will be occasions when you need to cycle power or replace hardware on your Blue Gene/P system. Any time service is to be performed on the Blue Gene/P hardware, the hardware must be prepared by generating a Service Action.

Service Actions are generated from either the command line or the Blue Gene Navigator.

Page 17:

Command line Service Action

Actions:

Prepare - Prepare the card for service.
End - End the Service Action and make the card functional.
Close - Force an existing open or prepared Service Action to the closed state.

Service Action commands (located in /bgsys/drivers/ppcfloor/bareMetal/) and their target locations:

ServiceBulkPowerModule - Rxx-B-Px
ServiceFanModule - Rxx-My-Az
ServiceLinkCard - Rxx-My-Lz
ServiceMidplane - Rxx-My
ServiceNodeCard - Rxx-My-Nzz or Rxx-My-N
ServiceRack - Rxx. To end the service action, the CE must reseat one of the BPMs to provide enough power to the master service card. Do not use ServiceRack to power cycle a rack.
ServiceClockCard - Rxx-K. Used to service a clock card or to prepare the rack so that the bulk power breaker can be manually turned off.

Syntax for the Service Action

<ServiceActionCommand> <Location> <Action> <options>
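For example, a hedged sketch of replacing a node card (the location is illustrative): prepare the card, swap the hardware, then end the action.

ServiceNodeCard R00-M0-N04 PREPARE
(replace the node card)
ServiceNodeCard R00-M0-N04 END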

Page 18:

Command line Service Action cont.

Optional parameters

--user <userid> The user name to be specified when connecting to mcServer. If a user is not specified the default will be the user that issued the command.

--msglevel <verbose|debug> Controls the output message level. Verbose gives detailed output messages; Debug includes all of the output from Verbose.

--dbproperties <path> The fully qualified file name for the db.properties file. The default is /bgsys/local/etc/db.properties.

--base <path> The base (install) path to the files used by this command. The default is /bgsys/drivers/ppcfloor.

--log <path> The path to the directory that will contain the log file from this command. The default is /bgsys/logs/BGP.

--help Display help.

Checks are done for conflicting service actions.

A rack service action does not allow you to power cycle the rack.
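As an illustrative sketch combining these options (the location and log path are hypothetical):

ServiceLinkCard R01-M1-L2 PREPARE --msglevel verbose --log /bgsys/logs/BGP/service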

Page 19:

TBGPServiceAction Table (1 of 2)

The SERVICEACTION field contains the current state of the service action.

The STATUS field shows the current status of the service action.

The INFOSERVICEACTION field provides status information related to the service action. It is displayed from the Navigator.

The LOGFILENAMEPREPAREFORSERVICE field contains the fully qualified path name to the log file used when preparing the hardware for service.

The LOGFILENAMEENDSERVICEACTION field contains the fully qualified path name to the log file used when ending the service action.
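A hedged sketch of inspecting these fields directly with DB2, using the default schema shown earlier (bgpsysdb) and the 'P' (prepared for service) status value listed on the next page:

db2 "select serviceaction, status, infoserviceaction from bgpsysdb.tbgpserviceaction where status = 'P'"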

Page 20:

TBGPServiceAction Table (2 of 2)

SERVICEACTION field values:

INITIALIZED
OPEN
PREPARE
END
CLOSED

STATUS field values:

“I” – Initialized
“O” – Open
“P” – Prepared for service
“A” – Actively processing
“E” – Action ended in error
“C” – Closed
“F” – Forced closed
“S” – Service (hw only)

Page 21:

Service Action Logs

Naming format syntax: <service action>-<location>-<timestamp>.log

Example: ServiceNodeCard-R00-M0-N00-2007-10-03-13:32:25.log

Log files are stored in the /bgsys/logs/BGP directory. Log files can be stored in a different directory by using the --log <path> optional parameter on the service action command. Log file names are stored in the database entry for the service action.
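For instance, a minimal sketch for listing all logs for one node card (location illustrative), relying on the naming format above:

ls /bgsys/logs/BGP/ServiceNodeCard-R00-M0-N00-*.log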

Page 22:

Prepare For Service Flow (1 of 3)

Check for conflicting service actions in progress. Ensure that there are no open service actions in progress.

End any jobs and free any blocks that use the hardware being serviced.

Open the service action. A new database entry is created and the Service Action ID is assigned.

Set the Service Action entry to PREPARE state and ACTIVE status.

Send a RAS event indicating that the service action has been started.

Page 23:

Prepare For Service Flow (2 of 3)

Open a target set for the hardware to be serviced. Set the hardware status to SERVICE in the database.

Using the appropriate ReadCardInfo API, obtain the current information for the hardware, primarily VPD.

Perform RAS analysis for hardware that has IBM-supplied VPD. Update the card's VPD if RAS data is available. Update the database VPD and status.

Page 24:

Prepare For Service Flow (3 of 3)

Prepare the hardware for service: power off the hardware (if required), set "intervention required" in the LEDs, and send a RAS event indicating that the hardware is ready to be serviced.

Set the Service Action entry to PREPARE state and PREPARED status.

Perform the required service action.

If a failure occurs: the Service Action entry is set to PREPARE state and ERROR status, and a RAS event is sent indicating that the service action failed.

Page 25:

End Service Action Flow (1 of 4)

Ensure that the service action is PREPARED.

Set the service action to END state and ACTIVE status in the database.

Open a target set for the hardware being serviced. Make the hardware functional.

Initialize the hardware using the appropriate InitializeCards API.

Set the status for the serviced hardware to MISSING in the database.

Page 26:

End Service Action Flow (2 of 4)

Update the database with information for the serviced hardware.

Using the appropriate ReadCardInfo API, obtain the current information for the hardware.

Validate the hardware information: VPD, ECIDs (Nodes and LinkChips), node memory size, memory module size, and voltage.

Update the database entry based on the validation results:
– Set to ERROR status if invalid data was received.
– Set to ACTIVE status if the hardware was not replaced.
– Set to SERVICE status if the hardware was replaced.

Page 27:

End Service Action Flow (3 of 4)

Verify that replaced hardware is functional.

Run select diagnostic tests on replaced hardware (database status is SERVICE):
– Service cards, link cards, node cards, and nodes

Update hardware status based on diagnostic results:
– Success: set hardware status to ACTIVE and send a RAS event indicating that the hardware is functional.
– Failed: set hardware status to ERROR.

Page 28:

End Service Action Flow (4 of 4)

Validate the database configuration. Check serviced hardware for a status of ERROR.

Fail the service action if any hardware is found in ERROR status.

Send a RAS event indicating that the service action is closed.

Close the service action.

If a failure occurs: the Service Action entry is set to END state and ERROR status, and a RAS event is sent indicating that the service action failed.

Page 29:

Close Service Action Flow

A service action can be forced closed if it has one of the following state and status combinations:

OPEN state, OPEN status
PREPARE state, ERROR status
PREPARE state, PREPARED status
END state, ERROR status

A service action with an ACTIVE status will be set to ERROR status if a failure occurs. The state is unchanged.

A RAS event is sent indicating that the service action was forced closed.

Page 30:

Service Action Component Overview

[Diagram: the Service Action components: ServiceClockCard, ServiceRack, ServiceMidplane, ServiceNodeCard, ServiceLinkCard, ServiceBulkPowerModule, and ServiceFanModule, all built on common ServiceAction support.]

Page 31:

ServiceBulkPowerModule

There cannot be multiple Bulk Power Module (BPM) service actions active within a rack.

A service action cannot be started for a BPM if there is more than one failed BPM in the rack (a rack service action is required), unless the BPM to be serviced is one of the failed BPMs.

A service action cannot be started for a BPM if there is an open rack or clock card service action that contains the BPM.

Software will attempt to turn off the BPM being serviced. If power cannot be turned off, the service action will continue without error.

Page 32:

ServiceFanModule

There cannot be multiple fan module (FM) service actions active within a midplane.

A service action cannot be started for an FM if there is more than one failed FM in the midplane (a midplane service action is required), unless the FM to be serviced is one of the failed FMs.

A service action cannot be started for an FM if there is an open midplane service action that contains the FM.

Software does not power off the FM; instead it attempts to flash the intervention LED.

The replacement of an FM does not have any effect on jobs using the midplane (a BG/L difference).

Page 33:

ServiceNodeCard (1 of 2)

Multiple individual node card service actions (Rxx-My-Nzz) are allowed within midplane Rxx-My.

The Rxx-My-N option (used by ServiceNodeCard, a midplane service action, or a rack service action) allows any or all node cards to be serviced within midplane Rxx-My.

A new service action for Rxx-My-Nzz will conflict if there is an existing open service action for Rxx-My-Nzz, Rxx-My-N, Rxx-My, Rxx, or Rxx-K.

Page 34:

ServiceNodeCard (2 of 2)

RAS analysis is done for all service actions that include NodeCards. RAS analysis for Nodes is only done for a node card service action.

Diagnostic testing will be done to ensure that replaced NodeCards and Nodes are functional.

Nodes will be set to ERROR status if the ECID value read from the card does not match the VPD value.

Nodes with differing memory size, memory module sizes, or voltages will be set to ERROR status.

A NodeCard containing Nodes with ERROR status will be set to ERROR status.

Page 35:

ServiceLinkCard (1 of 2)

Multiple individual link card service actions (Rxx-My-Lz) are allowed within a midplane.

The Rxx-My-L option (used by a midplane or a rack service action) allows any or all link cards to be serviced in the midplane Rxx-My.

A new service action for Rxx-My-Lz will conflict if there is an existing open service action for Rxx-My-Lz, Rxx-My, Rxx, or Rxx-K.

It is safe to service a LinkCard without having to perform a service action on the other link cards in the same row or column (BGL difference).

Page 36:

ServiceLinkCard (2 of 2)

The midplane containing the link card being serviced is set to SERVICE status to prevent the job scheduler from running jobs on it.

RAS analysis is done for all service actions that include a LinkCard.

Diagnostic testing will be done to ensure that the replaced LinkCard is functional. This includes cable verification.

LinkChips will be set to ERROR status if the ECID value read from the link chip does not match the VPD value.

A LinkCard containing LinkChips with ERROR status will be set to ERROR status.

Page 37:

ServiceMidplane (1 of 2)

A midplane service action allows any hardware associated with the midplane, with the exception of a bulk power module, to be serviced.

NOTE: Service cards may be serviced (a BG/L difference).

A service action cannot be started for Rxx-My if there are any open service actions within midplane Rxx-My, or if there is an open rack or clock card service action for Rxx or Rxx-K.

A new service action for a node card, link card, fan module, rack, or clock card will conflict with an open midplane service action if that hardware is associated with the midplane being serviced.

Page 38:

ServiceMidplane (2 of 2)

RAS analysis is done for the midplane, service card, all link cards, and all node cards. It is not done for the nodes within a node card.

Diagnostic testing will be done to ensure replaced service card, link cards, node cards, and nodes are functional.

It is safe to replace a service card using a midplane service action since all link cards and node cards are powered off.

Software does not power off the FMs; instead it attempts to flash the intervention LEDs.

The midplane service action uses the ServiceNodeCard Rxx-My-N and ServiceLinkCard Rxx-My-L support to service the node and link cards within the midplane.

Page 39:

ServiceRack (1 of 2)

A rack service action allows all service actions which do not require tools to be performed.

The clock card can only be replaced by a ServiceClockCard service action.

At least one BPM must be plugged in at all times during a rack service action so that 5V persistent power is maintained to the clock card.

The BPMs are powered off via software (the breaker is not flipped). This allows the clock card to remain functional.

To end the service action, the CE must reseat one of the BPMs to provide enough power to the master service card. The remaining BPMs are powered back on via software.

The rack service action uses the ServiceMidplane Rxx-My support to service the service card, node cards, link cards, and fan modules within those midplanes.

Page 40:

ServiceRack (2 of 2)

A service action cannot be started for a rack if there are open service actions within the rack.

A new request to service hardware within the rack will conflict with an open rack service action.

RAS analysis is done for the midplanes, service cards, link cards, and node cards in the rack. It is not done for the nodes within a node card.

Diagnostic testing will be done to ensure replaced service cards, link cards, node cards, and nodes are functional.

REMINDER: Do not use ServiceRack to power cycle a rack.

Page 41:

ServiceClockCard (1 of 3)

ServiceClockCard is used to:
- Prepare the rack so that the bulk power breaker can be manually turned off.
- Service the specified clock card.

Once the breaker has been turned off, any component within the rack can be serviced.

Jobs are stopped and blocks are freed in all midplanes that are downstream from this rack in the clock tree.

The clock tree is defined in the TBGPClockCables table.

Downstream midplanes are set to SERVICE status to prevent the job scheduler from running jobs on them.

The bulk power breaker must be turned on to repower the rack.

Page 42:

ServiceClockCard (2 of 3)

Service action conflict rules are the same as ServiceRack, plus:

A clock card service action cannot be started if there are open service actions in any of the midplanes that are downstream from this rack in the clock tree.

A new service action cannot be started if there is an open clock card service action upstream from the hardware to be serviced in the clock tree.

Page 43:

ServiceClockCard (3 of 3)

RAS analysis is done for the clock card, midplanes, service cards, link cards, and node cards in the rack. It is not done for the nodes within a node card.

Diagnostic testing will be done to ensure replaced service cards, link cards, node cards, and nodes are functional.

The clock card, if replaced, is tested to verify that it is functional. If the clock card was not replaced, it is assumed to be functional.

Set the status of all midplanes downstream from this rack in the clock tree back to ACTIVE.

Page 44:

Service Action from Blue Gene Navigator

Service action home page

Filter Options for the Service Actions history.

Page 45:

Starting and ending a Service Action from the Navigator

1. Select the target hardware type.

2. Select the target hardware location.

3. If there are no jobs affected by the service action, click "Finish" to start the service action.

The new service action will appear in the "Service Actions History". Click the "End Service Actions" button to end the target Service Action.

Page 46:

Cycling power

Prior to shutting off power on a rack you need to properly prepare the rack. Preparing a rack to be powered down requires that you run ServiceClockCard. This process can be completed from either the Navigator or the command line. The command line syntax to start a service action on a clock card is:

$ ServiceClockCard Rxx-K PREPARE

The ServiceClockCard process terminates any jobs that are running on that specific rack and jobs that are running on racks that contain downstream clocks (any clocks that rely on the clock being shut down for a clock signal). Once the ServiceClockCard command completes you will receive a message indicating that the rack is ready for service. At this point you can power down the rack.

To bring the rack back online, move the switch to the On position. Allow the rack sufficient time to power up, usually less than a minute, before ending the Service Action. A good indicator is to watch for the lights on the service cards to start blinking. To end the Service Action from the command line, use the following command:

$ ServiceClockCard Rxx-K END

When powering up a multiple-rack system, be sure to power up the rack with the master clock first, followed by any secondary clocks, then racks that contain tertiary clocks, and finally all other racks. Follow the same scheme when ending the Service Actions.

The bulk power breaker must be turned on to repower the rack.
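Putting this together, a hedged sketch of a full power cycle of a single rack (the rack name is illustrative):

$ ServiceClockCard R00-K PREPARE
(wait for the "ready for service" message, then turn the bulk power breaker off)
(perform service, turn the breaker back on, and allow about a minute for power-up)
$ ServiceClockCard R00-K END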

Page 47:

© 2007 IBM Corporation

IBM Global Engineering Solutions

IBM Blue Gene/P

RAS

Page 48:

BG/P RAS Requirements

Improve the programming of RAS messages:

RAS events should include a message id and a brief description of the problem.

Users should be able to access additional details about an event, including a detailed description of what happened along with the recommended actions to take, if any.

The diagnostics should utilize RAS events to report problems, to simplify problem reporting and reduce the number of logs that users have to care about.

Minimize the time it takes to display a set of RAS events.

Application developers should be able to log events.

Page 49:

[Diagram: Control System overview. The control system comprises MC Server (Machine Controller), MMCS, BareMetal, and off-core diagnostics, with error codes and event logs; card controllers connect it to the compute and I/O nodes, which run CNK, Linux, and on-core diagnostics.]

Page 50:

Linux and CNK

[Diagram: RAS event flow]

1. RAS events can be generated in various components and processes:
1a. The kernel creates a RAS event and uses the mailbox (bgcns) to deliver it; the compute node service funnels RAS events to the control system.
1b. The MC generates RAS events for node, link, and service cards and the chips on the cards.
1c. Diagnostics generate RAS events to report test results.
2. The MC reads and interprets events.
3. The event is sent to the registered listener.
4. The event is sent to registered listeners.
4a. MMCS processes the event.
5a. The formatted message is persisted (error codes and event logs).
6. From the Navigator, query RAS events and run diagnostics.

Page 51:

Integrating RAS, Diagnostics and Service

Diagnostics utilize the control system. This facilitates maintenance and support.

The diagnostics use RAS events to report problem conditions. This improves problem isolation and reduces the number of logs that the service team has to examine.

In addition, service actions will copy a subset of event logs to the card VPD as part of a replacement procedure. This facilitates failure analysis and helps identify failure patterns.

Page 52:

RAS Event Content

RAS events include:

Message id
Severity
Message
Detailed description of the condition and its causes
Recommended service actions
Other data as appropriate (location, cpu, ecid, job id, etc.)

Page 53:

User RAS Events

These are available to 3rd parties working on I/O Node development. A set of RAS events has been reserved for user code (such as Lustre):

USER_0101-010A Reserved for user events with severity ERROR
USER_0201-020A Reserved for user events with severity FATAL
USER_0301-030A Reserved for user events with severity INFO
USER_0401-040A Reserved for user events with severity WARN

Logging a user event from user space - use bgras:
call exec /bin.rd/bgras <comp> <subcomp> <error code> <text>

Logging a user event from kernel space - use the bgcns APIs:
_bgp_cns()->writeRASEvent(facility, unit, err_code, detail_words, details);
_bgp_cns()->writeRASString(facility, unit, err_code, (char *)text);

Users can also update the descriptions in the TBGPErrCodes table:
db2 "update tbgperrcodes set description = 'Your lustre file system is hosed', svcaction = 'Get someone to fix the file system' where msg_id = 'USER_0201'"
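For instance, a hypothetical sketch of logging a FATAL user event from user space; the component and subcomponent arguments shown here are placeholders, not confirmed values:

/bin.rd/bgras USER LUSTRE 0201 "Lustre file system unavailable on this I/O node"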


Page 55:

RAS Policies

MMCS will kill the job and free the block for KERNEL FATAL RAS events. That means:

The component is KERNEL.
The severity is FATAL.
The failing node is included in the block.
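As a final illustrative sketch, such events could be looked up directly in the database; the event table name tbgpeventlog and its column names are assumptions here, not confirmed by this material:

db2 "select * from bgpsysdb.tbgpeventlog where component = 'KERNEL' and severity = 'FATAL'"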