Troubleshooting and Maintaining Your Servers

CHAPTER 28

Now that you have all your Informix database servers online, you don’t have anything else to worry about, do you? I think that we all know the answer to that question! This chapter describes strategies that help you tune and monitor your Informix database servers. Learn the day-to-day tasks that can help keep your systems healthy for a long time. In the event that problems do occur, I help you understand how to solve them. This chapter helps you learn the following:

• Things to consider before creating your Informix instance. Proper planning can help you make excellent long-term decisions for the usage of disks, memory, CPU, and database configuration. This chapter helps you make these decisions.

• How to perform ongoing system monitoring. Many basic tasks can help you understand what is going on with your system, prevent problems later, and keep the system running at its best levels. I show you some specific commands and suggest strategies.

• Methods of tuning and checking your databases. You have different ways to help keep your systems charged up and error-free for the long term. Many of these tasks need not be performed more than once a week, or even once a month. Some of these tasks are shown.

• How to troubleshoot problems. A big part of troubleshooting is a learning process. However, thinking things through and following basic procedures help you more easily get to the bottom of problems. As a whole, this chapter helps you determine the troubleshooting process.

Creating a Healthy Environment

The old adage says, “Prevention is the best medicine.” This is just as true in Informix server administration. Installing Informix, bringing it online, and creating some databases may be easy, but they are only the beginning. A key to long-term success is creating an environment that is constantly being tuned for its own needs. In this section, I discuss some of the ways to build for the future of your databases.

Ron Flannery, One Point Solutions


756 The Informix Handbook

■ Setting the Initial Parameters

You have many things to consider when creating your Informix instance. Properly configuring certain parameters is key to long-term success. It may seem like a lot of work now, but it can certainly be easier than re-creating your whole instance after it has been online for a while. Some of the most important parameters to estimate are:

• Growth of your tables

• Insert, update, and delete activity

• The most active tables

• Level of database update during peak hours

• Number of users that concurrently access the data

• Types of things the users need from the databases, both short- and long-term

• The configuration of your operating system and computer (the needs for this might change, depending on the needs of your database)

You can use these estimates to set the following crucial parameters in the instance configuration (onconfig file):

• Size and location of dbspaces, including root (ROOTPATH)

• Number of locks (LOCKS)

• Location and size of the logical and physical logs (LOGFILES, LOGSIZE, PHYSDBS, PHYSFILE)

• Location and size of the temp dbspaces (DBSPACETEMP)

• Configuration of buffers and LRUs and other shared memory parameters (BUFFERS, CKPTINTVL, CLEANERS, LRU_MAX_DIRTY, LRU_MIN_DIRTY, LRUS)

• Number of CPUs on your system and how Informix uses them (AFF_NPROC, AFF_SPROC, MULTIPROCESSOR, NUMAIOVPS, NUMCPUVPS, SINGLE_CPU_VP)

• Amount of memory on your system and Informix instance (SHMADD, SHMTOTAL, SHMVIRTSIZE)

• Your needs for DSS (data warehouse, for example) applications (DS_MAX_QUERIES, DS_MAX_SCANS, DS_TOTAL_MEMORY)

A lot of the information in this chapter (including the previous list of onconfig values) is discussed in various chapters in this book and is not detailed here. For specific references, see the “For More Information” section at the end of this chapter.

Of course, to set the perfect values for these parameters before you actually have users on the system is almost impossible. In fact, the rest of this chapter discusses ways to monitor these and other parameters, changing them if necessary. For now, please remember that the initial configuration can greatly reduce the amount of work needed later. A proper initial setup also makes the ongoing maintenance a process of improving rather than fixing.

Creating a new Informix instance can involve a lot of steps that can be difficult to repeat. Sure, you can take notes about what you did, but if you have to re-create the instance for some reason, missing a step can be easy. Rather than using onmonitor or manually typing each command, I suggest creating a shell script to create the initial dbspaces and other parameters. For example, the following lines create a new dbspace with a size of about 2 GB (sizes are given in kilobytes) and then add a second chunk of the same size to it:

onspaces -c -d dbspace1 -p /dev/disk1 -o 0 -s 2000000
onspaces -a dbspace1 -p /dev/disk2 -o 0 -s 2000000

Your script should include all the dbspaces and chunks that you think are needed for your system.
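One way to keep such a script repeatable is to drive it from a small plan file and print the commands for review before running them. The following is a minimal sketch, not a definitive tool: the plan format (dbspace name, device path, offset, and size in KB on each line) and the function name print_onspaces_cmds are assumptions made for illustration. It only prints onspaces commands, using the same flags as the example above.

```sh
#!/bin/sh
# Sketch: turn a dbspace plan into onspaces commands (printed, not executed).
# The plan format "name path offset size_kb" is an assumption for illustration.
print_onspaces_cmds() {
    awk '
        /^#/ || NF < 4 { next }            # skip comments and short lines
        !($1 in seen) {                    # first chunk creates the dbspace
            seen[$1] = 1
            printf "onspaces -c -d %s -p %s -o %s -s %s\n", $1, $2, $3, $4
            next
        }
        {                                  # later chunks are added to it
            printf "onspaces -a %s -p %s -o %s -s %s\n", $1, $2, $3, $4
        }
    '
}

# Example (plan file name is hypothetical):
#   print_onspaces_cmds < dbspace_plan.txt
```

Printing rather than executing keeps the step reviewable; once the output looks right, it can be saved as the instance-creation script.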

The best approach is probably to ensure that you are using the latest version of applicable Informix products for your hardware and operating system. Each version improves on past versions, optimizing performance and fixing bugs along the way. You can check the current production information by visiting the Informix Web site at www.informix.com.

The initial IDS configuration is discussed in detail in Chapter 21, “Setting Up UNIX Database Servers.”

■ Setting Up an Alarm Program

Informix allows you to provide a program or shell script that can take certain actions in the event of problems. If certain Informix or operating system errors occur, this program is executed. The full pathname to this program is provided in the onconfig file as the parameter ALARMPROGRAM. The program can allow you to do things like page the Informix administrator or display a message on the system console. The program is passed the parameters in the order that they are summarized in Table 28.1.

Table 28.1 ■ Values passed to the ALARMPROGRAM

Summary Description

Severity: The severity of the event, from 1 to 5. The values are summarized: 1=small configuration changes; 2=informational message (no error) about routine events; 3=attention: something occurred that requires attention but does not prevent use of the system; 4=emergency: something happened that could compromise the data or the instance; 5=fatal error that causes the database server to go offline.

Class ID: A numerical representation of the error that occurred. The Administrator’s Guide to Informix Dynamic Server summarizes a list of over 20 different error numbers and descriptions. The description of the associated error message is actually given in the next parameter, “Class Message.”

Class Message The text of the message that represents Class ID. An example is “Logical Logs are full—Backup is needed.”

Specific Message: More detail on the error that occurred. This is likely the same message that is written to the Informix log file.

Extra Info Path If appropriate, pathname to a file that contains more information about the error.

For example, if you want to send an e-mail to the Informix administrator for error levels of at least 3 and send a message to her pager for levels 4 and 5, you could create a shell script as in Listing 28.1 and set ALARMPROGRAM to point to it.


LISTING 28.1

A sample alarm program.

#!/bin/sh
SEVERITY=$1        # first argument to this shell script
CLASS_ID=$2
CLASS_MSG=$3
SPECIFIC_MSG=$4
INFO_PATH=$5

if [ "$SEVERITY" -ge 4 ]    # severity greater than or equal to 4 - a bad thing
then
    SEV_MSG="****** THIS ERROR REQUIRES IMMEDIATE ATTENTION!! *******"
else
    SEV_MSG="WARNING:"
fi

# place datetime and error message information in a temp file
echo "INFORMIX MESSAGE AT `date`:
$SEV_MSG
SEVERITY LEVEL: $SEVERITY
CLASS ID: $CLASS_ID CLASS MSG: $CLASS_MSG
SPECIFIC MSG: $SPECIFIC_MSG
EXTRA INFO: $INFO_PATH" > /tmp/$$.log

if [ "$SEVERITY" -ge 3 ]    # severity greater than or equal to 3
then
    mail dba_list < /tmp/$$.log    # send msg from file to email alias dba_list
    if [ "$SEVERITY" -ge 4 ]       # severity greater than or equal to 4
    then
        # run shell script to dial dba pager with return phone# + 911
        dba_pager.sh 5551212911
    fi
fi

# write to a system log file
cat /tmp/$$.log >> /usr/informix/logs/err_msg.log
/bin/rm /tmp/$$.log    # clean up

When a condition occurs to signal an alarm, this shell script is called, triggering the appropriate events. Setting up this script notifies people immediately, helping correct the problems and avoiding destructive downtime. Suppose that this script is called ifmxalarm.scr. The entry in the onconfig file that tells Informix to use this script is as follows:

ALARMPROGRAM /usr/apps/tools/bin/ifmxalarm.scr

Performing Routine Informix Tasks

An administrator can do several things to monitor the performance of database servers. Regularly checking your databases and servers can help prevent performance bottlenecks and more serious problems. This section outlines some tasks that can be performed on a regular basis. The sidebar by Clem Akins, former Informix Technical Support Engineer, provides some basic tips on the process involved with tuning a system.


Be sure to use automated tools to provide a numerical history of your system. I was at a customer site that received a call from the IS manager. The manager swore that the tuning changes we made two weeks before had cut the speed of her application in half. When she provided a graph of the application speeds, the daily graph showed an almost flat trend: no performance loss at all. She did have a bad day that ran at double the normal time for a little while, but even that was not as low as it had been before our changes.

The moral of the story is to use science, not gut intuition, to tune your system. Gather data via computer programs, graph your statistics, track them in good times and bad, and map the changes to their resulting measurable effects on the system.

■ Using the Banner Line

A quick way to check on the status of your Informix instance is to run the command onstat -. This displays a simple banner line that indicates current status. The banner line is displayed on many onstat commands. Here is an example of this output of onstat -:

INFORMIX-OnLine Version 7.23.UC1 -- On-Line -- Up 2 days 13:58:15 -- 10336 Kbytes

The information includes:

• Informix version: 7.23.UC1

• Status of the instance: On-Line

• Length of time the instance has been online: Up 2 days 13:58:15

• Amount of memory being used by the instance: 10336 Kbytes

One more line can be in the display of onstat -. It indicates whether Informix is “blocked” because of a checkpoint, long transaction, media failure, or other reasons. The above line indicates a healthy server.
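Because the banner fields are separated by " -- ", the status can be picked out mechanically, for example in a cron job that checks whether the instance is online. The following is a small sketch under that assumption; the function name banner_mode is hypothetical, and the function only parses text that it reads on standard input.

```sh
#!/bin/sh
# Sketch: pull the status field out of the onstat banner line.
# Reads banner text on stdin; fields are assumed to be separated by " -- ".
banner_mode() {
    awk 'BEGIN { FS = " -- " } / Version / { print $2; exit }'
}

# Example: onstat - | banner_mode
```

For the sample banner shown above, the function prints On-Line.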

■ Using the Users

The people who are regularly using the system most likely notice any changes in the system’s behavior. They are the ones that perform the queries and they notice things that are different in their normal response times. If you don’t hear anything from them about differences in the behavior of the system, you might occasionally want to ask them. If you don’t discover problems with the Informix instance, you might find out about queries or programs that need tuning.

■ Using onmonitor

The onmonitor command is a front-end tool to many of the onstat and other commands discussed in this chapter. It prevents you from having to remember the syntax of each command, but is more limited in the information that it provides to you. It provides a user with a lot of status information and provides administrators with the ability to perform just about any administrative function.

Informix has begun to phase out onmonitor, replacing it with graphical tools like IECC, ISA, and onweb. Check your documentation to find the appropriate command for your system.

HOW TO TUNE YOUR SYSTEM


For the purpose of this chapter, I discuss only the informational portion of onmonitor. To start it, type onmonitor from the UNIX command line. If you are user informix or root, select Status from the first menu. You see the screen displayed in Figure 28.1.

The menu options are summarized as follows:

• Profile. Displays a profile of activity on the Informix instance (see Figure 28.2). This has much of the same information as onstat -p.

• Userthreads. Displays information on the currently active user threads. This display is much like that given by onstat -u.

• Spaces. Shows information about all the dbspaces for the current instance. This is much like the output of onstat -d but provides a friendly way to see the chunks in a dbspace.

• Databases. Provides a display of all databases in the current instance. It includes database name, owner, dbspace, creation date, and logging status. This information isn’t available directly through an onstat command. See Figure 28.3 for an example.

• Logs. Shows the status of the current physical and logical logs. This is like the onstat -l command.

• Archive. Shows information on the last level 0, 1, and 2 archives. This can be obtained through oncheck -pr but is harder to find.

• Data-Replication. Gives information about current data replication parameters, including those in the onconfig file.

• Output. Allows you to choose a file to receive status information.

• Configuration. Prompts you to select a file to display the onconfig file. This output can be obtained by reading the onconfig file or using onstat -c.

• Exit. Exits the menu.

Figure 28.1 ■ The onmonitor status menu.


Figure 28.2 ■ Display of the Profile menu option.

Figure 28.3 ■ Display of the Databases menu option.

The output of Figure 28.2 is the screen that displays if you select the Profile option in Figure 28.1. The output of Figure 28.3 is the screen that displays if you select the Databases option in Figure 28.1. Note the Dbspace and Log Status columns: this information is harder to obtain by other means.


■ Watching the Message Log

The Informix message log is a file that tracks messages about its Informix instance. It shows the results of normal activity and errors. The message log is found in the file specified by the MSGPATH parameter in the onconfig file. To find the value of MSGPATH, you can type one of the following commands:

grep MSGPATH $INFORMIXDIR/etc/$ONCONFIG
onstat -c | grep MSGPATH

The pathname to the file is also displayed at the beginning of the output of onstat -m, which displays the last 20 lines of the message log.
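A plain grep also matches comment lines that happen to mention MSGPATH. As a hedged alternative, the sketch below prints only the value of a named parameter; the function name onconfig_value is hypothetical, and the sketch assumes the usual onconfig layout of a parameter name followed by its value on one line.

```sh
#!/bin/sh
# Sketch: print the value of one onconfig parameter, ignoring comment lines.
# Reads onconfig-style text ("NAME value") on stdin.
onconfig_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example: onconfig_value MSGPATH < "$INFORMIXDIR/etc/$ONCONFIG"
```

The result can be captured in a variable and reused by the log-watching commands later in this section.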

If you have multiple instances of Informix on the same machine, you can implement naming conventions that simplify your administration. You can choose a simple name like dev or prod and use it across different values used by your Informix instance. This includes the naming of the onconfig file (described by the environment variable $ONCONFIG), its server names (described by onconfig values DBSERVERNAME and DBSERVERALIASES), and the message log (onconfig value of MSGPATH).

For example, if you have two Informix instances on your machine, one for test and one for production, the onconfig files can be called onconfig.test and onconfig.prod; the server names (DBSERVERNAME and DBSERVERALIASES) can be test_shm, test_tcp, prod_shm, and prod_tcp; and the message files (MSGPATH) can be called test.log and prod.log. The consistent naming of onconfig, its server values, and the message log makes each instance easier to identify and administer.

Informix updates the message log file with messages about that server’s performance, including:

• Time and duration of checkpoints

• Information about when Informix was started or shut down

• Information about critical errors that forced Informix to shut down

• Information about system backups and restores

• Status of logical log backups

• Changes in parameters in the onconfig file

The message file continues to grow until you either delete it or archive it. To maintain a history of the activity on your system, your best bet is often to compress and save the old message log files. For example, you can create a cron job that runs once a month and compresses the previous month’s log file. You can give the compressed file a name that reflects its month, such as:

prod_msg.log.200001

After you archive and delete the previous month’s message log, Informix automatically begins a new one in the same location (MSGPATH in onconfig).

Log files contain a lot of information when an error occurs in your Informix instance. This information can be useful to you and/or Informix technical support. Sometimes the log file contains a recommendation for how to correct the problem, as well as dump and other informational files. Listing 28.2 is an example of the output of a log that indicates a serious system problem.
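The monthly archive step can be sketched as a small shell function. This is an illustration under stated assumptions rather than a tested production job: the function name archive_msglog and the COMPRESSOR override are hypothetical, gzip is assumed as the default compressor (so the saved file gains a .gz suffix), and the month stamp is passed in by the caller.

```sh
#!/bin/sh
# Sketch: archive a message log under a month-stamped name, then compress it.
# COMPRESSOR is an assumed override hook; it defaults to gzip.
archive_msglog() {
    log=$1
    stamp=$2                                  # for example 200001
    [ -f "$log" ] || return 1                 # nothing to archive
    mv "$log" "$log.$stamp" || return 1
    "${COMPRESSOR:-gzip}" "$log.$stamp"
}

# Example: archive_msglog /usr/informix/prod_msg.log 200001
```

A cron entry that runs on the first day of each month could call this function with the previous month’s stamp.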


LISTING 28.2

Example of a serious error displayed in a message log.

10:51:10 Assert Failed: Page Check Error in btcurrent:bad current node
10:51:10 Who: Session(131, ron, 23994, 554376032) Thread(1371, sqlexec, 21090fec, 4)
10:51:10 Results: Possible inconsistencies in 'xx01abcd:"informix".customers'
10:51:10 Action: Run 'oncheck -cDI 6449916'
10:51:10 See Also: /DUMPDIR/af.56cb0d, gcore.56cb0d.0, /DUMPDIR/prob/core
10:51:10 Stack for thread: 1402 sqlexec
base: 0x21a66010
len: 66048
pc: 0x0834f884
tos: 0x21a75174
0x0834f884 afstack
0x0834ffaa mt_affail
0x08259432 bffail
0x08212940 btcurrent
0x08305149 find_page
0x08306684 rsread
0x082e9c0f isread
0x08362b97 fmread
0x08094273 sqisread
0x080b35ac gettupl
0x080b37a8 scan_next
0x080c49bd hjoin_open
0x080b5d43 prepselect
0x0815f0ab open_cursor
0x081a7241 ip_scurstart
0x081a79e4 ip_evalcursor
0x081aa1c7 ip_fetch
0x080b4a71 getrow
0x0816e555 sqmain

This output might be ugly, but it’s useful. The information displayed includes:

• Type of error: Page Check Error in btcurrent:bad current node

• User name, session, and thread: Session(131, ron, 23994, 554376032) and Thread(1371, sqlexec, 21090fec, 4)

• Table name: 'xx01abcd:"informix".customers'

• Possible corrective action: Action: Run 'oncheck -cDI 6449916'

• Files with more information: /DUMPDIR/af.56cb0d, gcore.56cb0d.0, /DUMPDIR/prob/core

• A stack trace

Informix technical support can use the log file to determine the exact cause of errors. For more information, see “Correcting and Troubleshooting Problems” later in this chapter.

A good idea is to monitor the messages in the message log. Here are some different ways to do this:
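When a log holds many such entries, the recommended actions can be pulled out on their own. The sketch below assumes that each recommendation appears on a line containing "Action: ", as in Listing 28.2; the function name recommended_actions is hypothetical.

```sh
#!/bin/sh
# Sketch: list the corrective actions recommended in a message log.
# Reads log text on stdin and prints the text that follows "Action: ".
recommended_actions() {
    sed -n 's/^.*Action: //p'
}

# Example: recommended_actions < /usr/informix/prod.log
```

For the entry in Listing 28.2, this prints the oncheck command that the server suggests running.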


• Use the Informix command onstat -m. This is a quick way to see whether everything is normal. This command also shows the full pathname to the message log file. Listing 28.3 shows an example of some of the output of this command.

LISTING 28.3

Sample output of onstat -m.

Message Log File: /usr/informix/online_prod.log
Wed Sep 17 00:01:28 1997

00:01:28 Checkpoint Completed: duration was 0 seconds.
00:31:05 Checkpoint Completed: duration was 1 seconds.
01:01:06 Checkpoint Completed: duration was 1 seconds.
01:01:25 Level 0 Archive started on rootdbs, dbspace1
01:31:27 Checkpoint Completed: duration was 0 seconds.
02:01:27 Checkpoint Completed: duration was 0 seconds.
02:02:12 Archive on rootdbs, dbspace1 Completed.
02:02:58 Logical Log 95 Complete.
02:31:28 Checkpoint Completed: duration was 0 seconds.
...

• Use different operating system commands to parse the file. For example, on UNIX, the commands grep, tail, and lp are just a few. Consider this: Suppose that your message log file is located in /usr/informix/prod.log. If you want to search for the logical logs that were recently filled, you can issue the command:

grep -i "logical log.*complete" /usr/informix/prod.log | tail

• Read the message log file into a text editor and search for certain messages. For example, you can search for the words error, failure, transaction, aborted, roll, etc. If using vi, be sure to ignore case for your searches by typing :set ic.

• Create an operating system process that watches the file for certain error messages. This can be created as a background job that is constantly running (you can use cron in UNIX). For example, the script shown in Listing 28.4 watches for Assert Failed in the log and sends e-mail to dba_list if the message is found.

LISTING 28.4

Shell script to watch the error log.

#!/bin/sh

while true
do
    # look for the message in the last lines of the log
    x=`tail /usr/informix/prod.log | grep "Assert Failed"`
    if [ "$x" ]
    then
        echo "ASSERT FAILED FOUND IN ERROR LOG!" | mail dba_list
    fi
    sleep 60
done


• Watch the interval and duration of checkpoints. The interval is specified by the onconfig parameter CKPTINTVL. If the duration of checkpoints is more than 2 or 3 seconds, or it starts getting longer, it may be time to tune your checkpoint and other settings. Also, if the interval between checkpoints begins getting shorter than CKPTINTVL, you may need to tune your physical log parameters. For example, if the checkpoint interval is set to 1800 (30 minutes) and checkpoints are occurring every 20 minutes, the time might be right for some changes. In Listing 28.3, for example, notice that the checkpoint interval is about 30 minutes and the duration is 0 or 1 second. If the interval or duration changed, it might be time to do some tuning. To look at all the checkpoint intervals in your current log file, you can run the following UNIX command:

grep "Checkpoint Completed" /usr/informix/prod.log

Or to see just the last 20 checkpoints, use the tail command, as in:

grep "Checkpoint Completed" /usr/informix/prod.log | tail -20

See Chapter 4, “Understanding Informix Architecture,” for a detailed discussion of checkpoints and the parameters that affect them. Some of the appropriate parameters include CKPTINTVL, LRUS, LRU_MAX_DIRTY, LRU_MIN_DIRTY, BUFFERS, and CLEANERS.

• Watch for logical logs that are filling up too quickly. For example, if you see the message Logical Log xx Complete too often, you might want to change the number or size of the logical logs.

For more information about checkpoints, logical logs, and more, see Chapter 4, “Understanding Informix Architecture.”
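The checkpoint-duration check described above can be automated with a short filter. This is a minimal sketch, assuming the message format shown in Listing 28.3 ("Checkpoint Completed: duration was N seconds."); the function name slow_checkpoints is hypothetical, and the default threshold of 2 seconds simply follows the rule of thumb given in the list.

```sh
#!/bin/sh
# Sketch: report checkpoints whose duration exceeds a threshold in seconds.
# Reads message-log text on stdin; prints the timestamp and the duration.
slow_checkpoints() {
    awk -v max="${1:-2}" '
        /Checkpoint Completed/ {
            secs = $(NF - 1)               # "... duration was N seconds."
            if (secs + 0 > max + 0) print $1, secs
        }
    '
}

# Example: slow_checkpoints 2 < /usr/informix/prod.log
```

An empty result means that no checkpoint in the log took longer than the threshold.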

■ Watching the System Performance Profile

Informix allows you to get a snapshot of many of its performance statistics with the command onstat -p. Listing 28.5 displays sample output from the onstat -p command.

LISTING 28.5

Output of the onstat -p command.

Profile

dskreads pagreads bufreads %cached dskwrits pagwrits bufwrits %cached

12889 271012 149913 91.40 150 632 1000 85.00

isamtot open start read write rewrite delete commit rollbk

151649 5416 7881 113855 1357 6 230 217 0

ovlock ovuserthread ovbuff usercpu syscpu numckpts flushes

0 0 0 107.22 92.05 175 9590

bufwaits lokwaits lockreqs deadlks dltouts ckpwaits compress seqscans

208 0 163575 0 0 34 259 115

ixda-RA idx-RA da-RA RA-pgsused lchwaits

308 9 35 313 6

Some of the items in this output include:

• Number of disk reads: dskreads


• Disk read cache percentage (generally should be in the mid-90s): %cached

• Number of disk writes: dskwrits

• Disk write cache percentage (generally should be in the mid-80s): %cached

• Overflows of different items (should be at or near zero): headings with ov

• Waits (should be a reasonable number and not growing too quickly): headings with waits

• Sequential scans (too many could show a need for indexes): seqscans

Using the profile information gives you a snapshot of your system. For more detail on the values displayed by this command, see Chapter 33, “System Tuning,” or Chapter 26, “Administration Utilities.”
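The sample values in Listing 28.5 are consistent with computing each %cached figure as 100 * (buffer operations - disk operations) / buffer operations. The sketch below applies that formula so that you can recompute the percentages from saved counters; the function name cache_pct is hypothetical.

```sh
#!/bin/sh
# Sketch: cache percentage = 100 * (buffer ops - disk ops) / buffer ops.
# Arguments: disk count, buffer count (dskreads bufreads, or dskwrits bufwrits).
cache_pct() {
    awk -v disk="$1" -v buf="$2" 'BEGIN {
        if (buf + 0 == 0) { print "0.00"; exit }   # avoid division by zero
        printf "%.2f\n", 100 * (buf - disk) / buf
    }'
}

# Examples with the values from Listing 28.5:
#   cache_pct 12889 149913    (reads)
#   cache_pct 150 1000        (writes)
```

With the listing’s counters, the two calls give 91.40 for reads and 85.00 for writes, matching the %cached columns.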

■ Status of Logical and Physical Logs

The logical and physical logs are used to maintain the consistency of your system. They are described in detail in Chapter 4, “Understanding Informix Architecture,” and in the Administrator’s Guide for Informix Dynamic Server. The current status of these logs is given in the output of the onstat -l command, as shown in Listing 28.6.

LISTING 28.6

Output of onstat -l command.

Physical Logging

Buffer bufused bufsize numpages numwrits pages/io

P-1 0 16 4000 324 5.01

phybegin physize phypos phyused %used

10003f 1000 611 0 0.00

Logical Logging

Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io

L-2 0 16 2654 285 241 9.3 1.2

address number flags uniqid begin size used %used

a1b3ee6 1 U-B---- 61 100300 250 250 100.00

a1b3ee7 2 U---C-L 62 100550 250 249 99.60

a1b3ee8 3 U-B---- 57 100800 250 250 100.00

Some of the information to watch includes:

• Physical log write and i/o information: numwrits, pages/io. If pages/io is very close to bufsize, consider increasing your physical log buffer size. For example, in this output, if pages/io were 15.78, you should consider increasing the physical log buffer size.

• Physical log used information (if these are too high, consider changing physical log parameters to a higher value): phyused (pages) and %used. When the physical log is 75 percent full, a checkpoint is initiated. Thus, if you notice that checkpoints are occurring too soon because of the physical log becoming 75 percent full, you will probably want to increase physical log size or change the LRU settings. See Chapter 4, “Understanding Informix Architecture,” for more information about how to do so.

• The flags indicate logical log status. If you are noticing too many logs filling that are not backed up, you may need to change the size and number of logical logs. A log is backed up when its status is B.

• Percentage used. High numbers for the logs not backed up (no B flag) indicate that you need to back up the logs and possibly change the log file parameters.

• Logical log write and i/o information: numwrits, pages/io. If pages/io is very close to bufsize, consider increasing your logical log buffer size. For example, in the above output, if pages/io were 15.78, you should increase the logical log buffer size.

The usage of logical logs depends on the backup strategy of the logical logs and logging strategy of its databases. The more transaction (logged) databases you have, the more quickly the logs fill. You may need to consider this as part of your log backup strategy: Will you use continuous or automatic log backups?

Be very careful to ensure that the logical logs are never completely filled. This situation causes processing of your instance to stop for all databases and, in an extreme case, can mean restoration of all your databases from tape. Monitoring the logs and following proper log backup procedures should prevent this from happening. Logs generally fill when they are either not backed up or a long transaction occurs.

You also need to ensure that no single update transaction occupies more log space than the percentage listed in the onconfig parameters LTXHWM and LTXEHWM. If a transaction spans a percentage of logs more than the number in LTXHWM, the transaction is rolled back.
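The flag and percentage-used checks described in this section can be combined in a small filter over onstat -l output. The sketch below assumes the column layout of Listing 28.6 (eight columns per logical log row, with the flags in the third column); the function name unbacked_logs is hypothetical.

```sh
#!/bin/sh
# Sketch: list logical logs that are in use (flag U) but not backed up (no B).
# Reads onstat -l text on stdin; prints the log number and percentage used.
unbacked_logs() {
    awk 'NF == 8 && $3 ~ /^U/ && $3 !~ /B/ { print $2, $8 }'
}

# Example: onstat -l | unbacked_logs
```

For Listing 28.6 only log 2 is reported. It carries the C flag, which suggests that it is the current log, so a single entry like this is expected; a growing list is the signal to back up the logs.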

■ Watching the Buffers

The physical log buffers work in conjunction with the physical logs to preserve “before images” of data during update operations. In addition, the buffer pool in shared memory is used to process updates to the databases. Proper tuning of these parameters can greatly enhance the performance of Informix. Also, the physical log and the buffer pool are crucial in the Informix recovery strategy.

The tuning of the buffer pool is done in conjunction with the onconfig parameters BUFFERS, LRUS, CLEANERS, LRU_MAX_DIRTY, and LRU_MIN_DIRTY. The BUFFERS are divided into the number of LRU queues, the number of which is denoted by the LRUS parameter. In each queue are free (f) and modified (m) buffers. The page cleaners (CLEANERS) help flush the queues when they’re too full or when a checkpoint occurs. If the percentage of modified buffers exceeds the percentage in LRU_MAX_DIRTY, page cleaning begins. The buffer pool is also cleaned at a checkpoint. If this page cleaning is not properly tuned, the performance of Informix can suffer. The command onstat -R is used to monitor LRU usage and is displayed in Listing 28.7.

LISTING 28.7

Output of the onstat -R command.

4 buffer LRU queue pairs
# f/m length % of pair total
0 f 50 100.0% 25
1 m 0 0.0%
2 f 25 100.0% 25
3 m 0 0.0%
4 f 25 100.0% 25
5 m 0 0.0%
...
0 dirty, 200 queued, 200 total, 256 hash buckets, 2048 buffer size
start clean at 60% (of pair total) dirty, or 15 buffs dirty, stop at 50%

If you notice that the percentage of modified (m) buffers tends to be close to LRU_MAX_DIRTY, you see more page cleaning activity.

The buffers also influence the checkpoint activity. If the buffers become too “dirty,” the checkpoints take longer to clean the buffers.

Three types of writes are done when Informix flushes the buffer pool. From least efficient to most efficient, these types are:

• Foreground Writes: These occur when an sqlexec thread needs to locate an empty buffer to read information into and can’t find one. The sqlexec thread flushes buffer pages to disk to make room for the data that is being read in. Performance can suffer if it does so too often. To prevent this degradation, you can add more page cleaners and/or lower the LRU_MAX_DIRTY and LRU_MIN_DIRTY parameters.

• LRU Writes: These occur when the percentage of dirty buffers is greater than the LRU_MAX_DIRTY percentage. The page cleaners perform LRU writes.

• Chunk Writes: These occur during checkpoints and are performed by the page cleaners. These are the most efficient because they are sorted writes. The modified pages are sorted before being flushed to disk. This approach minimizes head movement on the disks and allows the use of the big buffers in shared memory. In a healthy OLTP system, however, you will want to balance LRU writes and chunk writes, because chunk writes cause longer checkpoints.

The onstat -F command can monitor all these types of writes, as shown in Listing 28.8.

LISTING 28.8

Output of the onstat -F command.

Fg Writes LRU Writes Chunk Writes
0 6 2403

address flusher state data
a262334 0 I 0 = 0X0

states: Exit Idle Chunk Lru

Tuning buffer activity is a delicate art that requires a lot of experimentation. I advise changing only one parameter at a time and then using these monitoring tools to fine-tune your changes.
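Because foreground writes are the least efficient type, a nonzero count is a useful value to watch between tuning changes. The sketch below reads the count from onstat -F output, assuming the layout of Listing 28.8 (a header line that begins with "Fg Writes" followed by a line of counts); the function name fg_writes is hypothetical.

```sh
#!/bin/sh
# Sketch: print the foreground-write count from onstat -F output on stdin.
# The count is taken to be the first value on the line after the header.
fg_writes() {
    awk '/^Fg +Writes/ { if ((getline) > 0) print $1; exit }'
}

# Example: onstat -F | fg_writes
```

Recording this value before and after each single-parameter change gives a simple measure of whether the change helped.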

■ Tracking Disk Usage

Disk configuration can often be a tough balance in Informix Dynamic Server. A prevailing goal is to make the most efficient use of your disks while not filling them up. Disk layout should be planned carefully, especially when you are first creating the Informix instance. Obviously, you cannot completely predict the usage of your system, but you want to put the most you can into the planning of disk space. If you need to change something, you can add chunks later. Listing 28.9 is the output of onstat -d, which shows the current dbspaces and chunks in your Informix instance.


LISTING 28.9

Output of the onstat -d command.

Dbspaces

address number flags fchunk nchunks flags owner name

a28e3e 1 1 1 1 N informix rootdbs

a28e4e 2 1 2 1 N informix dbspace

a28f2a 3 1 3 1 N T informix tempdbs

2 active, 2047 maximum

Chunks

address chk/dbs offset size free bpages flags pathname

a26f220 1 1 0 25000 10000 PO- /usr/informix/rootdbs

a26f240 2 2 0 60000 45000 PO- /usr/informix/chunk1

a26f290 3 3 0 50000 49000 PO- /usr/informix/tempdbs


The output is divided into two parts: dbspace information and chunk information. In a nut-shell, a dbspace is a collection of chunks. Each chunk represents all or part of a physical disk drive.The dbspaces are displayed in the top part of the output followed by the chunks that map to them.The “number” field in the dbspaces section is the dbspace number, which maps to the “dbs” fieldin the chunks portion of the screen. For this discussion, I look only at the chunk usage.

The columns that are of particular interest are size and free. These columns define inpages the size of the chunk and how much of it remains free. The page size can be found by usingthe command onstat -b.You must monitor the amount of available pages, particularly in the rootdbspace: If it fills, your whole instance can be forced offline.

You can find a more detailed discussion of dbspace usage in Chapter 8, “Creating Databases and Tables.”

Also watch the reads and writes per chunk. These can be monitored using the onstat -D command as shown in Listing 28.10.

LISTING 28.10

Output of the onstat -D command.

Dbspaces

address number flags fchunk nchunks flags owner name

a28e3e 1 1 1 1 N informix rootdbs

a28e4e 2 1 2 1 N informix dbspace

a28f2a 3 1 3 1 N T informix tempdbs


Chunks

address chk/dbs offset page Rd page Wr pathname

c0d21218 1 1 0 173 4570 /usr/informix/rootdbs

c0d216f8 2 2 0 245545 49995 /usr/informix/chunk1

c0d217d8 3 3 0 43593 49192 /usr/informix/tempdbs


Another command you can use to watch the activity for certain chunks is onstat -g iof. Listing 28.11 is the output of this command.


LISTING 28.11

Output of the onstat -g iof command.

AIO global files:
gfd  pathname   totalops  dskread  dskwrite  io/s
3    rootdb1    15562     12746    2816      0.0
4    dbspace1   64        64       0         0.0

If you notice that one of the chunks has a high number of reads or writes when compared to other chunks, it could be causing performance problems. In this case, you can consider redistributing the data on that chunk.
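To make that comparison easier, you can turn the raw page counts into percentages. This is a rough sketch under stated assumptions: the function name chunk_io_share is hypothetical, and the field positions (page Rd in field 5, page Wr in field 6) are taken from the chunk lines in Listing 28.10.

```shell
# Hypothetical helper: show each chunk's share of total page reads + writes.
# Reads `onstat -D` output on stdin; chunk lines are assumed to have seven
# fields as in Listing 28.10 (page Rd = field 5, page Wr = field 6).
chunk_io_share() {
    awk '
        /^Chunks/ { in_chunks = 1; next }
        in_chunks && NF == 7 && $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
            n++
            path[n] = $NF
            io[n] = $5 + $6
            total += io[n]
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %.0f %.1f%%\n", path[i], io[i], (total > 0 ? 100 * io[i] / total : 0)
        }'
}

# Typical use (requires a running instance):
#   onstat -D | chunk_io_share
```

A chunk that carries most of the I/O while its neighbors sit idle is a candidate for redistributing data.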

■ Watching the VPs

The virtual processors (VPs) help Informix process things efficiently. VPs are “mini CPUs” that are actually operating-system processes. These VPs handle user “threads” and provide the multithreading capabilities of Informix Dynamic Server. You want to have enough VPs and have them properly configured.

The command onstat -g glo provides information about all the current virtual processors as well as statistics for each class of virtual processor (cpu, for example). You can use the results of this command to look for classes of VPs that show an inordinate amount of activity, which might call for adding more VPs. Listing 28.12 displays an example of the onstat -g glo output.

LISTING 28.12

Output of the onstat -g glo command.

MT global info:
sessions  threads  vps  lngspins
32        73       16   0

          sched calls  thread switches  yield 0   yield n  yield forever
total:    25043676     5157827          19988544  86356    2075996
per sec:  1577         1576             0         1        775

Virtual processor summary:
 class  vps  usercpu  syscpu  total
 cpu    3    1550.81  175.54  1726.35
 aio    8    88.33    404.99  493.32
 shm    1    10.36    10.58   20.94
 lio    1    0.28     1.32    1.60
 pio    1    0.24     1.09    1.33
 adm    1    1.42     3.70    5.12
 msc    1    0.04     0.03    0.07
 total  16   1651.48  597.25  2248.73

Individual virtual processors:
 vp  pid    class  usercpu  syscpu  total
 1   17850  cpu    320.16   81.55   401.71
 2   17852  adm    1.42     3.70    5.12
 3   17853  cpu    779.39   53.63   833.02
 4   17854  cpu    451.26   40.36   491.62
 5   17855  lio    0.28     1.32    1.60
 6   17857  pio    0.24     1.09    1.33
 7   17862  aio    9.10     47.77   56.87
 8   17866  msc    0.04     0.03    0.07
 9   17869  aio    12.38    54.48   66.86
 10  17871  aio    12.31    56.77   69.08
 11  17876  aio    11.48    49.05   60.53
 12  17880  aio    10.64    46.15   56.79
 13  17887  aio    11.06    48.38   59.44
 14  17892  aio    10.14    51.07   61.21
 15  17896  aio    11.22    51.32   62.54
 16  17903  shm    10.36    10.58   20.94
            tot    1651.48  597.25  2248.73
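One way to spot a class with an inordinate amount of activity is to express each class's CPU time as a share of the total. The following sketch assumes the summary section is laid out as in Listing 28.12 (class, vps, usercpu, syscpu, total); the function name vp_class_share is hypothetical.

```shell
# Hypothetical helper: percentage of total VP CPU time used by each class.
# Reads `onstat -g glo` output on stdin; the summary layout (class, vps,
# usercpu, syscpu, total) is assumed from Listing 28.12.
vp_class_share() {
    awk '
        /^Virtual processor summary/     { in_sum = 1; next }
        /^Individual virtual processors/ { in_sum = 0 }
        in_sum && NF == 5 && $2 ~ /^[0-9]+$/ {
            if ($1 == "total")
                grand = $5
            else {
                n++
                cls[n] = $1
                cpu[n] = $5
            }
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %.1f%%\n", cls[i], (grand > 0 ? 100 * cpu[i] / grand : 0)
        }'
}

# Typical use (requires a running instance):
#   onstat -g glo | vp_class_share
```

These figures are cumulative since the instance started, so compare two samples taken some time apart if you want a picture of current activity.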

The command onstat -g ioq helps you determine whether you need to add more AIO VPs and has output as shown in Listing 28.13.

LISTING 28.13

Output of the onstat -g ioq command.

AIO I/O queues:
q name/id  len  maxlen  totalops  dskread  dskwrite  dskcopy
adt  0     0    0       0         0        0         0
opt  0     0    0       0         0        0         0
msc  0     0    1       1423      0        0         0
aio  0     0    1       2         1        0         0
pio  0     0    1       215       0        215       0
lio  0     0    1       241       0        241       0

If this command shows an I/O queue that continues to grow, you might need to add more AIO VPs.
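A queue that "continues to grow" is easiest to catch by sampling the len column repeatedly. As a sketch only, the function below flags queues whose current length is above a limit; the name busy_io_queues and the default limit of 10 are illustrative assumptions, and the eight-field layout is taken from Listing 28.13.

```shell
# Hypothetical helper: list I/O queues whose current length exceeds a limit.
# Reads `onstat -g ioq` output on stdin; fields are assumed from Listing
# 28.13 (name, id, len, maxlen, totalops, dskread, dskwrite, dskcopy).
busy_io_queues() {
    limit=${1:-10}
    awk -v limit="$limit" '
        NF == 8 && $3 ~ /^[0-9]+$/ && $3 + 0 > limit + 0 {
            printf "%s/%s len=%d maxlen=%d\n", $1, $2, $3, $4
        }'
}

# Typical use (requires a running instance), repeated every few seconds:
#   onstat -g ioq | busy_io_queues 10
```

If the same queue shows up sample after sample, that is the signal to consider more AIO VPs.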

■ Monitoring Shared Memory

The amount of memory used by Informix can vary, depending on the activity of your system. Memory is initially allocated as the number of KB given by the onconfig parameter SHMVIRTSIZE, and Informix can dynamically allocate more in increments of SHMADD KB. Other onconfig values that affect shared memory size include BUFFERS, LOCKS, and LOGBUFF. Informix aborts if the amount of memory reaches the amount in SHMTOTAL. To see how much memory Informix has allocated, simply use the command onstat -, which quickly displays the total amount used by Informix. For example:

INFORMIX-OnLine Version 7.23.UC1 -- On-Line -- Up 2 days 13:58:15 -- 10336 Kbytes

In this example, Informix uses 10336 KB of shared memory.

For information on specific memory segments that were allocated, use the command onstat -g seg, which displays information as shown in Listing 28.14.


LISTING 28.14

Output of the onstat -g seg command.

$ onstat -g seg

Segment Summary:

(resident segments are not locked)

id key addr size ovhd class blkused blkfree

0 1387939841 a000000 2392064 848 R 288 4

1 1387939842 a248000 8192000 720 V 325 675

When Informix is initially brought online, the amount of memory indicated in SHMVIRTSIZE is allocated. If more memory segments are added, you see them in the output of onstat -g seg, with a class of V. Alternatively, you can watch the message log for messages that mention dynamically allocated new shared memory segment (size xx). To do so, view the file in a text editor or use onstat -m (to see the last 20 entries in the message log) as shown in Listing 28.15.

LISTING 28.15

Output of the onstat -m command.

Thu May 20 00:08:58 1999

02:10:08 Checkpoint Completed: duration was 7 seconds.

02:10:14 Logical Log 9197 Complete.


03:00:29 dynamically allocated new shared memory segment (size 89161728)

03:03:30 Level 0 Archive started on rootdbs, datadbs


You can also look through the whole message log by using a pattern-matching command such as grep on UNIX. For example:

grep -i "dynamically allocated" $INFORMIXDIR/prod.log
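If you want a running total rather than a list of matching lines, a short awk filter can add up the sizes. This is a sketch under the assumption that the message is worded exactly as in Listing 28.15; the function name dynamic_segment_total is hypothetical, and the size is reported in whatever units the log uses.

```shell
# Hypothetical helper: count and total the shared memory segments that the
# message log reports as dynamically added. Reads the log on stdin; the
# message wording is assumed from Listing 28.15.
dynamic_segment_total() {
    awk '
        /dynamically allocated new shared memory segment/ {
            size = $0
            sub(/.*\(size /, "", size)    # drop everything before the number
            sub(/\).*/, "", size)         # drop the closing parenthesis
            count++
            total += size
        }
        END { printf "%d segments, total size %.0f\n", count, total }'
}

# Typical use, with the log path from the grep example above:
#   dynamic_segment_total < $INFORMIXDIR/prod.log
```

Frequent dynamic additions suggest that SHMVIRTSIZE is set too low for your workload.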

Remember that in a DSS environment, you can also set the variable DS_TOTAL_MEMORY. This value is typically set to a certain percentage of SHMTOTAL. The percentage depends on how many DSS versus OLTP queries are performed. The DS_TOTAL_MEMORY usage is monitored through onstat -g mgm. Use of DSS is discussed in more detail in Chapter 4, “Understanding Informix Architecture.”

■ Monitoring User Activity

Performance problems can sometimes be entirely the result of what the users are doing. If random queries are allowed on a system or too many queries are taking place, system performance can be greatly reduced. To find out which users are using the Informix instance, use the command onstat -u, which displays the output shown in Listing 28.16.


LISTING 28.16

Output of the onstat -u command.

Userthreads

address flags sessid user tty wait tout locks nreads nwrites

a270010 ---P--D 0 root - 0 0 0 324 166

a270444 ---P--F 0 root - 0 0 0 0 0

a270878 ---P--B 4 root - 0 0 0 0 3

a270cac ---P--D 0 root - 0 0 0 0 0

a270cbd ---P--D 49 ron - 0 0 0 3224 1423

a270cbd ---P--D 50 bob - 0 0 0 7201 1244

6 active, 128 total, 38 maximum concurrent

A quick glance at this output can tell you who has the highest number of reads and writes. On systems with more users, however, it might be more difficult to monitor. The last line of the output (in our example, the line that begins with “6 active”) can help you decide whether the number of users is the problem. To see just the end of the listing, type onstat -u | tail -5.
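On a busy system, sorting the user threads by I/O saves you from scanning the whole list by eye. The following is a minimal sketch: top_io_sessions is a hypothetical name, and the field positions (sessid in field 3, user in field 4, nreads in field 9, nwrites in field 10) are assumed from Listing 28.16.

```shell
# Hypothetical helper: rank user threads by combined reads and writes.
# Reads `onstat -u` output on stdin; fields are assumed from Listing 28.16
# (sessid = 3, user = 4, nreads = 9, nwrites = 10).
# Output columns: total I/O, session ID, user name.
top_io_sessions() {
    count=${1:-5}
    awk 'NF == 10 && $9 ~ /^[0-9]+$/ && $10 ~ /^[0-9]+$/ {
             printf "%.0f %s %s\n", $9 + $10, $3, $4
         }' | sort -rn | head -n "$count"
}

# Typical use (requires a running instance):
#   onstat -u | top_io_sessions 5
```

The session IDs in the second column are the ones to follow up with onstat -g ses.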

To follow through on a particular query, you can use the session ID (sessid) that is displayed in onstat -u. Executing the command onstat -g ses sessid displays information about that particular session, including the query and more detailed information. For example, Listing 28.17 shows part of the output of the command onstat -g ses 49.

LISTING 28.17

Output of the onstat -g ses command.

session #RSAM total used

id user tty pid hostname threads memory memory

49 ron 4 9798 sparky 1 65536 55480

tid name rstcb flags curstk status

412 sqlexec a271514 Y--P--- 1872 cond wait(netnorm)

Memory pools count 1

name class addr totalsize freesize #allocfrag #freefrag

374 V a42a010 65536 10056 156 8

Sess SQL Current Iso Lock SQL ISAM F.E.

Id Stmt type Database Lvl Mode ERR ERR Vers

374 SELECT stores7 NL Not Wait 0 0 7.14

Current statement name : slctcur

Current SQL statement :

select * from customer

Last parsed SQL statement :

select * from customer


Lots of useful fields are in the output of this command, including the query, memory, user, and other information. You can obtain a smaller piece of the same information by using the command onstat -g sql sessid.

To get information for a particular thread within the session, issue the command onstat -g tpf tid, where tid is the value shown in the output of the above command. For example, the command onstat -g tpf 412 would display the output shown in Listing 28.18.

LISTING 28.18

Output of the onstat -g tpf command.

Thread profiles

tid lkreqs lkw dl to lgrs isrd iswr isrw isdl isct isrb lx bfr bfw lsus lsmx seq

412 87 0 0 0 0 33 0 0 0 0 0 0 130 0 0 0 0

You can also use onstat -g wai to find threads that are waiting to be processed by Informix as shown in Listing 28.19.

LISTING 28.19

Output of the onstat -g wai command.

Waiting threads:
tid   tcb       rstcb     prty  status            vp-class  name
2     c9b88158  0         2     sleeping forever  3lio      vp 0
3     c9b883c0  0         2     sleeping forever  4pio      vp 0
4     c9b88658  0         2     sleeping forever  5aio      vp 0
5     c9b888f0  0         2     sleeping forever  6msc      vp 0
6     c9b88b88  0         2     sleeping forever  7aio      vp 1
7     c9b88e20  0         2     sleeping forever  8aio      vp 2
8     c9ba0128  0         2     sleeping forever  9aio      vp 3
9     c9ba0488  c9ab7018  4     sleeping secs: 1  1cpu      main_loop()
12    c9bb8030  0         3     sleeping forever  1cpu      soctcplst
13    c9bb8908  0         3     sleeping forever  1cpu      sm_listen
14    c9bbba68  0         2     sleeping secs: 1  1cpu      sm_discon
15    c9bbbd00  c9ab74cc  2     sleeping forever  1cpu      flush_sub(0)
16    c9bbbf98  c9ab7980  2     sleeping forever  1cpu      flush_sub(1)
20    c9bd0090  c9ab8c50  2     sleeping secs: 4  1cpu      btclean
36    c9c03660  c9ab9a6c  4     sleeping secs: 1  1cpu      onmode_mon
2990  c9ce0d38  c9ac30ec  2     cond wait sm_read 1cpu      sqlexec
2992  cc788d58  c9abef14  2     cond wait sm_read 1cpu      sqlexec

Using onstat -g rea shows threads that are ready to be executed as shown in Listing 28.20.

LISTING 28.20

Output of the onstat -g rea command.

Ready threads:

tid tcb rstcb prty status vp-class name

2992 cc788d58 c9abef14 2 ready 1cpu sqlexec

2990 c9ce0d38 c9ac30ec 2 ready 1cpu sqlexec


■ Using Performance Monitoring Utilities

Another way to monitor the performance and health of your system is through monitoring tools. These tools are designed to monitor things like CPU usage, memory usage, and possible problems. Some tools provide support for watching performance within the database, including query monitoring, dbspace usage, and more. Choosing such an automated tool can save a lot of the manual steps that I mentioned previously and can greatly simplify the life of an administrator.

Many such tools exist, and more are becoming available. Some of the tools provided by Informix include Informix Enterprise Command Center (IECC) and onweb. IECC is a GUI-based tool that greatly simplifies administrative and monitoring operations. In addition, several third-party tools are available, like Compuware’s ECO Tools and BMC’s Patrol.

■ Using the sysmaster Database

The sysmaster database is created when you initialize disk space for Informix. It contains real tables and pseudo tables. The pseudo tables contain the information in shared memory and provide a good deal of the information provided by the on commands. Using sysmaster allows you to build your own queries and programs that do monitoring. For example, you can create a program that monitors much of the information provided by the onstat -p (system profile) command. This lets you create your own alarm conditions and process them accordingly. For a full description of how to use sysmaster, see Chapter 32, “SMI and the sysmaster Database.”

Watching the Operating System and Network

Of course, performance problems can occur on many different levels. If you are experiencing extremely slow keyboard response time, for example, the problem could reside in the network. Likewise, if your Informix instance is starting to have problems accessing some of the disk drives, you might be having hardware errors. Finally, if the system is strapped for memory or CPU, you might look to other things that are running on the operating system. Remember that Informix is sharing disk, memory, and CPU with the operating system, which needs its own resources.

Table 28.2 shows a summary of some of the most common operating system commands.

Table 28.2 ■ Operating system commands

Command Description

vmstat Provides many virtual memory statistics.

iostat Displays I/O information.

sar Displays information on all areas of the system, including CPU and memory. You can use sar in many different ways, but it must be enabled by your system administrator.

top or monitor Shows an easy-to-read representation of system activity.

glance HP tool that provides a lot of useful information.

netstat Shows information about the current TCP/IP network connections.

df, dfspace, or bdf Displays current disk space used by operating system files.

ps Gives information about specific processes.


These commands can give you information that you can use in conjunction with the information you get from Informix. Between the two, you should have a good idea whether the problem is an Informix configuration issue or just a lack of resources (like CPU). These commands are described in more detail in Chapter 25, “Working with the Operating System.”

Many different versions of UNIX exist, and each version has its own set of commands. Keep in mind that not all the commands are available on all UNIX systems, and where they are, the control arguments and syntax might differ. Be sure to use the UNIX man command to get complete details about what your system offers. If the commands aren’t available, find out which utilities your system offers.

Most operating systems have their own log files. These files can show errors that could later affect Informix (I/O errors, for example), so they should also be watched regularly. Again, UNIX systems handle logging differently. A common place to look for log files, though, is in the /var/adm or /usr/adm directories.

Don’t forget that user applications might be causing the problems. By following some of the previously described methods to monitor user sessions, you can trace the original SQL that was being executed by the user. If you find users who are running certain kinds of queries (sequential scans, for example), you might want to suggest indexes or changes to their applications.

Creating Long-Term Stability

Some ongoing maintenance commands need to be performed, but not as often as many of the commands I have already discussed. These commands might best be executed in a batch job that runs on a regular schedule. This section describes some of these commands.

■ Updating Statistics

The statistics for your databases help the Informix query optimizer work most effectively. Statistics tell Informix what kind of data is in the database and what its approximate values are. This information helps the query optimizer find the best way to do queries. Some common UPDATE STATISTICS commands are shown in Listing 28.21.

LISTING 28.21

Examples of UPDATE STATISTICS commands.

UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY;
UPDATE STATISTICS LOW FOR TABLE customer;
UPDATE STATISTICS HIGH FOR TABLE customer(cust_nbr);

The strategy for updating your statistics must be chosen carefully. Chapter 31, “Application Tuning,” and the Performance Guide for Informix Dynamic Server explain the proper strategy.


■ Checking Your Tables and Indexes

Sometimes Informix table and index data can become damaged. If the damaged data is not accessed, no one may be aware of the problem until more damage has occurred. Neither situation is good. One way to catch this damage early is to regularly check your tables and indexes by issuing the following commands:

oncheck -cDn -- checks data pages and answers no to questions
oncheck -cIn -- checks index pages and answers no to questions

Note the n supplied for every command. This instructs oncheck to answer no to any questions about trying to fix the data if an error is found. If you find that you need to correct errors, you can run oncheck again and answer manually, or run the command with y instead of n at the end. Before using the y, review what oncheck attempts to correct. Also note that version 7.30 of IDS added the -w option, which minimizes table locking during certain oncheck commands; see Chapter 16, “IDS 7.30 Feature Enhancements,” for more details.

In some cases, you may have difficulty using oncheck to fix the problems, or you may not be able to use it at all for that purpose. If the problem is an index, you might be able to drop and re-create the index. If it is data pages, you can try to unload the data but might not be able to unload all your data.

■ Checking System Information: Reserved Pages and System Catalogs

Each Informix instance has critical information that is contained in its “reserved pages.” These pages are a roadmap to the data in the Informix instance, including information about each chunk, checkpoints, archives, general instance configuration (should match onconfig), and other statistics. If the reserved pages are damaged, Informix can have serious problems. To validate and possibly correct problems, use the command oncheck -cr. To just display the reserved pages, use oncheck -pr.

Each database has a set of system catalogs that contains information about its tables, indexes, and other items in the database. These catalogs are tables that begin with sys. These catalogs must have the proper information; if damaged, they can make an entire database inaccessible. To check the catalogs, run the command oncheck -cc.

■ Reviewing Table Structure—Extents and Otherwise

Over a period of time, the data in tables can be spread over multiple disk drives and/or mixed with data from many other tables. The way to avoid this situation is to create the proper extent sizes, which set the disk space dedicated to certain tables. Extent planning and management are discussed in detail in Chapter 8, “Creating Databases and Tables.”

Tables with too many extents can have a negative impact on Informix response times, because data is scattered in many different parts of the disk. A good ballpark maximum number of extents is 10. You can prevent this problem by creating extents that are large enough to hold your data. The extent size can be set when the table is created or changed later, with the CREATE TABLE or ALTER TABLE statements. To monitor the number of extents, try running the command oncheck -pe, which displays output similar to Listing 28.22.


LISTING 28.22

Output of the oncheck -pe command.

[el] Disk usage for Chunk 1 Start Length

------------------------------------------- --------- ---------

stores7:customer 1000 13

stores7:item 1250 12

stores7:customer 3000 2500

The top of the listing shows which chunk is being displayed (Chunk 1); the far left column shows the table being stored in the disk space (customer and item); the Start column shows the offset, in pages, from the beginning of the chunk to this piece of storage (1000, etc.); and the Length column shows the length of the disk space in pages.

If you see a table name that appears many times in this output, its disk space is likely very scattered (i.e., fragmented) on the disk. For example, in the previous listing, notice that the customer table is listed twice, in non-contiguous disk space. If you look at the complete output of oncheck -pe and find a table like customer shown frequently, now is probably a good time to re-create it. You might consider re-creating the table by backing it up, dropping it, re-creating it without indexes and with the proper initial and next extent sizes, reloading the data into the new version of the table, and re-creating the indexes.
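Counting those repeated names by hand gets tedious on a large instance, so here is a rough sketch that does the counting for you. It assumes each extent line has the three-field layout of Listing 28.22 (database:table, start, length); the function name extents_per_table is hypothetical.

```shell
# Hypothetical helper: count how many separate pieces of disk space each
# table occupies. Reads `oncheck -pe` output on stdin; the line layout
# (database:table, start, length) is assumed from Listing 28.22.
# Output columns: number of pieces, table name, total pages.
extents_per_table() {
    awk 'NF == 3 && $1 ~ /:/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
             pieces[$1]++
             pages[$1] += $3
         }
         END {
             for (t in pieces)
                 printf "%d %s %.0f\n", pieces[t], t, pages[t]
         }' | sort -k1,1nr -k2,2
}

# Typical use (requires a running instance):
#   oncheck -pe | extents_per_table | head
```

Tables at the top of this list, especially any with more than the ballpark of 10 pieces, are the first candidates for re-creation.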

The oncheck -pt command shows summarized information for all the tables. This information can tell you how many extents each table has, as shown in Listing 28.23.

LISTING 28.23

Output of the oncheck -pt command.

TBLspace sysmaster:informix.syscolumns
Physical Address            100011
Creation date               07/01/97 12:15:13
TBLspace Flags              2    Row Locking
Maximum row size            48
Number of special columns   0
Number of keys              1
Number of extents           7
Current serial value        1
First extent size           8
Next extent size            8
Number of pages allocated   64
Number of pages used        61
Number of data pages        34
Number of rows              1700
Partition partnum           1048580

Note that Number of extents is 7, still an acceptable number.

The last way to find out about all the extent sizes is to query the sysmaster database. This database is described in detail in Chapter 32, “SMI and the sysmaster Database.” Listing 28.24 shows a query you can execute against the sysmaster database to display the number of extents and size of tables.


LISTING 28.24

Query to display number of extents and size of tables.

DATABASE sysmaster;

SELECT dbsname,
       tabname,
       count(*) num_of_extents,
       sum( pe_size ) total_size
FROM systabnames, sysptnext
WHERE partnum = pe_partnum
GROUP BY 1, 2
ORDER BY 3 DESC, 4 DESC;

Correcting and Troubleshooting Problems

You’ve now learned numerous ways to monitor your servers, watch performance, and prevent problems. But sometimes trouble is going to come no matter what you do. It’s inevitable; things go wrong. Here I discuss how to address and correct some common problems.

Listing 28.25 shows what appears in a message log for a typical assert failure.

LISTING 28.25

Example of a serious error displayed in a message log.

10:51:10 Assert Failed: Page Check Error in btcurrent:bad current node

10:51:10 Who: Session(131, ron, 23994, 554376032) Thread(1371, sqlexec, 21090fec, 4)

10:51:10 Results: Possible inconsistencies in 'xx01abcd:"informix".customers'

10:51:10 Action: Run 'oncheck -cDI 6449916'

10:51:10 See Also: /DUMPDIR/af.56cb0d, gcore.56cb0d.0, /DUMPDIR/prob/core

10:51:10 Stack for thread: 1402 sqlexec

It doesn’t look promising, does it? Many errors in the message log might cause Informix to immediately go offline, possibly causing damage to your data and databases. Possible problems include:

• Hardware problems like disk and CPU

• Internal Informix errors—corrupted pages and the like

• Memory errors

• Operating system crashes

• Errant user programs

As an administrator, you need to be prepared to handle any of these errors. Errors occur and must be properly handled. Rule number one of an administrator is: Don’t panic.

A good number of system crashes cause no damage to the data and are recovered through Informix’s powerful recovery system.


■ Checking the Message Log

When your Informix instance goes offline, first check the message log for what happened. The error in Listing 28.25 demonstrates a message you receive during some system problems. If your instance goes offline, you need to view the message log using a text editor, because you won’t be able to view it using the onstat -m command. For example: vi /usr/informix/prod.log.

The different types of messages that may appear in the message log are listed in the Administrator’s Guide for Informix Dynamic Server.

As explained in the earlier section, “Watching the Message Log,” many details about the error are included in the message log. For example, the mention of an I/O error is a good indication of a problem with hardware.
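Two of the most useful details in an assertion failure are the session and thread IDs on the “Who:” line, because other onstat commands take them as arguments. The sketch below pulls them out of a log; the function name assert_failure_ids is hypothetical, and the Session(...) and Thread(...) wording is assumed from Listing 28.25. It copes with the two items appearing on one line or on consecutive lines.

```shell
# Hypothetical helper: pull the session and thread IDs out of the "Who:"
# entry of an assertion failure. Reads the message log on stdin; the
# Session(...) / Thread(...) format is assumed from Listing 28.25.
assert_failure_ids() {
    awk '
        /Session\(/ {
            s = $0
            sub(/.*Session\(/, "", s)    # keep text after "Session("
            sub(/[^0-9].*/, "", s)       # keep only the leading number
            sess = s
        }
        /Thread\(/ {
            t = $0
            sub(/.*Thread\(/, "", t)
            sub(/[^0-9].*/, "", t)
            if (sess != "") {
                print "session=" sess " thread=" t
                sess = ""
            }
        }'
}

# Typical use, with the log path from the vi example above:
#   assert_failure_ids < /usr/informix/prod.log
```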

■ Handling Assertion Failures

A common message for Informix crashes indicates an “assertion failure.” An assertion failure is simply a message from Informix stating that it could not perform a necessary operation and needed to shut down. These errors often require a simple shutdown and startup of the instance, but must still be investigated.

Starting with version 7.30, Informix attempts to capture assertion failures and kill just that Informix session, automatically saving the necessary debug information. This approach usually prevents Informix from crashing during assertion failures. The failure, of course, is documented in the message log for that Informix instance. An assertion failure almost always creates one or more files that give complete information about the error. The most common file is af.nnn, where nnn is a unique hex number. Other possible files are a shared memory file (shm.nnn), a core dump of VP processes (gcore.nnn), and a regular core dump. The data in these files might mean nothing to you, but it is very important to Informix support. Check your Informix documentation for dump files specific to your Informix release.

You can configure what happens during an assertion failure by setting certain parameters in the onconfig file. These parameters all begin with the word DUMP and include DUMPCORE, DUMPSHMEM, DUMPDIR, and DUMPCNT. Setting these parameters helps trap more information during such a failure.
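Before you need them, it is worth checking how these parameters are currently set. The following sketch simply lists every onconfig entry whose name begins with DUMP; the function name show_dump_params is hypothetical, and the file location in the usage comment assumes the usual $INFORMIXDIR/etc/$ONCONFIG convention.

```shell
# Hypothetical helper: print the DUMP* parameters set in an onconfig file.
# Takes the file path as its argument. Lines whose first word does not
# start with DUMP (including # comments) are ignored.
show_dump_params() {
    awk '$1 ~ /^DUMP/ { print $1, $2 }' "$1"
}

# Typical use, assuming the usual onconfig location:
#   show_dump_params "$INFORMIXDIR/etc/$ONCONFIG"
```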

If you set the environment variable $AFDEBUG before you initialize Informix, the engine suspends processing, rather than crashing, during many errors. This variable is useful because it allows you to run certain diagnostic commands like oncheck and onstat before bringing the engine down. Here is an example of how to set this when initializing Informix:

AFDEBUG=1; export AFDEBUG
oninit

Because of the improvements for assertion failure handling in 7.3x, setting $AFDEBUG is not really necessary; Informix continues running and captures errors anyway. In the rare cases when an assertion failure crashes Informix, however, $AFDEBUG still works to suspend Informix processing. This action allows you to capture error information before bringing down the Informix engine.

—Much of the information about assertion failures was contributed by Stefanie Vario and Mark Stock


As discussed previously and displayed in Listing 28.25, Informix gives a lot of information during an assertion failure, often including a possible solution to the problem (Action: Run 'oncheck -cDI 6449916'). Be sure to carefully analyze this information to find out what caused the problem, and if appropriate, try the corrective action. When an assertion failure occurs, you should:

1. Check the log file to find the process that has caused the assertion failure. Note that in Listing 28.25, the line

Who: Session(131, ron, 23994, 554376032) Thread(1371, sqlexec, 21090fec, 4)

shows that session 131 and thread 1371 caused the error.

2. If the instance is suspended or still online, you can run certain commands to gather more information (in 7.3x, the engine should remain online). The output of these commands should be saved in a directory created for this purpose. You can use these files for your own history and for Informix technical support. Some of the recommended commands include:

onstat -u
onstat -g ses session_id_that_caused
onstat -g stk thread_id_within_session
onstat -g sts
onstat -g glo
onstat -g seg
onstat -g mem

3. Look through the history of your system crashes for a similar error. If found, note what caused it and how it was resolved.

If a certain assertion failure occurs often, a good chance exists that it is due to a bug in the version of Informix you are using. It can also mean that something in your applications is triggering the error. If so, try calling Informix technical support or obtaining an upgraded version of your Informix software.

4. Try to determine the cause of the error. If you are unable to determine the cause, you might want to call Informix technical support at this time. If your engine is suspended (via $AFDEBUG or otherwise), Informix can run different onstat and oncheck commands.

5. Check to see whether the Informix instance is still online by running onstat - from the UNIX command line. If the status indicates On-Line, your instance is still running; if not, perform the next three steps.

6. Bring the engine down by using onmode -kuy. Do keep in mind, however, that this will kill all user processes (though they may be hung already). Sometimes this does not bring Informix offline. If it doesn’t, you have to remove all shared memory segments and semaphores associated with this instance (probably with ipcrm). Just be sure not to remove any that belong to other Informix instances. After that, you can use kill to eliminate all oninit processes associated with this instance.

Please remember that a kill is a last resort. A kill -15 (usually the default signal) is a lot more graceful than a kill -9, and killing the master daemon is usually enough to remove all oninit processes.

7. Attempt to bring Informix back online with your usual method (including oninit). If Informix goes right back into error mode, it’s probably time to call Informix technical support.


8. If the message log suggested a corrective action (oncheck -cDI, for example), try running that command before allowing the users online.

In most cases, the previous three steps are sufficient to bring Informix back online and you don’t have to take the process any further. If they are not sufficient, you need to apply some real administrator’s skills (or call Informix!).
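The evidence-gathering commands listed in step 2 are easy to forget under pressure, so you might keep them in a small script. This is a sketch, not a supported tool: the function name collect_af_diagnostics, the output file names, and the ONSTAT override variable are all my own inventions. The onstat options themselves are the ones recommended in step 2.

```shell
# Hypothetical helper: save the diagnostic output recommended in step 2.
# Arguments: session ID, thread ID, output directory.
# ONSTAT may be set to another command name to rehearse the procedure
# without a running instance; it defaults to onstat.
collect_af_diagnostics() {
    sessid=$1
    tid=$2
    outdir=$3
    onstat_cmd=${ONSTAT:-onstat}

    mkdir -p "$outdir" || return 1
    "$onstat_cmd" -u               > "$outdir/onstat_u.out"     2>&1
    "$onstat_cmd" -g ses "$sessid" > "$outdir/onstat_g_ses.out" 2>&1
    "$onstat_cmd" -g stk "$tid"    > "$outdir/onstat_g_stk.out" 2>&1
    for opt in sts glo seg mem; do
        "$onstat_cmd" -g "$opt" > "$outdir/onstat_g_$opt.out" 2>&1
    done
}

# Typical use, with the IDs from Listing 28.25 and an example directory:
#   collect_af_diagnostics 131 1371 /tmp/af_131
```

Keeping one directory per incident gives you the crash history that step 3 asks you to consult.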

■ Using What You Know

After a while, Informix administrators develop a sixth sense about problems. Through experience, you can either recognize errors or intuitively know how to fix them. These skills are hard to teach in a book; they come naturally. But they are a way of thinking.

Really consider what happened and what could have caused it. Logically think through the messages and conditions of the system and try to eliminate the obvious. Consider what has happened with your system in the past. Run various Informix commands to try to trace problems. Remember that some crashes are caused by problems with the current Informix release and your applications. Again, don’t panic, and logically walk through the problem.

Following is an interesting story from Clem Akins, former Informix Technical Support Engineer. Notice how Clem did follow the logical steps but had to really improvise to find the problem.

The error was from an Assert Fail due to a memory Segmentation Violation. The engine was given an address that was out of bounds, and it crashed. After one week of all-day-long on-site effort, we proved that the fault was with a part of the hard drive that was used for temp storage. Informix would write one address there and retrieve another. The internal bounds checking was good enough to find the fault and bail out before corrupting any data. Because the customer was a financial institution, they appreciated that, even after the hard time they gave us at first. We ran oncheck as we were supposed to do, but found no errors. We examined the stack trace, and compared it to the optimizer source code that was running when it crashed. We sent the core dumps and .af files to Informix Advanced Support for further analysis. We reproduced the error using the stores7 database (though we had to make some of the tables several times bigger to see the error). Advanced Support finally built a machine exactly like the customer’s, ran the exact query on the same data, and had no errors. Given that, we gave some heavy thought and conducted long, thorough analysis and found the bad disk.

That story demonstrates a time when the obvious didn’t solve the problem, and it took real improvisation. If something just doesn’t make sense, start thinking about other things that could have caused the problem.

■ Calling Informix Technical Support

Informix provides excellent technical support. In some instances, either you have exhausted all your ideas or you can’t fix the problem. At that time, you should call technical support at 800-274-8184. Be sure to have your information together before you call (including serial numbers, exact error messages, etc.). Remember that you generally need to have OpenLine support to resolve problems. Again, take a look at the views of Clem Akins, long-time Informix support engineer.

FINDING AN ERROR THE LONG AND HARD WAY


The most important thing to do is to take the view of the support engineer. This viewincludes things like:

Can I reproduce the problem at will?

Have I thought about why this problem occurs?

What are the possible causes?

Can I isolate them and test them individually?

What has changed that caused the problem? (System, network, database, application changes?)

Is the test case the simplest possible that reproduces the problem? Does it reproduce on thestores7 database?

Does it reproduce with only simple SQL instead of requiring complex custom code or appli-cations or tables?

Do I have a dial-up agreement in place with Informix Support? (Gee, I wish I had thought ofthat during business hours.) What is my support contract number (or product serial number)and what kind of support do I have? Is all of my support information ready to provide toInformix?

When I talk to the support engineer, will I sound like an IT professional who has taken all thereasonable steps to solve the problem? Even better, like a person who also has to support codeand who thinks about the poor soul who is trying to help me from the other end of a phone?

Determining the answers to these questions saves time for both you and the engineer. And youjust might solve the problem when answering the questions!

■ Restoring from a Backup

Certain errors are just plain unrecoverable. In these times, you might see the dreaded “restore from an archive” error. At this point, the administrator looks like a hero if he has implemented a strong backup and recovery plan (of course we all do!).

Sometimes an Informix error message suggests, “Restore from a backup.” This statement is not always true! Many times, it is just a matter of the Informix engine being temporarily confused about what is happening. Even when the “restore from a backup” message is displayed, I strongly suggest going through all steps to attempt to get the instance back online. If those steps do not work, try calling technical support before doing the restore, which could cause a great loss of data.

Two kinds of restores exist: cold and warm. The type you choose depends on your error. Refer to the Informix documentation for the backup strategy you are using (onbar, ontape, onarchive, and so forth) or see Chapter 27, “Planning and Using Informix Backups.”

HOW TO PREPARE FOR CALLING TECHNICAL SUPPORT


The amount of data you restore depends on your levels of backup. Generally speaking, the process is to:

1. Restore your last level-0 backup.

2. Restore the most recent level-1 and level-2 backups since the level-0, if any.

3. Restore all proper logical logs, if any.

If you are not using logging, you can go only as far as step 2. The point of restoration depends on your backup schedule. If you need to restore to a point as close as possible to the time of the failure, you need to have implemented logged databases and properly backed up all the logical logs.
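The three steps above can be sketched in a few lines of Python. This is only an illustration of the ordering logic, not how onbar, ontape, or onarchive work internally: the Backup record, the restore_chain and logs_to_replay functions, and the integer timestamps are all hypothetical names invented for this example. The log selection in particular is simplified, because the real restore utility decides which logical logs it needs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    """One archive in a hypothetical backup history (not an Informix structure)."""
    level: int     # archive level: 0, 1, or 2
    taken_at: int  # any increasing timestamp, for example epoch seconds


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the archives to restore, in order: level-0, then level-1, then level-2."""
    chain: List[Backup] = []
    after = None  # timestamp of the archive most recently added to the chain
    for level in (0, 1, 2):
        candidates = [
            b for b in backups
            if b.level == level and (after is None or b.taken_at > after)
        ]
        if not candidates:
            if level == 0:
                raise ValueError("no level-0 backup available")
            continue  # step 2 says "if any": a missing level-1 or level-2 is fine
        newest = max(candidates, key=lambda b: b.taken_at)
        chain.append(newest)
        after = newest.taken_at
    return chain


def logs_to_replay(chain: List[Backup], log_backup_times: List[int],
                   logging_enabled: bool = True) -> List[int]:
    """Return logical-log backups taken after the last archive, oldest first."""
    if not logging_enabled:
        return []  # without logging you can go only as far as step 2
    start = chain[-1].taken_at
    return sorted(t for t in log_backup_times if t > start)


if __name__ == "__main__":
    history = [Backup(0, 10), Backup(1, 20), Backup(0, 30),
               Backup(2, 35), Backup(1, 40), Backup(2, 55)]
    chain = restore_chain(history)
    print([(b.level, b.taken_at) for b in chain])
    print(logs_to_replay(chain, [5, 45, 58, 60]))
```

Note that the level-2 archive taken at time 35 is skipped in this sample history: a later level-1 archive (time 40) already covers it, so only a level-2 taken after that level-1 is worth restoring.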

FOR MORE INFORMATION . . .

This chapter described the ongoing process of maintaining your Informix servers. You learned some of the basic commands involved with setting up a server and how to prepare for the future. I showed you several commands that can be used on an ongoing basis to watch your servers for possible problems or bottlenecks. Finally, I explained how to contact Informix support during down times. For more information related to this chapter, please note the following:

• For an explanation of many of the basic concepts of Informix (as well as detailed information about many of the things discussed in this chapter), see Chapter 4, “Understanding Informix Architecture.”

• To see how to perform the setup involved with an IDS server, see Chapter 21, “Setting Up UNIX Database Servers.”

• For instructions on how to work with the operating system, see Chapter 25, “Working with the Operating System.”

• For descriptions of the various on commands, see Chapter 26, “Administration Utilities (the on* commands).”

• For an overview of the tables in sysmaster, see Chapter 32, “SMI and the Sysmaster Database.”

• To get tuning tips for administrators, see Chapter 33, “System Tuning.”

INFORMIX AND OTHER REFERENCES

• For information about configuring an Informix instance and monitoring commands, see the Administrator’s Guide for Informix Dynamic Server.

• For detailed information about backing up and restoring your data, see the Archive and Backup Guide and the Backup and Restore Guide.

• To get information about using Informix’s Enterprise Command Center, see the Informix Enterprise Command Center User Guide.

• For recommendations on tuning your Informix database server instances, see the Performance Guide for Informix Dynamic Server.
