
Rhino Troubleshooting Guide

TAS-007-Issue 2.5.0-Release 1

September 2018


Notices

Copyright © 2017 Metaswitch Networks. All rights reserved.

This manual is issued on a controlled basis to a specific person on the understanding that no part of the Metaswitch Networks product code or documentation (including this manual) will be copied or distributed without prior agreement in writing from Metaswitch Networks.

Metaswitch Networks reserves the right to, without notice, modify or revise all or part of this document and/or change product features or specifications and shall not be responsible for any loss, cost, or damage, including consequential damage, caused by reliance on these materials.

Metaswitch and the Metaswitch logo are trademarks of Metaswitch Networks. Other brands and products referenced herein are the trademarks or registered trademarks of their respective holders.

Contents

1 Rhino Troubleshooting Guide
2 Common Symptoms
  2.1 Rhino watchdog
    2.1.1 Stuck Threads watchdog
      Symptoms
    2.1.2 Timewarp watchdog
      Long Garbage Collection (GC) pause
      System clock changes
      Virtual Machine pauses
      Excessive system IO waits
  2.2 Long-running transaction warnings
    2.2.1 Symptoms
    2.2.2 Diagnostic steps and correction
  2.3 Lock timeouts
    2.3.1 Symptoms
    2.3.2 Diagnostic steps and correction
  2.4 I/O exceptions
    2.4.1 I/O exceptions from Savanna in Rhino logs
    2.4.2 Symptoms
    2.4.3 Diagnostic steps and correction
    2.4.4 I/O exceptions from Rhino’s logging infrastructure in Rhino logs
    2.4.5 Symptoms
    2.4.6 Workaround or resolution
    2.4.7 I/O exceptions from Rhino’s core in Rhino logs
    2.4.8 Symptoms
    2.4.9 Workaround or resolution
    2.4.10 I/O exceptions from resource adaptors or SBBs in Rhino logs
  2.5 Inactive RA provider
    2.5.1 Symptoms
    2.5.2 Resolution
  2.6 Leaked OCIO buffers
    2.6.1 Symptoms
    2.6.2 Diagnostic steps
    2.6.3 Profile Management and/or provisioning failing
      Symptoms
      Diagnostic steps
      Resolution
  2.7 License over capacity alarms
  2.8 Rhino logs containing exceptions
    2.8.1 Resource Adaptors throwing Exceptions
3 Environmental
  3.1 Operating environment issues
    3.1.1 Symptoms
    3.1.2 Diagnostic steps
    3.1.3 Workaround or resolution
  3.2 Java Virtual Machine heap issues
    3.2.1 Symptoms
    3.2.2 Diagnostic steps
    3.2.3 Workaround or resolution
  3.3 Application or resource adaptor heap issues
    3.3.1 Symptoms
    3.3.2 Diagnostic steps
    3.3.3 Workaround or resolution
  3.4 Rhino start-up fails with 'java.io.IOException: Not Enough Space'
    3.4.1 Symptoms
    3.4.2 Diagnostic steps
    3.4.3 Workaround or resolution
  3.5 Warning about UDP buffer sizes
    3.5.1 Symptoms
    3.5.2 Resolution
  3.6 Java Virtual Machine error
    3.6.1 Symptoms
    3.6.2 Diagnostic steps
    3.6.3 Workaround or resolution
  3.7 Multicast traffic is using the wrong network interface
    3.7.1 Symptoms
    3.7.2 Diagnostic steps
    3.7.3 Workaround or resolution
4 Clustering
  4.1 Node failure
    4.1.1 Configuration change messages appear in Rhino log output
      Symptoms
      Diagnosis and Resolution
    4.1.2 An alarm indicates a node has failed
      Symptoms
      Diagnosis and Resolution
    4.1.3 A Rhino node exits the JVM
      Symptoms
    4.1.4 Out of memory errors
      Symptom
      Diagnosis and Resolution
    4.1.5 JVM errors
    4.1.6 Watchdog timeouts
      Symptoms
      Diagnosis and Resolution
  4.2 Cluster segmentation
    4.2.1 Symptoms
    4.2.2 Diagnostic steps
    4.2.3 Workaround or resolution
  4.3 Cluster failing to start
    4.3.1 No Primary Component
      Symptoms
      Resolution
    4.3.2 Multicast Configuration Error
      Symptoms
  4.4 Cluster starts but stops after a few minutes
    4.4.1 Symptoms
    4.4.2 Diagnostic steps
    4.4.3 Workaround or resolution
  4.5 Rhino SLEE fails to start cluster groups
    4.5.1 Symptoms
    4.5.2 Diagnostic steps
    4.5.3 Workaround or resolution
  4.6 Group heartbeat timeout
    4.6.1 Symptoms
  4.7 Scattercast endpoints out of sync
    4.7.1 Symptoms
5 Performance
  5.1 High Latency
    5.1.1 Symptoms
    5.1.2 Diagnostic steps
      Low number of available threads in Rhino statistics output
      High staging queue size in Rhino statistics output
      Dropped staging items
      High event processing time
    5.1.3 Workaround or Resolution
  5.2 Dropped Calls
    5.2.1 Symptoms
    5.2.2 Diagnostic steps
    5.2.3 Rhino logs containing exceptions
      Resource Adaptors throwing Exceptions
    5.2.4 Lock timeout messages in Rhino logs and/or console
    5.2.5 Rate limiting
    5.2.6 A dependent external system is not functioning properly
  5.3 A newly started node is unable to handle full traffic load
    5.3.1 Symptoms
    5.3.2 Workaround or Resolution
  5.4 Uneven CPU load/memory usage across cluster nodes
    5.4.1 Symptoms
    5.4.2 Diagnostic steps and correction
6 Configuration Problems
  6.1 Security Related Exceptions
    6.1.1 Various connection related exceptions
      Symptoms
      Resolution
    6.1.2 Various permission related exceptions
    6.1.3 Symptoms
    6.1.4 Diagnosis and Resolution
  6.2 Memory Database Full
    6.2.1 Profile Management and/or provisioning failing
      Symptoms
      Resolution and monitoring
    6.2.2 Deployment failing
      Symptoms
      Resolution and monitoring
    6.2.3 Calls not being set up successfully
      Resolution and monitoring
    6.2.4 Resizing MemDB Instances
  6.3 Resource Adaptors refuse to connect using TCP/IP
    6.3.1 Diagnostic steps
    6.3.2 Workaround or Resolution
  6.4 Local hostname not resolved properly
    6.4.1 Symptoms
    6.4.2 Diagnostic steps
    6.4.3 Workaround or Resolution
7 Management
  7.1 Connections Refused for the Command Console, Deployment Script or Rhino Element Manager
    7.1.1 Symptoms
    7.1.2 Diagnostic steps and correction
      Rhino is not listening for management connections
      Rhino refuses connections
      Management client is not configured to connect to the Rhino host
  7.2 A Management Client Hangs
    7.2.1 Symptoms
    7.2.2 Workaround or Resolution
  7.3 Statistics client reports “Full thread sample containers”
    7.3.1 Symptoms
  7.4 Statistics Client Out of Memory
    7.4.1 Symptoms
    7.4.2 Workaround or Resolution
  7.5 Creating a SyslogAppender gives an AccessControlException
    7.5.1 Symptoms
    7.5.2 Workaround or Resolution
  7.6 Platform Alarms
    7.6.1 Symptoms
    7.6.2 Diagnostic steps
    7.6.3 Workaround or Resolution
  7.7 DeploymentException when trying to deploy a component
    7.7.1 Symptoms
    7.7.2 Diagnostic steps
    7.7.3 Workaround or Resolution
  7.8 Deploying to multiple nodes in parallel fails
    7.8.1 Symptoms
    7.8.2 Diagnostic steps
    7.8.3 Workaround or Resolution
  7.9 Management of multiple Rhino instances
    7.9.1 Symptoms
    7.9.2 Workaround or Resolution
  7.10 Deployment problem on exceeding DB size
    7.10.1 Symptoms
  7.11 Diagnostic steps
  7.12 BUILD FAILED when installing an OpenCloud product
    7.12.1 Symptoms
    7.12.2 Diagnostic steps
    7.12.3 Workaround or Resolution
  7.13 REM connection failure during management operations
    7.13.1 Symptoms
    7.13.2 Diagnostic steps
    7.13.3 Workaround or Resolution
  7.14 Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
    7.14.1 Symptoms
    7.14.2 Workaround or Solution
  7.15 Unused log keys configured in Rhino
    7.15.1 Symptoms
    7.15.2 Workaround or Resolution
  7.16 Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT
    7.16.1 Symptoms
    7.16.2 Diagnostic steps
    7.16.3 Workaround or resolution
  7.17 Log level for trace appender not logging
    7.17.1 Symptoms
    7.17.2 Workaround or Resolution
  7.18 Access to REM fails with Command CHECK_CONNECTION invoked without connection ID
    7.18.1 Symptoms
    7.18.2 Workaround or Resolution
8 Database
  8.1 Management Database Server Failure
    8.1.1 Symptoms
    8.1.2 Resolution and Mitigation
9 Signalware
  9.1 CGIN RA to Signalware Backend Connection Errors
    9.1.1 Symptoms
    9.1.2 Diagnostic steps
    9.1.3 Workaround or Resolution
  9.2 CGIN RA Cannot Create Outgoing Dialogs
    9.2.1 Symptoms
    9.2.2 Diagnostic steps
    9.2.3 Workaround or Resolution
  9.3 CGIN RA Cannot Receive Incoming Dialogs
    9.3.1 Symptoms
    9.3.2 Diagnostic steps
    9.3.3 Workaround or Resolution
  9.4 Problems with Signalware not involving the CGIN backends
10 Exit Codes
  10.1 Rhino Exit Codes
  10.2 JVM Exit Codes
    10.2.1 Internal JVM Exit Codes
    10.2.2 Signal Handler Exit Codes
  10.3 Other exit codes


1 Rhino Troubleshooting Guide

This document contains troubleshooting help for the Rhino TAS and other OpenCloud products.

Section                    Troubleshooting for …

Environmental              environmental issues with Rhino.
Clustering                 clustering issues with Rhino.
Performance                Rhino performance issues.
Configuration Problems     Rhino configuration.
Management                 Rhino management tools and utilities.
Common Symptoms            common general issues with Rhino.
Database                   the Rhino management database persistent store.
Signalware                 using Signalware with Rhino’s CGIN Resource Adaptor.
Exit Codes                 Rhino exit codes.


2 Common Symptoms

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for common general issues with Rhino.

2.1 Rhino watchdog

The watchdog is a thread in Rhino that monitors the system clock and threads in the SLEE for strange behaviour. There are two types of monitored threads:

• DeliveryThreads — internal threads inside the SLEE

• StageWorkers — threads responsible for executing deployed service logic.

The watchdog thread watches for the following strange behaviour:

• threads becoming “stuck” (see Stuck Threads watchdog, section 2.1.1)

• threads dying (see Stuck Threads watchdog, section 2.1.1)

• the system clock suddenly jumping forwards from the point of view of the JVM (see Timewarp watchdog, section 2.1.2)

• the system clock suddenly jumping backwards from the point of view of the JVM (see Timewarp watchdog, section 2.1.2)

2.1.1 Stuck Threads watchdog

The stuck threads watchdog behaviour handles stuck or dying threads. Under some circumstances, as discussed below, the node will leave the cluster and exit. If configured for restarting, the node will restart, rejoin the cluster, and continue operating.

Symptoms

If a thread becomes stuck, the following appears in the logs:

2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has become stuck: StageWorker/0

If a thread dies, the following appears in the logs:


2016-05-10 12:34:54.487 ERROR [watchdog.threadactive] <Global watchdog thread> Thread has died: StageWorker/0

If the watchdog detects that a DeliveryThread has become stuck or has died, that node will exit with the following message and a stack trace:

2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
2016-05-10 12:34:56.509 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: ThreadActive:DeliveryThread/0

The watchdog monitors the number of live StageWorker threads to ensure that more than 50% of them remain alive. If fewer than 50% of the StageWorker threads have remained alive without terminating or becoming stuck, the watchdog will cause the node to exit with an error message similar to the above. This proportion of surviving threads is configurable; for information about configuring the watchdog, see the Rhino Production Getting Started Guide.

Conditions under which a StageWorker becomes stuck or dies usually indicate that a deployed service is experiencing unexpected problems or has a faulty implementation. In this scenario, the best approach is to examine why the service is causing threads to block or exit; a thread dump, as sketched below, usually shows where. Common culprits include excessive blocking on external resources such as JDBC connections, infinite loops, and poor handling of unexpected circumstances.
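As a rough sketch of how to capture such a thread dump with standard JDK and Unix tools (the process-matching pattern below is an illustrative assumption; it depends on how Rhino is launched on your hosts):

# Locate the Rhino JVM process; this match pattern is illustrative only.
pid=$(pgrep -f 'java.*rhino' | head -n 1)

# jstack (shipped with the JDK) prints a full thread dump, including held locks.
jstack -l "$pid" > /tmp/rhino-threads.txt

# Alternatively, SIGQUIT makes the JVM write a thread dump to its console log.
kill -QUIT "$pid"

# Inspect what the StageWorker threads were doing when the dump was taken.
grep -A20 'StageWorker' /tmp/rhino-threads.txt

A StageWorker that repeated dumps show blocked inside the same external call (for example, a JDBC driver method) points at the external resource rather than at Rhino itself.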

The watchdog timeout values for StageWorker threads can be increased. However, this is not usually recommended, as a blocked StageWorker indicates a session with a very long processing time; it may be a reasonable solution when a service consistently needs to perform a blocking action that is expected to take a long time. It is also possible to increase the number of StageWorker threads, which improves responsiveness when a large proportion of StageWorker threads are blocked on an external service or a long computation. Increasing the StageWorker thread count is strongly recommended when increasing the timeout.

Conditions under which a DeliveryThread becomes stuck or dies usually indicate a fault in the hardware, operating system, libraries, or the Rhino SLEE itself. If this is the case, first determine whether hardware is the issue, referring to any available documentation on diagnosing hardware problems for the platform in question. Look at the output of the dmesg command on Unix operating systems, and check whether the machine has a history of restarts, kernel panics, process segmentation faults, and so forth.
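A minimal sketch of those checks on a typical Linux host; the exact system log location is a distribution-dependent assumption:

# Recent kernel messages: hardware errors, OOM kills, and segfaults show up here.
dmesg | tail -n 100

# Reboot history; an unexplained pattern of restarts points at hardware.
last reboot | head

# System log (path varies: /var/log/messages on Red Hat-style systems, /var/log/syslog on Debian-style).
grep -iE 'panic|segfault|mce|oom' /var/log/messages | tail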

In the case of a DeliveryThread becoming stuck or dying, investigate and resolve any possible hardware issues. Then contact your solution provider with the stack trace information from the Rhino logs and the circumstances under which the watchdog timeout occurred, so they can advise what further action to take.


2.1.2 Timewarp watchdog

The timewarp watchdog behaviour handles the system clock suddenly changing from the JVM’s perspective. A warning is printed in the log if the system clock suddenly changes, and a critical alarm is raised whenever a timewarp is detected; this alarm is not automatically cleared. The alarm is considered critical because all timewarps indicate severe problems in the underlying system that threaten cluster stability and the ability to process sessions in a timely manner.

A timewarp of 5s or greater is sufficient to break the clustering mechanism used in Rhino, and may cause the node to leave the cluster. This is done to prevent a split-brain situation or data corruption. A single-node cluster will not shut itself down due to timewarps, but a timewarp should still be considered a critical issue, as the node is unavailable while it occurs.

Rhino may fail if the system time ever goes backwards on a single node while the cluster is running, or if it goes backwards to overlap with a time period when the cluster was previously running. This is because unique IDs generated during the overlapping period may no longer be unique.

If the watchdog observes a sudden change in the system clock, the following will be printed on the console:

2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Forward timewarp detected! interval=8021 but should be at most 2000
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> old timestamp=1188363629857, new timestamp=1188363637878
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for CPU overloading or stop-the-world GC.
2016-08-29 17:00:37.879 ERROR [watchdog] <Global watchdog thread> Check for external processes that could step the system clock forwards ('date -s', NTP in step mode, etc)

Several causes of timewarps exist.

Long Garbage Collection (GC) pause

The most likely cause of forward timewarps is GC; GC cannot cause a reverse timewarp. To check GC pause times, look for entries in the console log reporting real times greater than 1s for ParNew, CMS-initial-mark, and CMS-remark. Also look for entries reporting Full GC or concurrent-mode-failure; concurrent-mode-failure indicates a severe problem.

2016-04-14 14:07:22.387 2016-04-14T14:07:19.279+0200: [GC2016-04-14T14:07:19.279+0200: [ParNew (promotion failed): 127844K->127844K(130112K), 3.1073460 secs] 1766087K->1792457K(3144768K), 3.1078370 secs] [Times: user=4.05 sys=0.00, real=3.11 secs]

2016-04-14 14:07:23.865 2016-04-14T14:07:22.387+0200: [Full GC2016-04-14T14:07:22.387+0200: [CMS: 1664612K->254343K(3014656K), 1.4772760 secs] 1792457K->254343K(3144768K), [CMS Perm : 77532K->77327K(196608K)], 1.4775730 secs] [Times: user=1.50 sys=0.00, real=1.47 secs]

2015-09-24 22:09:13.668 (concurrent mode failure): 22907K->8498K(32768K), 0.0504460 secs] 49900K->8498K(65344K), [CMS Perm : 15238K->15237K(196608K)], 0.0796580 secs] [Times: user=0.20 sys=0.00, real=0.08 secs]
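As a quick check, the console log can be scanned for collections whose wall-clock (real) time is 1s or more. A minimal sketch, assuming GC logging in the format shown above and a log file named console.log:

# Flag GC entries whose real (wall-clock) time is 1 second or more
grep -E 'real=[1-9][0-9]*\.[0-9]+ secs' console.log

# Look for the severe cases explicitly
grep -E 'Full GC|concurrent mode failure' console.log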

Resolving long GC pauses almost always requires modifying the heap size/new size to eliminate the long pause. This must be done in tandem

with performance testing to verify the effects of changes.

System clock changes

This is most commonly caused by a misconfigured time synchronization daemon (usually NTP). Rhino requires that NTP be run in slew mode, as step mode may make steps greater than the timewarp limit. NTP steps may be either forwards or backwards.
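As an illustration, ntpd can be forced to always slew (never step) the clock with its -x option; the chrony equivalent is to avoid an unrestricted makestep directive. A sketch only; the configuration file locations below are distribution-dependent assumptions:

# ntpd: add -x so corrections are always slewed, never stepped
# (often set in /etc/default/ntp or /etc/sysconfig/ntpd)
NTPD_OPTS="-x"

# chrony: in /etc/chrony.conf, remove or limit the "makestep" directive
# so the daemon never steps the clock on a running node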

Virtual Machine pauses

Virtual machines may pause processing for many reasons, and may produce both forward and reverse timewarps depending on hypervisor configuration. To minimize VM pauses, we strongly recommend avoiding overcommitting resources.

Excessive system IO waits

Excessive system IO waits can only cause forward timewarps. These occur when the host fails to handle disk IO in a timely manner, blocking the JVM. This may be diagnosed through OS-level logging and IO monitoring tools. Other causes of forward timewarps should be investigated first.

Note: VMs are particularly prone to this form of timewarp.

2.2 Long-running transaction warnings

Long-running transaction warning messages appear in Rhino logs if a transaction has been running for too long. Long-running transactions can be

caused when:

• a profile has been opened for editing but has not been committed or restored

• a large deployable unit is being installed; installation runs in a single transaction and can take a long time to compile

• a single node is restarting (nodes load most of their state from the database in a single transaction, which can take some time)

• a service makes blocking calls to external resources that are very slow to respond

• the cluster is running under excessive load.

2.2.1 Symptoms

Warning messages like the following in the Rhino logs or console indicate long-running transactions.

...
WARN [transaction.manager] There are 1 long-running transactions currently active:
WARN [transaction.manager] 43190ms: [OCTransactionImpl:TransactionId:[101:16]:ACTIVE]
...

2.2.2 Diagnostic steps and correction

A long-running transaction may be reported during installation of components during deployment, or when a node is starting up. In this case the warning is benign, and will be cleared when the components have finished installing. If this message appears only during start-up or deployment of components, it can be safely ignored.

If such messages persist or occur during normal operation, they should be reported to your solution provider. Gather the output of the dumpthreads.sh command from the node directory. This output is written to the Rhino console and is not gathered as part of Rhino's logging system; it can usually be found in the file console.log. Provide this output to your solution provider support.

2.3 Lock timeouts

Lock timeouts occur when a node cannot obtain a cluster-wide lock within a certain timeout. Lock timeouts are most likely to occur when nodes

leave the cluster, when nodes are experiencing overload conditions, or when network problems are being experienced. Some possible causes

are:

1. Overloaded nodes

2. Network congestion


3. Network failure

4. Nodes which are about to leave the cluster due to some failure (e.g. garbage collection pauses)

5. Too much contention for the lock in question, such that users of the lock are queued long enough for their requests to time out.

6. Too much load in general on the system, such that the lock request cannot be processed by the in-memory database before it times out.

7. Deadlock, when locked resources are acquired in different orders.

8. A transaction has somehow been "lost" and left open holding a lock; this could be a product bug.

9. A thread has become stuck, holding its transaction open and therefore holding its lock; this has various possible causes, including service bugs, JVM bugs, and/or product bugs.

2.3.1 Symptoms

Below are examples of warning messages in Rhino logs or console that indicate lock timeouts.

...
... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ...
...
... ========= Lock timeout ========= ...
...
... Unable to acquire lock on [lockId] within 5000 ms ...
...

2.3.2 Diagnostic steps and correction

The diagnostic process determines whether the failure is due to environmental issues or to hardware, network, or software faults. A hardware, network, or software issue is usually indicated by the presence of an alarm. To check for alarms, use the following rhino-console command.


./client/bin/rhino-console listactivealarms

This command produces one line of output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that

a node has left the cluster due to hardware, network, or software failure.

...Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster...

If an alarm indicates node failure, follow the diagnostic and resolution procedures for Node Failure on page 41 .

If the output of the command does not contain a line similar to above, or indicates no alarms are active (as below), then the symptoms are

occurring because of environmental issues or software faults.

No active alarms

Overload conditions are likely to cause lock timeouts. CPU usage should never exceed 70% for an extended period of time. CPU usage can be

measured on Linux using the top command.
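For example, top can run unattended in batch mode to build a CPU usage record (a sketch; the sample counts are illustrative):

# One snapshot every 10 seconds for an hour, appended to a file
top -b -d 10 -n 360 >> cpu-usage.log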

Garbage collection, particularly under overload conditions, can cause lock timeouts. The behaviour of the garbage collector can be analysed by examining the garbage collection logs. By default these are logged to the Rhino console or the console.log file in node-???/work/log. For further information on diagnosing GC-related problems, refer to Java Virtual Machine Heap Issues on page 30 and Application or Resource Adaptor Heap Issues on page 32.

SLEE load can also be investigated with the Statistics Client. For more information about using the Statistics Client, see the rhino-stats section in the Rhino Administration and Deployment Guide. Of particular value are the time statistics for the lock manager, MemDB, and Savanna subsystems. If any of these report times in the tens of milliseconds or greater, this is usually a symptom of overload.

To resolve problems caused by overload, it is usually necessary to add more cluster nodes on new hardware, correct the configuration of the software, or improve the efficiency of the service logic. Follow the diagnostic steps in Operating Environment Issues on page 28, Java Virtual Machine Heap Issues on page 30, and Application or Resource Adaptor Heap Issues on page 32. If this does not lead to a resolution, contact your solution provider for assistance.


If a lock timeout is caused by overloading the cluster, then an interim resolution is to enable a user-specified rate limiter to limit the number of

events processed by the cluster. Instructions for doing so are in the Rhino Administration and Deployment Guide .

Threshold rules should be set up to alert the SLEE administrator to adverse conditions within the SLEE.

Deadlock, lost transactions, and stuck threads are typically caused by software bugs. These can be in Rhino, service logic, or RAs, and will often be unrelated to call rate. To assist in diagnosing these problems, observe the statistics parameter sets StagingThreads and Events. Diagnosing the cause will usually require setting the log keys transaction.manager and transaction.instance to the debug level. The logs obtained should be sent to your solution provider. Note that debug logging writes a lot of log data and can impact service performance.
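A sketch of raising and then restoring these log levels with rhino-console, assuming the setloglevel command described in the Rhino Administration and Deployment Guide (verify the exact command name and the prior levels for your Rhino version):

./client/bin/rhino-console setloglevel transaction.manager debug
./client/bin/rhino-console setloglevel transaction.instance debug

# restore once the logs have been captured (info is assumed to be the prior level)
./client/bin/rhino-console setloglevel transaction.manager info
./client/bin/rhino-console setloglevel transaction.instance info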

2.4 I/O exceptions

There are several causes for I/O exceptions to occur in Rhino logs. Below are the diagnostic steps and workarounds or resolution for I/O

exceptions from:

• Savanna on page 22

• Rhino’s logging infrastructure on page 22

• Rhino’s core on page 22

• resource adaptors or SBBs on page 23

For all diagnostic steps and workarounds or resolutions, first check that:

1. there is sufficient disk space on the running system

2. executing processes have write permissions to files in their subdirectories

3. the ulimit for the number of open file descriptors is not too low; depending on the protocols in use, Rhino may require a higher open-files ulimit than the OS default.
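A quick way to inspect the limit, with an example of a persistent override (a sketch; the rhino user name and the value 65536 are illustrative assumptions):

# Show the open-files limit for the shell that will start Rhino
ulimit -n

# Example persistent override in /etc/security/limits.conf
rhino  soft  nofile  65536
rhino  hard  nofile  65536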


2.4.1 I/O exceptions from Savanna in Rhino logs

2.4.2 Symptoms

The log key for these I/O exceptions contains any of the following phrases:

• savanna

• framework

• ocio

2.4.3 Diagnostic steps and correction

These messages typically occur if the network is unable to accept multicast traffic from Rhino. This can be caused either by inappropriate kernel IP routing configuration or by problems with network elements such as routers, switches, ethernet cards, and firewalls. Start diagnosis from Operating Environment Issues on page 28. If this does not produce a solution, continue with Clustering on page 41. Finally, contact your solution provider for support.

2.4.4 I/O exceptions from Rhino’s logging infrastructure in Rhino logs

2.4.5 Symptoms

The log key for these I/O exceptions contains the following phrases:

• rhino.logging

• log4j.*

2.4.6 Workaround or resolution

These messages typically occur if the Rhino installation does not have appropriate disk space and/or permissions available or there is a hardware

error.

2.4.7 I/O exceptions from Rhino’s core in Rhino logs

2.4.8 Symptoms

The log key for these I/O exceptions contains the following phrases:


• rhino.*

• memdb

• transaction.*

2.4.9 Workaround or resolution

These messages should be provided to the solution provider unless they clearly indicate the cause.

2.4.10 I/O exceptions from resource adaptors or SBBs in Rhino logs

If the log messages do not occur on any of the log keys mentioned in the cases above, they are likely to come from a resource adaptor or SBB. These messages should be provided to the solution provider and possibly the resource adaptor vendor.

2.5 Inactive RA provider

2.5.1 Symptoms

A service fails with an error that looks like:

java.lang.IllegalStateException: sendQuery called on inactive RA provider

2.5.2 Resolution

The above exception indicates that you need to activate the RA entity. Documentation on managing RA entities can be found in the Rhino

Administration and Deployment Guide .

2.6 Leaked OCIO buffers

2.6.1 Symptoms

Rhino logs contain the message:


2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12201 leaked OCIO buffers
2015-04-15 22:54:30.794 INFO [savanna.ocio.leaks] <OCIO Cleanup Thread> Garbage-collected 12301 leaked OCIO buffers

2.6.2 Diagnostic steps

Compare the times these warnings are logged with garbage collection cycles. Cleanup of leaked OCIO buffers happens during CMS GC cycles

and all G1 collections, and it writes one log entry per 100 buffers cleaned up. The log entry reports the accumulated total of all OCIO buffers

cleaned up since the node started. If this message is occurring on every GC cycle, that may indicate an environment-related performance

problem.

Continue diagnosis of the performance problem, starting at Operating Environment Issues on page 28 .

If there are no signs of overload, then the warning is harmless and indicates that a section of code in the Rhino clustering implementation has

cleaned up internal objects that would have otherwise leaked.

2.6.3 Profile Management and/or provisioning failing

Symptoms

Creating or importing profiles fails.

Diagnostic steps

If profile management and/or provisioning commands are unsuccessful, this is possibly due to the size restriction of the ProfileDatabase installed in Rhino. The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.ProfileDatabase

Check the Rhino log for messages with a log key profile.* similar to the following:

2016-12-01 13:41:57.081 WARN [profile.mbean] <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
    at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
    at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
    at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
    ... 48 more

Resolution

Increase the size of the profile database as described in Memory Database Full on page 66.

2.7 License over capacity alarms

If this alarm occurs, contact your solution provider to purchase additional capacity licenses.


2.8 Rhino logs containing exceptions

If the exception is of the type AccessControlException, please refer to Security Related Exceptions on page 63 to solve the issue.

If the exceptions contain the text "Event router transaction failed to commit", it is likely that the in-memory database is not sized correctly for the load. Follow the diagnostic and resolution steps in Memory Database Full on page 71.

For other exceptions please contact your solution provider for support.

2.8.1 Resource Adaptors throwing Exceptions

If the log key for the exception log matches the tracer for a resource adaptor entity (e.g. trace.sipra), the cause of the problem is likely to be a faulty resource adaptor. This indicates either a misconfiguration of the resource adaptor in question or a bug in its implementation. If this occurs frequently, contact your solution provider support, attaching the log files that contain the exception message.


3 Environmental

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for environmental issues with Rhino.

3.1 Operating environment issues

There are several common cases where Rhino's operating environment is not appropriately configured. These are typically related to resource competition or poor sizing for a given load. The most common causes are competition for CPU and heavy disk I/O.

3.1.1 Symptoms

• Slow event response times

• Low event throughput rate

• Dropped calls

• Periodic failures

3.1.2 Diagnostic steps

Operating system utilities such as vmstat, mpstat, and iostat are particularly helpful in determining whether or not a problem is caused by resource contention. Run such tools for enough time to observe the problem, then review the logged output from immediately before and after. Look at CPU, I/O, swap, and network use. You may also want to record these continuously, for analysis of long-term system performance and transient faults.

If possible, collect system statistics at intervals of no more than 10s, covering every major subsystem. The standard UNIX sar monitor also collects useful data; other tools may offer higher resolution or better analysis support.
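For example (a sketch; the sample counts are illustrative):

# One sample every 10 seconds, 360 samples (one hour)
vmstat 10 360 >> vmstat.log

# sar (from the sysstat package) can record to a file and replay it later
sar -o /tmp/sar.data 10 360   # record
sar -f /tmp/sar.data          # replay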

Some of the specific things to look for are listed below:

Please ensure that the system is sized such that no more than around 75% of CPU is used by Rhino under peak load with one failed node. If the machine is above this CPU utilisation, it has been incorrectly sized. In a virtualised environment, this should be scaled to allow for the failure of one physical host.


If other processes running on the same hardware compete for CPU for long enough, the responsiveness of a Rhino node may suffer. In our

experience, operating system scheduling frequently does not provide Rhino with enough resources if there is competition for those resources.

Rhino should be installed on dedicated hardware with no competing applications. Processes that can be especially demanding of CPU and I/O

include databases and other Java servers.

Cron jobs can cause other system processes to run which may compete for system resources. Cron jobs which are not essential should be

disabled. Particular note should be taken of the locate service which will rebuild its database as a daily or weekly task, an I/O intensive

procedure.

CPU use of the machine should be monitored to ensure that there is more than about 25% of CPU available at all times, including with one node failed. Ideally data should be collected at intervals of no more than 10s, preferably 1s.

High levels of disk I/O can result in increased CPU use, and can cause the Rhino JVM process to be scheduled infrequently. This can also cause

writes to be scheduled synchronously, which may block threads during critical tasks such as GC. This is particularly likely on highly contended

disks such as found in virtualised environments and SAN storage arrays.

Swap use is often the cause of disk I/O. Rhino should be installed on a machine with sufficient physical RAM for the workload of the machine.

Swap will also significantly impair GC performance. If even a small amount of the JVM heap is swapped out, GC pause times can increase by up

to 300% and potentially much longer for greater swap usage. While the CMS collector will usually touch pages frequently enough to keep them in

physical memory, the G1 collector can leave regions untouched for long enough to be swapped out preemptively on Linux.

Rhino’s cluster membership services use multicast UDP. If the IP network is loaded heavily, this can cause packets to be dropped, potentially

causing cluster segmentation or failure if individual cluster nodes cannot reliably communicate.

See Cluster Segmentation on page 44 , Cluster Failing to Start on page 45 and Cluster Starts but Stops After a few Minutes on page 47

for more details of multicast related problems.

Large systems (that is, four or more sockets per host) have significant asymmetry in memory access speed and latency between CPU sockets. This can cause performance to be significantly below optimum.

3.1.3 Workaround or resolution

If any of the machines Rhino is running on are running over 75% CPU use, and this is not due to other system processes, then consider:

• configuring a user-specified rate limiter as a workaround


• increasing the number of available threads

• installing more CPUs into a node

• installing more nodes into a cluster

• using faster CPUs.

Processes causing a problem due to resource usage should not run on a machine running Rhino; move them to another host.

To prevent preemptive swapping slowing GC, reduce the sysctl vm.swappiness from its default of 60 to something low enough to prevent this; 30 is a good starting point. If using G1, vm.swappiness should be set no higher than 30, and preferably 10 or lower.
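A sketch of applying the setting:

# Apply immediately
sysctl -w vm.swappiness=30

# Persist across reboots (in /etc/sysctl.conf or a file under /etc/sysctl.d/)
vm.swappiness = 30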

When running Rhino on NUMA hosts it is usually best to run multiple nodes on each host with each JVM bound to a separate NUMA domain.

Configure this with numactl --cpunodebind=X --membind=X on Linux or Locality Groups on Solaris.
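For example, on a two-socket Linux host running two nodes (a sketch; the per-node start script path is an assumption and may differ in your installation):

numactl --cpunodebind=0 --membind=0 ./node-101/start-rhino.sh &
numactl --cpunodebind=1 --membind=1 ./node-102/start-rhino.sh &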

3.2 Java Virtual Machine heap issues

Poor performance or failure of a Rhino node may result from a JVM configuration that has too little heap for the applications deployed and the incoming traffic processed by those applications.

This section describes how to diagnose incorrect heap sizing. See Application or resource adaptor heap issues on page 32 to diagnose

application or resource adaptor memory leaks.

3.2.1 Symptoms

• Large garbage collection pauses in garbage collection logs

• Longer than expected message processing and/or call setup times

• Failed or timed-out call setup or message processing attempts

• Out of memory error appearing in Rhino logs

• Watchdog errors "Over memory use limit"

3.2.2 Diagnostic steps

There are two common causes of heap issues when running Rhino:

• heap sizing may be incorrect


• something may be leaking memory.

For the latter, see Application or resource adaptor heap issues on page 32 .

The heap usage of Rhino should be monitored. This includes both the Java heap and the native heap. The Java heap can be measured by

Rhino’s statistics tool rhino-stats , or one of the Java monitoring tools — jstat or jvisualvm . The native heap size can be measured using

system utilities such as ps or top .

Here’s an example of using the Rhino statistics tool to monitor the size of the Java heap:

./client/bin/rhino-stats -m JVM

With a well-sized heap, the expected JVM behaviour is that free heap gradually decreases from, say, 1.5 gigabytes to around, say, 800 megabytes, and then returns to its upper limit again. This is overlaid by a shorter pattern of peaks of, say, 128 megabytes on top of the long-term usage pattern. The actual figures depend on the services deployed and active; the total heap used should never exceed 60% of the allocated heap size for sustained periods when using the default CMS collector.

If the JVM heap is sized appropriately for a given load, and the system appears to run stably at that load but gradually degrades over a period of at least several hours, then it is possible that the JVM, a resource adaptor, or Rhino has a memory leak.

Check the number of live activities and SBBs using the Rhino statistics monitoring tool. If the number of each is within expected values, it is likely there is a memory leak. If there are too many, there may be a bug in the component. A strong indicator of an RA or service bug is activities that are more than ten times the age of the average call duration, or older than one whole day.

3.2.3 Workaround or resolution

First raise the maximum heap size available to the JVM by modifying the HEAP_SIZE variable in the config/config_variables file. Ensure that there is enough physical memory on the machine that the system will not swap if this extra memory is used by the JVM. Also set the NEW_SIZE based on the duration and memory usage of a typical session. The Rhino node must be restarted for this change to take effect. See the JVM documentation for more information on heap sizing and related overheads.
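A minimal sketch of such an edit (the values are illustrative only; appropriate sizes must come from capacity planning and performance testing):

# node-???/config/config_variables (example values, not recommendations)
HEAP_SIZE=3072m
NEW_SIZE=256m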

Tuning garbage collection settings may improve performance, but this is NOT RECOMMENDED without prolonged performance

testing, as certain combinations of garbage collection settings may severely affect the stability of a cluster and cause JVM crashes. If

you wish to explore this avenue:


• contact your solution provider support or OpenCloud support for help

• thoroughly load test the system for at least two weeks using the new garbage collection settings before deployment to production

• consider adding more RAM to a machine, or more machines to the cluster.

Alternative garbage collectors such as G1 may improve worst-case performance at some cost to the average case, but they have their own tuning requirements. G1 requires Java 8 or newer to function reliably.

With some allocation patterns it may not be possible to find a heap size or new size that provides an acceptable balance between pause times and

durations for the target load. If this is the case the only solutions are to use a different cluster size or contact your solution provider for an improved

version of the service that uses memory more efficiently. This problem should only arise as a result of an error in capacity planning and testing.

A workaround that is sometimes available if the host server has sufficient free RAM is to increase the heap size until it is large enough that the

pauses can be avoided by restarting nodes during idle periods (however this is rarely possible).

Please contact your solution provider support to work through component memory leak issues.

3.3 Application or resource adaptor heap issues

A system with application or resource adaptor related heap issues displays several symptoms, which are similar to those of an inappropriately

sized JVM heap on page 30 .

3.3.1 Symptoms

• Large garbage collection pauses in garbage collection logs

• Longer than expected message processing and/or call setup times

• Failed or timed-out call setup or message processing attempts

• Out of memory errors in Rhino logs

3.3.2 Diagnostic steps

When these symptoms occur it is often necessary to check both for application or resource adaptor related heap issues and for JVM heap issues

on page 30 .


Start the rhino-console and query for activities that are older than expected for the system. For example, call activities typically last the duration of a phone call, so query for activities that are older than the majority of phone calls (say, one hour). To perform this query, use the findactivities command with the -created-before flag.

Here’s an example of finding activities created before 2:30pm:

findactivities -created-before 14:30:00

You can also query for activities older than a specified age, for example:

findactivities -created-before 1h

The resulting number of activities should be low. If it is larger than, say, 30 (depending on your application), there may be an issue with some activities not being cleaned up properly by the service or resource adaptor. Exclude ServiceActivities and ProfileActivities from the set counted.

If the number of activities is small, check the number of SBB entities in Rhino. SBB entities are removed when they are no longer in use, so if you find a large number of SBB entities that were created a long time ago, it is likely that the service has a problem. The command to query SBB entities is findsbbs.

Here’s an example of finding SBB entities for the SIP proxy service created before 2:30pm:

findsbbs -service SIP\ Proxy\ Service\ 1.5,\ Open\ Cloud -created-before 14:30:00

We also recommend that you use the statistics client to examine the memory usage and number of activities in the running SLEE:

$ client/bin/rhino-stats -m Activities
2006-02-27 11:56:29.610 INFO [rhinostat] Connecting to localhost:1199
2006-02-27 11:56:33.952 INFO [rhinostat] Monitoring
2006-02-27 11:56:34.957 INFO [rhinostat] Cluster has members [101]

Activities

time                      active  dropped  finished  started
-----------------------   ------------------------------------
2006-02-27 11:58:59.574   57      -        -         -
2006-02-27 11:59:00.612   57      0        19        19
2006-02-27 11:59:01.635   57      0        21        21
2006-02-27 11:59:02.657   57      0        19        19
2006-02-27 11:59:03.875   57      0        21        21
2006-02-27 11:59:05.033   57      0        20        20
2006-02-27 11:59:06.053   57      0        20        20

3.3.3 Workaround or resolution

This indicates an unhandled case in the services or resource adaptors being used. Activities and SBB entities known to be invalid may be

removed through client/bin/rhino-console using the removeactivity and removesbb commands respectively. These commands may

be used to stop a system from running out of memory before a patched application or resource adaptor is installed.

3.4 Rhino start-up fails with 'java.io.IOException: Not Enough Space'

The most likely cause of this exception is that there is not enough free memory. When compiling deployable units, Rhino versions earlier than 2.3.0.7 fork another JVM to run javac. This may double memory requirements during deployment. The exact behaviour depends on the memory management of the operating system. Since version 2.3.0.7, Rhino no longer forks to run javac unless specifically configured to do so for diagnostic purposes. The problem can still occur but is less likely.

3.4.1 Symptoms

INFO [rhino.management.deployment.builder] <main> Generating profile implementation
INFO [rhino.management.deployment.builder] <main> Compiling generated profile specification common classes
java.io.IOException: Not enough space
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:52)
    at java.lang.Runtime.execInternal(Native Method)
    at java.lang.Runtime.exec(Runtime.java:566)
    at java.lang.Runtime.exec(Runtime.java:491)
    at java.lang.Runtime.exec(Runtime.java:457)
WARN [rhino.management.deployment] <main> Installation of deployable unit failed:
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1
Failed to start rhino node
com.opencloud.rhino.management.deployment.BuildException: javac failed - exit code -1

Node 101 exited due to misconfiguration.

3.4.2 Diagnostic steps

First, check how much free memory is available to the JVMs, and that swap space is available. Rhino should never run using swap space (doing so will cause unpredictable performance); nonetheless, sufficient swap space should be present to allow the OS to manage memory more effectively.

Check the amount of physical memory available. Under Linux, use the free command; under Solaris, use /usr/sbin/prtconf | grep Memory.

Compare this amount to the size of the JVM specified in the HEAP_SIZE variable in node-???/config/config_variables (for Rhino production). Make sure that the JVM can fit comfortably in the physical memory of the computer.

3.4.3 Workaround or resolution

Rhino versions earlier than 2.3.0.7 require about twice as much physical memory as HEAP_SIZE, to ensure that a fork() of the JVM during compilation of services can always run in physical memory without incurring delays from swapping memory to and from disk. Linux systems use copy-on-write memory when forking processes, so the likelihood of swapping is much lower than on Solaris.

You may need to reduce the value of HEAP_SIZE by editing the file $RHINO_HOME/node-???/config/config_variables to make sure

Rhino can fit in the available physical memory.

Also make sure there are no other running processes on the computer which may be using excessive system resources, and that there is

sufficient swap space.

3.5 Warning about UDP buffer sizes

Rhino cluster communication requires large UDP send and receive buffers.

The operating system limits for socket transmit and receive buffers must be large enough to allow the buffer size to be set.


3.5.1 Symptoms

Rhino prints warnings at startup:

WARN [savanna.ocio.socket.sc] <main> Asked for a UDP receive buffer of 261120 bytes, but only got 212992 bytes. Check system-wide limits (net.core.rmem_max, udp_max_buf)

3.5.2 Resolution

Ensure that the kernel parameters net.core.rmem_max and net.core.wmem_max are large enough.

See UDP buffer sizes in the Preparing your Network section of the Rhino Production Getting Started Guide for details.
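A sketch of raising the limits (the value shown comfortably exceeds the 261120-byte request in the example warning above but is otherwise illustrative; see the Getting Started Guide for recommended values):

# Apply immediately
sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576

# Persist across reboots (in /etc/sysctl.conf or a file under /etc/sysctl.d/)
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576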

3.6 Java Virtual Machine error

The Sun Java Virtual Machine is typically very stable and, theoretically, should not crash even under high load. On rare occasions, however, the

JVM can crash for any of the following reasons:

• Memory on the local computer has been corrupted.

• The swap partition on the local disk has had data corruption.

• Some other aspect of the hardware in the local computer has caused a JVM crash.

• A bug or library versioning issue in the operating system has caused the JVM to crash.

• A bug in one of the included Java Native Interface (JNI) libraries has caused the JVM to crash.

• A bug in the JVM itself has caused it to crash.

3.6.1 Symptoms

When the JVM crashes, the Rhino node will fail and the cluster will reconfigure itself. On Unix systems, the node will die with either a SIGSEGV or

a SIGBUS and perhaps dump core, depending on how your system is configured.

A message similar to the following appearing in logs and/or the Rhino console indicates a JVM error.

#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGSEGV (0xb) at pc=0xf968c5f9, pid=27897, tid=64
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_05-b05 mixed mode)
# Problematic frame:
# J com.opencloud.slee.resources.signalware.OutgoingBackendConnection.writePacket(Lcom/opencloud/slee/resources/signalware/PacketType;Lcom/opencloud/slee/resources/psc/Encodeable;)V
#
# An error report file with more information is saved as hs_err_pid27897.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
27897 Abort - core dumped

The following are examples of configuration change messages.

...
2016-04-27 12:16:24.562 INFO [rhino.main] <SavannaDelivery/domain-0-rhino-db-stripe-2> Group membership change: GrpCCM: domain-0-rhino-db-stripe-2 Reg [101,18108] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.569 INFO [rhino.main] <SavannaDelivery/rhino-monitoring> Group membership change: GrpCCM: rhino-monitoring Reg [101,15192] - Memb [ 101, 102, 103, 104, 105, 106, 107 ] { BboneCCM: Reg [101,27068] - Memb [ 101, 102, 103, 104, 105, 106, 107, 200 ] }
2016-04-27 12:16:24.601 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node(s) [108] left cluster operational member set
2016-04-27 12:16:24.601 INFO [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Current membership set: [101,102,103,104,105,106,107,200]
2016-04-27 12:16:24.619 INFO [framework.scs] <SavannaDelivery/rhino-management> [rhino-management] Component change: Component(members=[101,102,103,104,105,106,107],transitioning=[101,102,103,104,105,106,107],nonPrimary=[],id=5222)
...

2016-04-27 12:22:39.041 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node has left primary component
2016-04-27 12:22:39.041 INFO [rhino.exithandler] <SavannaDelivery/rhino-cluster> Exiting process (Node left primary component)

An alarm appears in the console if a node has failed. Use this command to query active alarms:

./client/bin/rhino-console listactivealarms


This command produces a block of output for each active alarm. A block in the list of active alarms similar to the following indicates that a node

has left the cluster due to hardware, network, or software failure.

...
Alarm 101:193861858302463 [rhino.node-failure]
  Rhino node : 101
  Level      : Major
  InstanceID : 102
  Source     : (Subsystem) ClusterStateListener
  Timestamp  : 20161103 17:44:02 (active 0m 6s)
  Message    : Node 102 has left the cluster
...

3.6.2 Diagnostic steps

If any of these symptoms occur, first check the logs of all Rhino cluster members (file console.log). If a JVM error message is part of the logs of any node, a Java Virtual Machine error has occurred. If an error message cannot be found in the logs, refer to Cluster Segmentation on page 44, Cluster Failing to Start on page 45, and Cluster Starts but Stops After a few Minutes on page 47.

After checking Rhino's logs, determine whether the hardware or operating system is causing the problem. Look at the logs of the local machine and determine whether the machine has a history of restarts, kernel panics, process segmentation faults, and so forth. On Unix machines, system logs can be viewed using the dmesg command or by viewing logs in /var/log.

When a JVM crash occurs, a file will be left with a name like hs_err_pid*.log . View this file and try to determine which part of the JVM caused

the crash. This may provide clues as to how to resolve the situation and will be needed if a bug in the JVM is to be reported.

3.6.3 Workaround or resolution

If the crash appears to be a one-off, the node can simply be restarted.

If the problem resides in faulty hardware, then that hardware will need to be fixed or replaced.

OpenCloud has experience with configuring the Sun Java Virtual Machine to be as stable as possible. Consider consulting OpenCloud support

for information regarding known stable configurations of JVMs. Also ensure that thorough stability tests are performed on a new version of a Java

Virtual Machine before it is deployed in a production system.


Also scan through the list of dynamic libraries in the hs_err_pid*.log file and see if they are all the latest version. Keep an eye out for libraries

which should not be there, for example from other installations of older JVMs.

If it is a realistic option, upgrading the operating system to the latest version may also resolve this problem.

JVM errors that cannot be identified as being caused by system configuration or dynamic library problems should be reported to Oracle.

3.7 Multicast traffic is using the wrong network interface

3.7.1 Symptoms

The host machines for Rhino have multiple network interfaces, and we need to control which interface is used for SLEE cluster traffic. Network monitoring shows this traffic is currently not on the desired interface.

When the configuration differs between hosts, nodes may fail to go primary. This is indicated by repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary, and a lack of configuration change messages.

Messages like this show up in the Rhino console and logs:

...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...

3.7.2 Diagnostic steps

Use a network monitoring tool such as tcpdump or snoop to detect multicast traffic on the groups configured for Rhino clustering.

Use netstat -rn, route, or ip route to examine the current multicast routing configuration.


3.7.3 Workaround or resolution

The choice of interface is controlled solely by the machines' multicast configuration; it’s controlled at the OS level rather than by Rhino. Use

route or ip route to modify the OS routing configuration. See Cluster Failing to Start on page 45 for how to configure the interface used for

multicast traffic.
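For example, on Linux a single route can direct all multicast traffic out a chosen interface (a sketch; eth1 is an assumed interface name):

# Send all multicast (224.0.0.0/4) via eth1
ip route add 224.0.0.0/4 dev eth1

# Verify
ip route show | grep 224.0.0.0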


4 Clustering

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for clustering issues with Rhino.

4.1 Node failure

Rhino is designed to survive hardware and software failure by using a redundant cluster of nodes. There are many possible causes for node

failure; below are symptoms and diagnostic steps for the most common.

In most deployments Rhino is configured to automatically restart failed nodes if the failure was caused by network or software faults. In all cases

not caused by hardware faults contact your solution provider support, providing all the rhino.log and console.log files from the failed Rhino node.

Restart the failed node if it did not restart automatically.

4.1.1 Configuration change messages appear in Rhino log output.

Symptoms

A Rhino node may determine that another node has left the primary component (due to failure of hardware, network, or software). When this

happens, the output looks like this:

...
WARN [rhino.membership] Node(s) [102] left cluster admin operational member set
INFO [rhino.membership] Current membership set: [101]
...

Diagnosis and Resolution

Please see Cluster Segmentation on page 44 .

4.1.2 An alarm indicates a node has failed.

Symptoms

An alarm appears in the console if a node has failed. To query active alarms run this command:


./client/bin/rhino-console listactivealarms

This produces one line of log output for each active alarm. If the list of active alarms contains a line similar to the following, it indicates that a node

has left the cluster due to hardware, network, or software failure.

...Alarm ...(Node 101, 23-Nov-05 15:10:36.636): Major [rhino.cluster] Node 102 has left the cluster ......

Diagnosis and Resolution

Please see Cluster Segmentation on page 44 .

4.1.3 A Rhino node exits the JVM.

Symptoms

A Rhino node may determine it is not in the cluster majority after a cluster segmentation. If this is the case it terminates.

An example of the log output from the terminated node is as follows.

INFO [rhino.main] Cluster backbone membership change: BboneCCM: Trans [103,30] - Memb [ 103 ]
INFO [rhino.main] Cluster backbone membership change: BboneCCM: Reg [103,32] - Memb [ 103 ]
...
WARN [rhino.membership] Cluster admin group has left primary component - exiting process
WARN [rhino.starter.loader] Node is no longer ready - exiting process

For more information please see Cluster Segmentation on page 44 .

4.1.4 Out of memory errors

Symptom

Out of memory errors in the console log (console.log).


Diagnosis and Resolution

Please refer to Java Virtual Machine Heap Issues on page 30 and Application or Resource Adaptor Heap Issues on page 32 .

4.1.5 JVM errors.

A message in the Rhino console log like the following indicates a JVM error.

## A fatal error has been detected by the Java Runtime Environment:## SIGSEGV (0xb) at pc=0x00007f9c742ffe25, pid=16224, tid=0x00007f9c720c6700## JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)# Problematic frame:# V [libjvm.so+0x5c2e25] G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45## Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again## If you would like to submit a bug report, please visit:# http://bugreport.java.com/bugreport/crash.jsp#...

If such a message is present, please see Java Virtual Machine Error on page 36 .

4.1.6 Watchdog timeouts.

Symptoms

A message like the following in the Rhino logs indicates a watchdog timeout.

2005-09-06 10:18:13.484 ERROR [watchdog] Waiting for thread cleanup.....2005-09-06 10:18:14.494 ERROR [watchdog] ***** KILLING NODE *****


Diagnosis and Resolution

If such a message is present please see Rhino Watchdog on page 14 .

4.2 Cluster segmentation

Cluster segmentation occurs when part of a Rhino cluster determines that a cluster node or a set of nodes is no longer reachable. Nodes that

determine that they no longer make up a majority of the cluster (known as a quorum) automatically deactivate themselves.

Cluster segmentation can be caused by several things, the most common being a network failure such as a switch malfunctioning.

4.2.1 Symptoms

• Configuration change messages appear in Rhino log output

• A Rhino node exits the JVM

• An alarm indicating a node has failed

• Watchdog timeouts

4.2.2 Diagnostic steps

First determine whether or not the segmentation is environmental. This means checking that the operating environment (see Operating Environment Issues on page 28) and the JVM heap (see Java Virtual Machine Heap Issues on page 30) are suitably configured. If they are, the issue is likely to be something in the network which interconnects the Rhino nodes. Typical network components to check for failure are machines, network interface cards, routers, switches, and cables.

See Node failure on page 41 for other causes of node failure.

4.2.3 Workaround or resolution

If the cause of the symptoms is environmental, refer to the workaround or resolution steps for those issues. If the cause is a network issue, it needs to be repaired by replacing the failed components.


4.3 Cluster failing to start

Rhino nodes must be part of the “primary component” in order to perform any work. The primary component is the set of nodes sharing the same

cluster state. When booting up, nodes will wait to join the primary component before being able to perform work. This section describes the likely

issues which cause a node to wait on the primary component for an extended period of time (greater than several seconds).

If the following diagnostic steps do not resolve the problem, please ensure that the various cluster machines can reach each other. For example,

the ping command should indicate whether or not nodes can reach each other. If nodes cannot ping each other, then the cause is likely to be

network misconfiguration or hardware failure.

4.3.1 No Primary Component

Symptoms

Repeated messages in the Rhino log indicating that a node is waiting for the cluster to go primary.

Messages like this show up in the Rhino console and logs:

...
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
INFO [rhino.main] Waiting for cluster administration group to go primary; current primary members: []
...

Resolution

If this is occurring on all nodes, run the make-primary.sh script on one node.

See the Rhino Production Getting Started Guide for more information about primary components.


4.3.2 Multicast Configuration Error

Symptoms

No configuration change messages in the Rhino logs.

A lack of configuration change messages in a clustered setup indicates that the Rhino node is not able to communicate with other Rhino nodes in

the cluster.

Diagnostic and corrective steps

This is caused by an inability of nodes to communicate. Diagnosis and correction depend on whether the cluster communication mode is Multicast or Scattercast.

For multicast clusters:

Ensure that a route for the multicast IP addresses being used by Rhino is present in the operating system.

Use the route and netstat system tools for more information.
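For example, to check the kernel routing table for a multicast route and, if one is missing, add a route covering the whole 224.0.0.0/4 range on a hypothetical interface eth0:

netstat -rn | grep 224
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0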

Ensure that any routers are configured to allow multicast IP packets with the configured multicast addresses to be passed to the appropriate

machines.

Ensure that the following variables in the config/config_variables files on all Rhino nodes which are part of a cluster match:

SAVANNA_CLUSTER_ID
SAVANNA_CLUSTER_ADDR
SAVANNA_MCAST_START
SAVANNA_MCAST_END

Check that the host firewall is not configured to block the address/port range Savanna uses on the interface used for multicast. A simple test is to remove all firewall rules with iptables -F. Rhino uses UDP multicast to a range of IP addresses and ports set in node-<NODE_ID>/config/config_variables and node-<NODE_ID>/config/savanna/cluster.properties. The IGMP membership groups 224.0.0.1 and 224.0.0.2 must also be allowed on this interface.
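As an illustration only, rules such as the following (assuming the multicast interface is eth1, and that 224.0.24.0/24 covers your configured Savanna range) would allow the required traffic:

iptables -A INPUT -i eth1 -p igmp -j ACCEPT
iptables -A INPUT -i eth1 -d 224.0.24.0/24 -p udp -j ACCEPT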

Verify whether or not multicast IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster

members, use tcpdump and/or snoop .

For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:


tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap

The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark .

For scattercast clusters:

Ensure that all scattercast endpoint addresses are routable from every host in the cluster. Check that the scattercast endpoints set is up to date on nodes waiting to go primary. If a node is sufficiently out of date that it cannot receive from any node, it cannot detect that it is out of date and shut down.

Check that the host firewall is not configured to block the address/port pairs and ranges used by Savanna for scattercast. A simple test is to

remove all firewall rules with iptables -F . Rhino uses UDP unicast to a range of IP addresses and ports set in:

node-<NODE_ID>/config/config_variables
node-<NODE_ID>/config/savanna/cluster.properties
node-<NODE_ID>/config/savanna/scattercast.endpoints
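As an illustration only, assuming the scattercast endpoints use UDP ports 18000 to 18010 (as in the example output in section 4.7), a rule such as the following would accept that traffic:

iptables -A INPUT -p udp --dport 18000:18010 -j ACCEPT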

Verify whether or not Savanna UDP IP messages are being passed from one machine to another correctly. To see the IP traffic between cluster

members, use tcpdump and/or snoop .

For example, to record packet traces for the network interface eth1 limiting to a maximum of two 500MB files:

tcpdump -C 500 -i eth1 -W 2 -w mcast_dump.pcap

The collected packet capture can be copied to the administrator’s workstation and opened with Wireshark .

4.4 Cluster starts but stops after a few minutes

This affects multicast only.

4.4.1 Symptoms

• Configuration change messages appearing in Rhino log output

• Many or all Rhino nodes exit the JVM

For example:


2016-04-27 12:29:24.352 WARN [rhino.membership.rhino-cluster] <SavannaDelivery/rhino-cluster> Node has left primary component

• Multiple alarms indicating a node has failed

• Watchdog failures for the condition "GroupHeartbeat"

For example:

2016-04-27 12:29:24.207 ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: GroupHeartbeat for group rhino-monitoring (sent=33 received=23)

4.4.2 Diagnostic steps

Check the current set of registered multicast groups with netstat -gn . These should include all the groups used by Savanna and the global

IGMP membership reporting address 224.0.0.1 . Compare the reported set between nodes and with the set present immediately after a node

has started.
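For example, to list the registered groups and filter for addresses in the 224.0.0.0/4 range:

netstat -gn | grep 224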

Check for "Martians" in syslog. Sometimes a peer will be a device on the multicast network that is in a different IP range from the hosts.

To enable logging of these on Linux run:

echo 1 > /proc/sys/net/ipv4/conf/all/log_martians

When multicast networking is in use a network device, commonly the switch or router, acts as an IGMP querier, sending regular messages to

test for active multicast devices. The OS kernel receives and responds to these queries with membership reports. If no membership reports are

received before the timeout configured on the querier it will stop forwarding multicast traffic.

To report IGMP traffic (Multicast group membership queries and reports) for all network interfaces on a host:

tcpdump -i any igmp

Use tcpdump or snoop to capture IGMP traffic on the interface used for multicast. Look for Query messages and Membership Reports.


4.4.3 Workaround or resolution

The IGMP membership addresses 224.0.0.1 and 224.0.0.2 must be routed on the interface used for Savanna clustering as well as the

configured Savanna group multicast addresses.

Configure the local switch to act as an IGMP querier. This is especially important in virtualised environments.

Sometimes you need to force the kernel to fall back to IGMPv2. Usually this will be because the switch does not support IGMPv3. To configure

this, run:

/sbin/sysctl -w net.ipv4.conf.eth0.force_igmp_version=2

Make this change permanent by adding net.ipv4.conf.eth0.force_igmp_version=2 to /etc/sysctl.conf

While uncommon for hosts in the cluster, switches are frequently in different IP ranges. The Linux kernel will, by default, drop these messages from "impossible" addresses as a denial-of-service prevention measure. Also, some switches send IGMP queries on the All-Hosts group instead of the specific group Rhino joined. If your querier is on a different subnet from the nodes, disable the Martian packet safety check by setting the rp_filter sysctl to 0, or to 2 if the source address is on a network visible from another interface:

/sbin/sysctl -w net.ipv4.conf.default.rp_filter=0
/sbin/sysctl -w net.ipv4.conf.all.rp_filter=0

Make this change permanent by adding these to /etc/sysctl.conf :

net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0

For more information on rp_filter and other kernel multicast configuration, see https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt .

You will also need to configure the firewall to accept IGMP packets sent to 224.0.0.1 and 224.0.0.2 if your rule does not include these

implicitly.


4.5 Rhino SLEE fails to start cluster groups

Typically, the failure to start a cluster group indicates a network configuration problem.

Unless configured to use scattercast, Rhino distributes state between cluster members using UDP multicast. The particular multicast addresses

used are defined during installation. Multicast addresses are, by definition, in the range 224.0.0.0 through to 239.255.255.255 , otherwise

notated as 224.0.0.0/4 .

A failure will occur if Rhino tries to create a multicast socket on a non-multicast address. The Rhino install script checks to ensure that a valid

multicast address is used, so this would only occur if the Rhino configuration has been edited manually after installation.

It is possible to change the addresses used by the cluster manually, by editing the configuration on every node that has been created. Care should

be taken to ensure all nodes in the cluster are configured to use the same multicast addresses.

To change these addresses, edit the file $RHINO_HOME/node-101/config/config_variables .

4.5.1 Symptoms

Output is usually something like this:

ERROR [savanna.stack.membership.manager] Failed to start cluster groups
java.lang.RuntimeException: Failed to change addresses to /224.0.24.0:45601

4.5.2 Diagnostic steps

If the cluster is configured to use scattercast follow the diagnostic procedures for Cluster Failing to Start on page 45 and Scattercast endpoints

out of sync on page 52 . For multicast clusters follow the steps below:

The install script checks to ensure that a valid multicast range is used but this may be changed manually afterward. If the address reported is in

the multicast range described above, check that your OS kernel supports multicast:

cat /boot/config-<kernel version> | grep CONFIG_IP_MULTICAST


4.5.3 Workaround or resolution

If the configured addresses are not all within the multicast range, change the configuration to use only multicast addresses. If the configured UDP multicast range is not used by other applications, it should work fine for Rhino. Failing that, any address range within 224.0.0.0/4 can be used.

To change the addresses without reinstalling, edit the file $RHINO_BASE/node-101/config/config_variables and change the following

lines to use multicast addresses not taken by other programs:

SAVANNA_CLUSTER_ADDR=224.0.24.1
SAVANNA_MCAST_START=224.0.24.1
SAVANNA_MCAST_END=224.0.24.8

The master copy of this file (used when creating additional nodes) is $RHINO_BASE/etc/defaults/config/config_variables . If several

nodes are to be deployed, this can be changed before nodes are created, to save extra editing at a later stage.

4.6 Group heartbeat timeout

Rhino uses a heartbeat mechanism to ensure Savanna groups are alive and functioning correctly. This can cause a node to shut down when it

detects abnormal group behaviour.

4.6.1 Symptoms

Output is something like this:

ERROR [watchdog] <Global watchdog thread> Watchdog Failure has been notified to all listeners because condition 'GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)' failed
ERROR [watchdog] <Global watchdog thread> *** WATCHDOG TIMEOUT ***
ERROR [watchdog] <Global watchdog thread> Failed watchdog condition: GroupHeartbeat for group domain-0-rhino-db-stripe-0 (sent=20 received=10)
ERROR [watchdog] <Global watchdog thread> ***** KILLING NODE *****

Diagnostic and corrective steps

This is almost always caused by a networking issue that does not affect the backbone Savanna group used for cluster membership.


Diagnosis and correction is therefore identical to Cluster failing to start on page 45.

Also see Rhino Watchdog on page 14 .

4.7 Scattercast endpoints out of sync

This issue can only occur on a cluster using scattercast communications mode. Scattercast does not permit autodiscovery of nodes. Therefore

each node must a priori know the addresses of all nodes in the cluster. Whenever the persistent set of known endpoints gets out of sync on a

node, it will not be able to successfully join the cluster.

4.7.1 Symptoms

This problem can manifest in several ways, depending on how the scattercast endpoints are out of date. One example is shown below. Almost all

examples of scattercast endpoints out of sync will result in the out of date nodes shutting down.

ERROR [savanna.stack.membership.manager] <SavannaIO/pool-thread-1> Halting due to scattercast endpoint mismatch detection:
Network advertised version: 2 does not match local version: 1
ERROR [savanna.stack.membership.manager] <SavannaIO/pool-thread-1> An up to date scattercast.endpoints file can be found on any running node
INFO [rhino.exithandler] <SavannaIO/pool-thread-1> Exiting process (Exiting due to fatal Savanna misconfiguration or conflict)

It is possible to create a situation where a node boots with an out-of-date persistent endpoints set and fails to shut down. If booted after a scattercast management command has removed it from the endpoints set, a node will neither go primary nor shut down.

This will trigger frequent cluster membership changes in the remaining nodes. The cluster membership change messages will not change the set

of members.

INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,10] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,12] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,14] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,16] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Trans [101,18] - Memb [ 101 ]
INFO [rhino.main] <SavannaIO/pool-thread-1> Cluster backbone membership change: BboneCCM: Reg [101,20] - Memb [ 101 ]

Diagnostic and corrective steps

If some nodes are operational, run the Rhino console command getscattercastendpoints to learn the current cluster configuration. This prints the live in-memory configuration. If the scattercast.endpoints file on any running node contains a different configuration, it also prints the memory state for each node and the differences from the disk state.

If all nodes have the same configuration, the output will be similar to the example below:

[Rhino@host (#0)] getscattercastendpoints
[Consensus] Disk Mapping : Coherent
[Consensus] Memory Mapping :
  [101] Address : 192.168.4.103:18000
  [102] Address : 192.168.4.103:18001
  [103] Address : 192.168.4.103:18002
  [104] Address : 192.168.4.103:18003
  [105] Address : 192.168.4.103:18004
  [106] Address : 192.168.4.105:18005
  [107] Address : 192.168.4.105:18006
  [108] Address : 192.168.4.105:18007
  [109] Address : 192.168.4.105:18008
  [110] Address : 192.168.4.105:18009
  [111] Address : 192.168.4.105:18010

If the node configuration is consistent but does not contain the expected set of nodes, then the missing nodes should be added by running the

addscattercastendpoints command.

If the disk mapping differs from the in-memory configuration, then the nodes with the differing configuration are likely to fail when restarted. An

example output where node 101 has a malformed file that will cause a node to fail to reboot is below:


[Rhino@host (#0)] getscattercastendpoints
[101] Disk Mapping : 192.168.4.103:18001 present twice in endpoints file
[101] Memory Mapping :
  [101] Address : 192.168.4.103:18000
  [102] Address : 192.168.4.103:18001
  [103] Address : 192.168.4.103:18002
  [104] Address : 192.168.4.103:18003
  [105] Address : 192.168.4.103:18004
  [106] Address : 192.168.4.105:18005
  [107] Address : 192.168.4.105:18006
  [108] Address : 192.168.4.105:18007
  [109] Address : 192.168.4.105:18008
  [110] Address : 192.168.4.105:18009
  [111] Address : 192.168.4.105:18010

[102] Disk Mapping : Coherent
[102] Memory Mapping :
  [101] Address : 192.168.4.103:18000
  [102] Address : 192.168.4.103:18001
  [103] Address : 192.168.4.103:18002
  [104] Address : 192.168.4.103:18003
  [105] Address : 192.168.4.103:18004
  [106] Address : 192.168.4.105:18005
  [107] Address : 192.168.4.105:18006
  [108] Address : 192.168.4.105:18007
  [109] Address : 192.168.4.105:18008
  [110] Address : 192.168.4.105:18009
  [111] Address : 192.168.4.105:18010
...

If some nodes have a different configuration on disk from the live memory configuration, then copy the scattercast.endpoints file from a

node that has matching disk and memory configurations.


See also the Repair section of Scattercast Management in the Rhino Administration and Deployment Guide.


5 Performance

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for Rhino performance issues.

5.1 High Latency

High latency can be observed in services running on the Rhino SLEE if the system is not configured appropriately for the workload or if a resource leak is present. Services should always be performance tested before deployment; however, the tested load profile may not match the production load, so live diagnosis is sometimes required. Always attempt to reproduce performance problems in a test environment before trying solutions on a live system, unless the fix is both simple and obvious, e.g. a faster disk or more nodes.

5.1.1 Symptoms

• Long Garbage Collection pauses in Garbage Collection/console logs

• Longer than expected processing and/or response times

• Failed or timed out attempts in your application

5.1.2 Diagnostic steps

1. First, determine whether or not the operating environment is the cause of the problem. For more information on the operating environment refer to Operating Environment Issues on page 28.

2. Second, determine whether or not heap use is the cause of the problem. This should be considered the most likely cause when garbage collection pauses are long. For more information on heap use refer to sections Java Virtual Machine Heap Issues on page 30 and Application or Resource Adaptor Heap Issues on page 32.

The following steps apply if the previous two stages of diagnosis were unsuccessful. All three steps use the output from client/bin/rhino-stats monitoring the Staging Threads parameter set.

The output for the Staging Threads parameter set should typically be monitored over at least a 30-minute period, to ensure that the system characteristics are observed across several garbage collection cycles.

An example command which connects to a Rhino cluster member on the local host and monitors the Staging Threads parameter set is as follows.


./client/bin/rhino-stats -h localhost -m Staging\ Threads

Low number of available threads in Rhino statistics output

The number of available threads is shown in the avail column of the rhino-stats output. Typically, in a system which has too few threads configured, the number of available threads stays close to the maximum most of the time, but drops to a near-zero value for brief periods (half a second to several seconds).

High staging queue size in Rhino statistics output

The staging queue is used by Rhino to store incoming events when there are no free threads available to process them. In a well-sized system the staging queue is almost always empty. This can be observed by looking at the queueSize column, which should show a low number, less than the number of available threads. If it does not, try doubling the number of threads available to Rhino.

Dropped staging items

Dropped staged items means that queued events have been discarded because there was no queue space left to hold them. If this value is ever non-zero it indicates that events are not being processed fast enough. This can indicate that the system is configured with an inappropriate number of threads, or that a Service or Resource Adaptor in Rhino has an issue. Overload conditions can also cause dropped staging items.

High event processing time

Event processing time should be monitored for each event type that makes up the application flows, particularly those flows showing poor performance, e.g.

./client/bin/rhino-stats -h localhost -m "Events.ocsipra.[javax.sip.message.Request.INVITE net.java.slee, 1.2]"

The most important statistic in event processing is the SBB Processing Time (SBBT). This shows how long the thread spent processing in the service logic and indicates the theoretical upper limit on how many events per second can be processed per thread. High SBB Processing Time statistics indicate either too many staging threads, high CPU load or, more frequently, a service that has slow dependencies, e.g. a database query.

The Event Router Time (ERT) statistic shows the Rhino overhead of processing the event, including transaction commit time. This is dependent on Rhino overheads, SBB initial-event-selectors and the SBB's data storage activity. High Event Router Time statistics usually indicate either too many staging threads, high CPU load, or a service that stores very large amounts of data (more than 1MB per transaction) in the replicated MemDB. It may also be a symptom of a Rhino bug.

Event Processing Time (EPT) is equal to the sum of these and indicates the actual limit on the per-thread event processing rate with the currently deployed set of services.

5.1.3 Workaround or Resolution

If the diagnostic steps show that there are not enough threads available, try doubling the number of staging threads. This can be achieved using

the rhino-console command at runtime, or by editing the thread-count attribute of the staging queues element in config/config.xml and then

restarting that node.
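A minimal sketch of the relevant config.xml fragment is shown below; the attribute names other than thread-count, and the values shown, are illustrative and should be checked against the config.xml shipped with your Rhino version:

<staging-queues thread-count="60" maximum-size="3000"/>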

Note that it is possible to set the number of staging threads too high, and in doing so cause other performance problems. We do not recommend setting the number of staging threads higher than 200, or approximately 10x the number of CPU cores, on most systems. Setting it too high will cause scheduler thrashing, which will increase latency and reduce throughput.

If the problem persists, this indicates a potential issue with the Services and/or Resource Adaptors installed in Rhino.

In the case of performance problems with services or suspected bugs in Rhino please contact your solution provider for support. If possible the

problem should be reproduced in a test environment so additional logging for diagnosis does not impact live traffic.

5.2 Dropped Calls

Services running on Rhino have real-time commitments, and latency for those services needs to be at an absolute minimum. “Dropped calls” occur when an activity cannot be completed in an appropriate timeframe. This has many causes; the most common are described below, with diagnostic steps for each.

5.2.1 Symptoms

• Longer than expected processing and/or call setup times

• Failed or timed out call setup attempts

• Timed out or dropped stage items

• Rhino logs and/or console output containing exceptions


• Resource Adaptors throwing Exceptions

• Lock timeout messages in Rhino logs and/or console

• Rate limiting

• A dependent external system is not functioning properly

5.2.2 Diagnostic steps

1. Check the Rhino log for exceptions. Exceptions in the logs often indicate a cause for dropped calls. If exceptions are found in the logs, follow the steps in Rhino logs and/or console output containing exceptions on page 59.

2. Follow the diagnostic steps defined in sections Operating Environment Issues on page 28, Java Virtual Machine Heap Issues on page 30 and Application or Resource Adaptor Heap Issues on page 32 to determine whether or not they are the cause.

3. Follow the diagnostic steps defined in section High Latency on page 56.

5.2.3 Rhino logs containing exceptions

If the exception is of the type AccessControlException , please refer to section Security Related Exceptions on page 63 to solve the issue.

If the exceptions contain the text "Event router transaction failed to commit" it is likely that the in-memory database is not sized correctly for the

load. Follow the diagnostic and resolution steps in Memory Database Full on page 71 .

For other exceptions please contact your solution provider for support.

Resource Adaptors throwing Exceptions

If the log key for the exception log matches the tracer for a resource adaptor entity, e.g. trace.sipra, the cause of the problem is likely to be a faulty resource adaptor. This is indicative of either a misconfiguration of the resource adaptor in question or a bug in its implementation. If this is occurring frequently, contact your solution provider for support, attaching the log files that contain the exception message.

5.2.4 Lock timeout messages in Rhino logs and/or console

If lock timeout messages are occurring, please refer to section Lock Timeouts on page 18. Two examples of lock timeout messages are as follows.

...


... Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT ...

...

...

... ========= Lock timeout ========= ...

... Unable to acquire lock on [lockId] within 5000 ms ...

...


5.2.5 Rate limiting

A rate limiter may be rejecting input if the incoming rate exceeds its specified rate.

The rate limiter raises an alarm and outputs to the Rhino logging system if it is actively rejecting input. An example of this is shown as follows.

2011-08-02 22:02:59.931 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 103:212775261552:14 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,queue-saturation-limiter-rejecting-work] was raised at 2011-08-02 22:02:59.931 to level Major QueueSaturation limiter is rejecting work
2014-11-18 23:26:41.134 Major [rhino.facility.alarm.manager] <Timer-1> Alarm 101:316812519480:1 [SubsystemNotification[subsystem=ThresholdAlarms],LIMITING,system-input-limiter-rejecting-work] was raised at 2014-11-18 23:26:41.134 to level Major

A user rate limiter is designed to reject input above a certain threshold to avoid overloading Rhino. If this threshold is appropriately configured, the messages should occur very rarely. If they occur frequently, it is possible that the solution is not sized appropriately for the incoming load. In this case please contact your solution provider for support.

5.2.6 A dependent external system is not functioning properly

This is difficult to diagnose in a general manner as it is specific to the solution running on Rhino. However, the system administrator should be aware of any external systems with which the solution communicates. Examples may include rating/charging engines, HLRs, HSSs, MSCs, IVRs, SMSCs, MMSCs, databases, etc.

Please refer to the administrators of the dependent systems and, if necessary, contact your solution provider for support.


5.3 A newly started node is unable to handle full traffic load

5.3.1 Symptoms

• A new cluster or node has been brought into service and latencies are much greater than target

• Traffic to a new node fails with timeouts

5.3.2 Workaround or Resolution

Java takes some time to settle during application startup due to Just-in-Time (JIT) compilation of code hotspots. For this reason, we recommend slowly introducing traffic to a newly booted node. Most of the hotspots will be optimised in the first minute; however, it is better to ramp up load over the course of up to 15 minutes to allow JIT compilation and optimization to occur.

Cluster size should be set during provisioning to allow sufficient reserve capacity to handle peak-hour traffic with at least one failed node. This

allows a lot of flexibility in bringing a failed node back into the cluster.

The recommended approach to bringing a node back into the cluster is to use a load-balancing mechanism to limit traffic to the rebooted node (Signalware has this mechanism built in). Usually a smooth increase from zero to a fair share over the course of 5 to 15 minutes (depending on total load) allows the JIT compiler to work without causing call failures or unacceptable latencies.

Another option, if sufficient reserve capacity exists, is to simply wait until cluster load drops to a level that a new node can cope with before rebooting the failed node. This works because the reserve capacity ensures that the node failure is not visible to callers.

For Java 7 update 45 and newer, enabling tiered compilation may result in some improvement at node boot time. This will require a much larger code cache due to the more aggressive compilation. It does not avoid the initial very slow interpreted calls; it merely begins compilation more aggressively, which improves performance faster. This option is on by default in Java 8, where the default code cache size has been increased to handle the extra data.

OPTIONS="$OPTIONS -XX:+TieredCompilation -XX:InitialCodeCacheSize=512m"

Note that to avoid decreasing performance, the code cache size must be set explicitly when using Java 7, as the default size of 48MB is not large enough. Java 8 defaults to 240MB and fixes some performance bugs in flushing old compilation data; however, you may still see benefit from increasing the code cache.


5.4 Uneven CPU load/memory usage across cluster nodes

5.4.1 Symptoms

CPU load or memory usage is inconsistent across the cluster.

5.4.2 Diagnostic steps and correction

Examine the service usage statistics to rule out differences in applied load.

A small difference in CPU load is expected for nodes servicing management connections. These nodes will also consume slightly more heap and Java Permanent Generation (PermGen) memory. If the amount of PermGen memory used on a node is greater than 90% of the maximum, a CMS GC cycle will be triggered to try to reclaim PermGen. While PermGen remains above 90%, CMS cycles will run almost continuously and cause a significant increase in CPU usage. Check the heap statistics for the node or the GC log entries in the console log to determine if this is the cause of the imbalance. Log entries will contain a section similar to the following, with a high proportion of the Perm space used and only a small change after GC:

[CMS Perm : 186608K->186604K(196608K)]

If the problem is memory related, increase the allocated heap or PermGen so that there is sufficient free space for normal operation. For more information on heap configuration see Java Virtual Machine Heap Issues on page 30.

If the problem is not memory related, investigate the balance of load across threads. A small number of threads consuming a large percentage of CPU time may indicate a software bug or a load profile that is unusually serial in execution. On Linux the command top -ch will display the busiest threads and the processes they belong to. Run the command $RHINO_HOME/dumpthreads.sh to trigger a Java thread dump. The thread IDs in the dump can be matched to the PIDs of the threads displayed in top by converting the PID to hexadecimal. Frequently the stack traces for busy threads will be similar. This can suggest the part of the code that contains the bug and may help identify the trigger conditions. Contact your solution provider for assistance, attaching the thread dump and the list of busy threads.
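For example, to convert a thread PID reported by top (12345 is a hypothetical PID) into the hexadecimal form used by the nid field in a Java thread dump:

printf '%x\n' 12345

This prints 3039, which can then be searched for in the thread dump output.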


6 Configuration Problems

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for Rhino configuration.

6.1 Security Related Exceptions

Rhino provides a secured environment for its management infrastructure and for deployed components such as SBBs, Resource Adaptors, Profiles and Mlets. If some component of this management infrastructure attempts to perform an unauthorized operation, then AccessControlExceptions will be reported in the logging output of Rhino. The two most common forms are described in this section.

• Various connection related exceptions on page 63

• Various permission related exceptions on page 63

6.1.1 Various connection related exceptions

Symptoms

The default installation of Rhino is strict about which remote machines may connect to Rhino to perform management operations. The following exceptions are common when a machine that has not been explicitly configured attempts to connect to Rhino. Examples of the two most common messages are as follows.

A refused JMX-Remote connection produces the following message:

Exception in thread "RMI TCP Connection(6)-192.168.0.38" java.security.AccessControlException: access denied (java.net.SocketPermission 192.168.0.38:48070 accept,resolve)
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
    at java.security.AccessController.checkPermission(AccessController.java:427)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
    at java.lang.SecurityManager.checkAccept(SecurityManager.java:1157)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:560)
    at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:208)
    at sun.rmi.transport.Transport$1.run(Transport.java:152)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:460)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:701)
    at java.lang.Thread.run(Thread.java:595)

This can be caused by client/bin/rhino-console, client/bin/rhino-stats, and external clients such as the Rhino Element Manager or any other application which uses Rhino’s JMX-Remote connection support.

The following message is caused by the rhino-console connecting from a host which is not in the list of accepted IP addresses:

2016-12-01 12:18:38.444 ERROR [rhino.main] <RMI TCP Connection(idle)> Uncaught exception detected in thread Thread[RMI TCP Connection(idle),5,RMI Runtime]: java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
java.security.AccessControlException: access denied ("java.net.SocketPermission" "192.168.2.33:34611" "accept,resolve")
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:372)
    at java.security.AccessController.checkPermission(AccessController.java:559)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
    at java.lang.SecurityManager.checkAccept(SecurityManager.java:1170)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.checkAcceptPermission(TCPTransport.java:668)
    at sun.rmi.transport.tcp.TCPTransport.checkAcceptPermission(TCPTransport.java:305)
    at sun.rmi.transport.Transport$2.run(Transport.java:201)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


Resolution

In this case, the components can be given permission to accept connections from certain IP addresses or hostnames by entering those addresses in the appropriate section of config/mlet-permachine.conf for the production system, or config/mlet.conf for the SDK. This is described in more detail in the Rhino Administrators Guide.

If the connection is local, it may be necessary to add that local interface to the LOCALIPS variable in the config/config_variables file. Both

IPv4 and IPv6 addresses will need to be specified, and that node or SDK will need to be restarted.
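For example, a hypothetical entry in config/config_variables specifying both an IPv4 and an IPv6 loopback address (the exact format should follow the existing entries in your file):

LOCALIPS="192.168.0.10 127.0.0.1 ::1"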

6.1.2 Various permission related exceptions

Symptoms

Components inside Rhino run in a security context. These components include SBBs, Resource Adaptors, Profiles and Mlets. If a component

attempts to perform some operation which requires a security permission that the component does not have, then an exception is generated. An

example of this is when a Resource Adaptor attempts to write to the filesystem when it does not have permission to do so.

...
WARN [rhino.remotetx] <StageWorker/RTM/1> Prepare failed:
Node 101 failed: com.opencloud.slee.ext.resource.ResourceException: Could not create initial CDR file
...
    at java.lang.Thread.run(Thread.java:534)
Caused by: java.io.IOException: Access denied trying to create a new CDR logfile
...
    at java.security.AccessController.doPrivileged(Native Method)
    ... 14 more
Caused by: java.security.AccessControlException: access denied (java.io.FilePermission ... read)
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:269)
    at java.security.AccessController.checkPermission(AccessController.java:401)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:524)
    at java.lang.SecurityManager.checkRead(SecurityManager.java:863)
    at java.io.File.exists(File.java:678)
    ... 19 more
...

Diagnosis and Resolution

Typically this indicates that the component is not configured appropriately.


As a step in diagnosing security-related problems, the security manager can be disabled by commenting out the following line in the read-config-variables script:

OPTIONS="$OPTIONS -Djava.security.manager"

This is not recommended as a permanent solution and should not be used on a production system. Disabling the security manager should only be done temporarily, as a means to diagnose the problem in a test environment. To determine which permissions are required, uncomment this line in $RHINO_HOME/read-config-variables:

#OPTIONS="$OPTIONS \-Djava.security.debug=access:failure"

Rhino will then print the permissions that are failing security checks to the console log. The required permissions should be added to the

<security-permissions> section of the component deployment descriptor. For more information regarding granting additional permissions to

an RA or SBB, refer to the sections on RA and SBB Deployment Descriptors in the Rhino Administration Manual .
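The following is an illustrative sketch only, not taken from a real component: a grant allowing a resource adaptor to read and write CDR files under a hypothetical /var/rhino/cdr directory. The element content uses standard Java policy-file grant syntax; check the deployment descriptor documentation for the exact enclosing elements for your component type.

<security-permissions>
  <security-permission-spec>
    grant {
      permission java.io.FilePermission "/var/rhino/cdr/-", "read,write";
    };
  </security-permission-spec>
</security-permissions>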

6.2 Memory Database Full

Rhino’s in-memory database (MemDB) has a specified fixed capacity. Transactions which execute against a MemDB will not commit if the

transaction would take MemDB past its fixed capacity. This typically occurs due to inadequate sizing of a Rhino installation but may indicate faulty

logic in a service or resource adaptor. Log messages containing "Unable to prepare due to size limits of the DB" or "Unable to prepare due to

committed size limits of MemDB Local" are a clear indicator of this problem. The symptoms are varied and are described below.

• Profile Management and/or provisioning failing on page 66

• Deployment failing on page 69

• Calls not being setup successfully on page 71

• Exceptions in Rhino logs

6.2.1 Profile Management and/or provisioning failing

Symptoms

Creating or importing profiles fails. The Rhino log contains a message with a log key profile.* similar to the following:


2016-12-01 13:41:57.081 WARN [profile.mbean] <RMI TCP Connection(2)-192.168.0.204> [foo:8] Error committing profile:
javax.slee.management.ManagementException: Cannot commit transaction
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:36)
    at com.opencloud.rhino.management.SleeSupport.commitTx(SleeSupport.java:28)
    at com.opencloud.rhino.impl.profile.GenericProfile.commitProfile(GenericProfile.java:127)
    at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.management.TxSupport.commitTx(TxSupport.java:33)
    ... 48 more

Resolution and monitoring

If profile management and/or provisioning commands are unsuccessful due to the size restriction of the profile database installed in Rhino, the size of the database should be increased. To resize the profile database, follow the instructions in Resizing MemDB Instances on page 73 to alter the size of the ProfileDatabase.


The following command can be used to monitor the size of the ProfileDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ProfileDatabase

6.2.2 Deployment failing

Symptoms

If deployment is unsuccessful this is possibly due to the size restriction of the ManagementDatabase installed in Rhino.

The error message and Rhino log will look similar to:

javax.slee.management.ManagementException: File storage error
    at com.opencloud.rhino.node.FileManager.store(FileManager.java:119)
    at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:400)
    at com.opencloud.rhino.management.deployment.Deployment.install(Deployment.java:267)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.DynamicMBeanSupport.doInvoke(DynamicMBeanSupport.java:157)
    at com.opencloud.rhino.management.DynamicMBeanSupport$1.run(DynamicMBeanSupport.java:121)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.opencloud.rhino.management.DynamicMBeanSupport.invoke(DynamicMBeanSupport.java:113)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.opencloud.rhino.management.BaseMBeanInterceptor.intercept(BaseMBeanInterceptor.java:12)
    at com.opencloud.rhino.management.AuditingMBeanInterceptor.intercept(AuditingMBeanInterceptor.java:245)
    at com.opencloud.rhino.management.ObjectNameNamespaceQualifierMBeanInterceptor.intercept(ObjectNameNamespaceQualifierMBeanInterceptor.java:74)
    at com.opencloud.rhino.management.NamespaceAssociatorMBeanInterceptor.intercept(NamespaceAssociatorMBeanInterceptor.java:66)
    at com.opencloud.rhino.management.RhinoPermissionCheckInterceptor.intercept(RhinoPermissionCheckInterceptor.java:56)
    at com.opencloud.rhino.management.CompatibilityMBeanInterceptor.intercept(CompatibilityMBeanInterceptor.java:130)
    at com.opencloud.rhino.management.StartupManagementMBeanInterceptor.intercept(StartupManagementMBeanInterceptor.java:97)
    at com.opencloud.rhino.management.RemoteSafeExceptionMBeanInterceptor.intercept(RemoteSafeExceptionMBeanInterceptor.java:26)
    at com.opencloud.rhino.management.SleeMBeanServerBuilder$MBeanServerInvocationHandler.invoke(SleeMBeanServerBuilder.java:38)
    at com.sun.proxy.$Proxy9.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1427)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$2.run(Transport.java:202)
    at sun.rmi.transport.Transport$2.run(Transport.java:199)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:198)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:567)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.access$400(TCPTransport.java:619)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:684)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$1.run(TCPTransport.java:681)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:681)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.slee.TransactionRolledbackLocalException: Cannot commit transaction
    at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:47)
    at com.opencloud.rhino.node.FileManager.store(FileManager.java:112)
    ... 49 more
Caused by: com.opencloud.transaction.TransientRollbackException$$_Safe: Unable to prepare due to size limits of the DB
    at com.opencloud.transaction.local.OCTransactionImpl.commit(OCTransactionImpl.java:107)
    at com.opencloud.transaction.local.OCTransactionManagerImpl.commit(OCTransactionManagerImpl.java:190)
    at com.opencloud.rhino.facilities.FacilitySupport.commitTx(FacilitySupport.java:44)
    ... 50 more

Resolution and monitoring

If deployment commands are unsuccessful due to the size restriction of the management database installed in Rhino, the size of the database should be increased. To resize the management database, follow the instructions in Resizing MemDB Instances on page 73 to alter the size of the ManagementDatabase.

The following command can be used to monitor the size of the ManagementDatabase installed in Rhino.

./client/bin/rhino-stats -m MemDB-Replicated.ManagementDatabase

6.2.3 Calls not being setup successfully

If calls are not being setup successfully, the failures may be caused by the size restriction of either or both of the LocalMemoryDatabase and ReplicatedMemoryDatabase installed in Rhino.

If the service does not use replicated transactions, the Rhino log will contain messages similar to this:

2016-11-30 09:08:01.732 WARN [rhino.er.stage.eh] <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to committed size limits of MemDB Local
    at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
    at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
    at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)

Replicated services will produce log entries similar to this:


2016-11-30 09:08:01.732 WARN [rhino.er.stage.eh] <jr-74> Event router transaction failed to commit
com.opencloud.transaction.TransientRollbackException: Unable to prepare due to size limits of the DB
    at com.opencloud.ob.Rhino.bO.commit(2.3-1.20-85576:108)
    at com.opencloud.ob.Rhino.me.commit(2.3-1.20-85576:191)
    at com.opencloud.ob.Rhino.AB.run(2.3-1.20-85576:95)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at com.opencloud.ob.Rhino.bq$a$a$a.run(2.3-1.20-85576:440)

Resolution and monitoring

If calls are failing due to the size restriction of the in-memory database installed in Rhino, the size of the database should be increased. Data should be collected before resizing to determine which database is reaching its size limit.

The following commands can be used to monitor the size of these databases installed in Rhino.

./client/bin/rhino-stats -m MemDB-Local.LocalMemoryDatabase

./client/bin/rhino-stats -m MemDB-Replicated.domain-0-DomainedMemoryDatabase

./client/bin/rhino-stats -m MemDB-Replicated.domain-0-ReplicatedMemoryDatabase

Note that it may be necessary to additionally monitor the output of the following memory databases in Rhino: MemDB-Replicated.ManagementDatabase and MemDB-Replicated.ProfileDatabase.

If Rhino is configured to use striping, the DomainedMemoryDatabase will be divided into stripes that are each limited to a fraction of the size allocated to the MemDB instance. When using a striped MemDB it is possible for individual stripes to become full without the whole database filling. This is typically a symptom of a highly asymmetric workload or a poorly designed service. To check for full stripes, monitor all the stripe statistics in the DB displaying problems. For example, the DomainedMemoryDatabase stripes in an 8-stripe configuration are:

"MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-0" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-1" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-2" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-3" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-4" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-5" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-6" "MemDB-Replicated.domain-0-DomainedMemoryDatabase.domain-0-DomainedMemoryDatabase.stripe-7"


If the committed size of the whole MemDB instance reported is close to the maximum, increase the configured size of this instance. If the committed size of only one stripe is close to the maximum then the service is a poor match for striping. Reduce the number of stripes.

To resize a MemDB instance follow the instructions in Resizing MemDB Instances on page 73 to alter the size of the LocalMemoryDatabase, DomainedMemoryDatabase or ReplicatedMemoryDatabase.

When monitoring the MemDB, watch the output of the churn, committedSize and maxCommittedSize fields. The churn field is the total change in the content of the Memory Database in bytes. The committedSize field is the current committed size of the Memory Database in kilobytes. The maxCommittedSize field is the maximum allowed committed size of the Memory Database in kilobytes. If the difference between committedSize and maxCommittedSize is close to the churn, or less than 100kB, it is possible that the Memory Database is not committing transactions because they would grow the database past its specified limit.

If the committedSize increases over time or does not approximately track the system load, this may indicate a fault in a resource adaptor or service. Increase the MemDB size to prevent call failures and contact your solution provider for assistance.

For other problems that cause dropped calls refer to sections Dropped Calls on page 58, Java Virtual Machine heap issues on page 30 and Application or resource adaptor heap issues on page 32.

6.2.4 Resizing MemDB Instances

The general workaround for MemDB sizing problems is to make the appropriate Memory Database larger. To do this, edit the node-*/config/rhino-config.xml file and look for a <memdb> or <memdb-local> entry which has a <jndi-name> with the appropriate database name (e.g. ProfileDatabase). Increase the <committed-size> of the database to increase its maximum committed size.

Increasing the size of a Memory Database also requires increasing the maximum heap size of the JVM to accommodate the larger database. To do this, edit the node-*/config/config_variables file and increase the HEAP_SIZE by the same amount.
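For example, if a Memory Database's committed size was increased by 50 MB, the heap grows by the same amount. The values below are illustrative only; keep whatever unit suffix your file already uses:

# node-*/config/config_variables
# before: HEAP_SIZE=1024m
HEAP_SIZE=1074m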

These steps will need to be performed for all nodes, and the nodes will need to be restarted for the change to take effect.

Initially try doubling the size of a database and monitor usage to determine the relationship between load and usage. Assuming usage is proportional to call load, you may then need to alter the configured size to accommodate the highest sustained load peak. If usage does not appear to be proportional to load, or the problem is not solved after a second increase in size, please contact your solution provider for support.


6.3 Resource Adaptors refuse to connect using TCP/IP

Make sure that your SLEE is in the “Running” state. A SLEE that is not running will not activate resource adaptors, so it will neither listen for nor initiate connections.

If you have IPv6 installed on your machine, then Java may be attempting to use IPv6 rather than IPv4. Depending on the configuration of the network services and the host computer’s interfaces, Java may be resolving hostnames differently from the other network components. If this is the case then the resource adaptor may attempt to connect to the remote system using IPv6 when the system is only listening on an IPv4 address. Less frequently the reverse may be true.

6.3.1 Diagnostic steps

Use the network diagnostic tools provided with your OS distribution to check the current connected and listening ports for running programs. One common tool for this purpose is netstat:

$ netstat --inet6 -anp
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address            Foreign Address          State        PID/Program name
tcp6       0      0 :::1202                  :::*                     LISTEN       6429/java
tcp6       0      0 :::1203                  :::*                     LISTEN       6429/java
tcp6       0      0 :::53                    :::*                     LISTEN       -
tcp6       0      0 :::22                    :::*                     LISTEN       -
tcp6       0      0 ::1:631                  :::*                     LISTEN       -
tcp6       0      0 ::1:5432                 :::*                     LISTEN       -
tcp6       0      0 :::51673                 :::*                     LISTEN       -
tcp6       0      0 :::9474                  :::*                     LISTEN       6429/java
tcp6       0      0 :::1199                  :::*                     LISTEN       6429/java
tcp6       0      0 :::111                   :::*                     LISTEN       -
tcp6       0      0 :::22000                 :::*                     LISTEN       6429/java
tcp6       0      0 127.0.0.1:55682          127.0.0.1:5432           ESTABLISHED  6429/java
tcp6       0      0 127.0.0.1:55684          127.0.0.1:5432           ESTABLISHED  6429/java
udp6       0      0 ::1:35296                ::1:35296                ESTABLISHED  -
udp6       0      0 :::45751                 :::*                                  -
udp6       0      0 :::5353                  :::*                                  7991/plugins --disa
udp6       0      0 192.168.0.204:16100      :::*                                  6429/java
udp6       0      0 :::53                    :::*                                  -
udp6       0      0 :::111                   :::*                                  -
udp6       0      0 fe80::9eeb:e8ff:fe0:123  :::*                                  -
udp6       0      0 ::1:123                  :::*                                  -
udp6       0      0 :::123                   :::*                                  -
udp6       0      0 :::704                   :::*                                  -
raw6       0      0 :::58                    :::*                     7            -
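To compare against the IPv4 listeners, the same tool can be run for the IPv4 address family:

$ netstat --inet -anp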

Check the firewall configuration on both the host running Rhino and the host providing the network function the RA connects to. Also check any intermediate network components that provide filtering functions.

6.3.2 Workaround or Resolution

Try adding -Djava.net.preferIPv4Stack=true as a command-line argument to Rhino. This option can be set by adding the line EXTRA_OPTIONS=-Djava.net.preferIPv4Stack=true to the RHINO_HOME/config_variables file. Other Java programs that connect to Rhino may also need this argument added. For other programs an RA connects to, consult the documentation for that program to learn how to configure the address family to use. If a program cannot be configured appropriately you may need to disable support for IPv6 in the OS configuration.

If a firewall is configured to block the addresses used by the resource adaptor it must be reconfigured to allow connections.

6.4 Local hostname not resolved properly

In some cases the local name of a host (e.g. test-vm.example.com) is resolved wrongly (to 127.0.0.1 instead of its external address, e.g. 192.168…). This results in remote connections being refused when trying to connect via management clients from a different host.

6.4.1 Symptoms

• Connections via management clients are refused

6.4.2 Diagnostic steps

First check that proper permissions have been given in the appropriate sections of $RHINO_HOME/config/mlet.conf (SDK) or $RHINO_HOME/etc/default/config/mlet-permachine.conf (Production).


Check that, no matter whether you are trying to connect through the Standalone Webconsole or the Commandline Console application, you get an exception similar to the one below:

...
Connection failed to mycomputer:1199
Connection failed to mycomputer:1199 (unreachable at Thu Apr 26 11:32:10 NZST 2007)
javax.security.auth.login.LoginException: Could not connect
...
Caused by: com.opencloud.slee.client.SleeClientException: Could not connect to a Rhino management host
at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:186)
at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.connect(RhinoJmxClient.java:124)
at com.opencloud.slee.mlet.shell.spi.jmx.RhinoJmxClient.login(RhinoJmxClient.java:242)

6.4.3 Workaround or Resolution

Edit the contents of the file /etc/hosts. It should contain entries as shown below.

...
127.0.0.1       localhost
192.168.123.123 mycomputer
...

Entries that resolve the machine’s name to its local address should be commented out:

...
#127.0.0.1 localhost.localdomain localhost mycomputer
...


7 Management

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for Rhino management tools and utilities.

7.1 Connections Refused for the Command Console, Deployment Script or Rhino Element Manager

The remote management clients cannot connect to the Rhino SLEE.

7.1.1 Symptoms

The management clients show the following error when attempting to connect to the SLEE:

user@host:~/rhino/client/bin$ ./rhino-console
Could not connect to Rhino: [localhost:1199] Connection refused
-> This normally means Rhino is not running or the client is connecting to the wrong port.

Use -D switch to display connection debugging messages.

Could not connect to Rhino: [localhost:1199] No route to host

Use -D switch to display connection debugging messages.

Could not connect to Rhino: [localhost:1199] Could not retrieve RMI stub
-> This often means the m-let configuration has not been modified to allow remote connections.

Use -D switch to display connection debugging messages.

BUILD FAILED
~/rhino/client/etc/common.xml:99: The following error occurred while executing this line:
~/rhino/client/etc/common.xml:77: error connecting to rhino: Login failed


7.1.2 Diagnostic steps and correction

Rhino is not listening for management connections

First, check that there is a running Rhino node on the host the client is trying to connect to. Use the ps command to check that the Rhino process is running, e.g. ps ax | grep Rhino. If Rhino is running, check the rhino.log to determine if the node has joined the primary component and started fully. If the Rhino node is failing to join the primary component or otherwise failing to fully start, then consult the Clustering troubleshooting guide.

Make sure that the remote host is accessible using the ping command. Alternatively, verify that you can log in to the remote host using ssh to confirm the network connection is working (some firewalls block ping).

Rhino refuses connections

By default, Rhino is set up to not allow remote connections by management clients. Permissions to do so need to be manually configured before starting the SLEE, as described in the next section.

The management clients connect to Rhino via SSL-secured JMX connections. These require both a client certificate and permission to connect configured in the Java security configuration for Rhino.

To allow remote connections to the SLEE, the MLet configuration file will need to be edited. On the SDK version of Rhino this is $RHINO_HOME/config/mlet.conf, and for the Production version of Rhino it is $RHINO_HOME/node-???/config/mlet-permachine.conf for each node.

Edit the MLet configuration file and add the following permission to the JMXRAdaptor MLet security-permission-spec. This should already be present but commented out in the file. You will need to replace “host_name” with either a host name or a wildcard (e.g. *).

grant {
    permission java.net.SocketPermission "{host_name}", "accept,resolve";
};

It is also possible that the Rhino SLEE host has multiple network interfaces and has bound the RMI server to a network interface other than the one that the management client is trying to connect to.

If this is the case then the following could be added to $RHINO_HOME/read-config-variables for the SDK, or $RHINO_HOME/node-???/read-config-variables for the Production version of Rhino:


OPTIONS="$OPTIONS -Djava.rmi.server.hostname={public IP}"

Rhino will need to be restarted in order for any of these changes to take effect. For the SDK, this simply means restarting it. For the Production version, this means restarting the particular node that has had these permissions added.

Management client is not configured to connect to the Rhino host

Make sure that the settings for the management clients are correct. For rhino-console, these are stored in client/etc/client.properties. You can also specify the remote host and port to connect to using the -h <hostname> and -p <port> command-line arguments. If the SLEE has been configured to use a different port than the standard one for management client connections (and this has not been configured in the client/etc/client.properties file), then the port will also need to be specified on the command line.

If connecting to localhost then the problem is likely to be a misconfigured /etc/hosts file causing the system to resolve localhost to an address other than 127.0.0.1.

For Ant deployment scripts, run with ant -v; Ant will report the underlying exception, which provides more detail.

To run the command console or run deployment scripts from a remote machine:

1. Copy $RHINO_HOME/client to the host.

2. Edit the file client/etc/client.properties and change the remote.host property to the address of the Rhino host (a sketch follows this list).

3. Make sure your Ant build script is using the correct client directory. The Ant property ${client.home} must be set to the location of your client directory.
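As an illustration only, the remote connection settings in client/etc/client.properties might look like the following. The remote.host property is named above; the port property name is an assumption, so check the comments in your shipped file:

# client/etc/client.properties (illustrative values)
remote.host=rhino1.example.com
remote.port=1199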

7.2 A Management Client Hangs

The management clients use SSL connections to connect securely to Rhino. To generate keys for secure connections, they read (and block while doing so) from the /dev/random device. The /dev/random device gathers entropic data from the current system’s devices, but on an idle system it is possible that there is no entropy to gather, meaning that a read from /dev/random will block.

7.2.1 Symptoms

A management client hangs for a long period of time on start-up as it tries to read from /dev/random.


7.2.2 Workaround or Resolution

The ideal problem resolution is to create more system entropy. This can be done by wiggling the mouse, or on a remote server by logging in

and running top or other system utilities. Refer also to the operating system’s documentation, on Linux this is the random(4) man page: man 4

random .
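On Linux, a quick way to confirm that entropy starvation is the cause (a diagnostic suggestion, not part of the original procedure) is to check the kernel's available-entropy counter; values persistently near zero mean reads from /dev/random will block. The output shown is an example only:

$ cat /proc/sys/kernel/random/entropy_avail
180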

7.3 Statistics client reports “Full thread sample containers”

If statistics gathering is done at a sampling rate which is set too high, the per-thread sample containers may fill before the statistics client can read the statistics out of those containers.

7.3.1 Symptoms

When gathering statistics, the following may appear in the logs:

2006-10-16 12:59:26.353 INFO [rhino.monitoring.stats.paramset.Events] <StageWorker/Misc/1> [Events] Updating thread sample statistics
found 4 full thread sample containers

This is a benign problem and can be safely ignored. The reported sample statistics will be slightly inaccurate. To prevent it, reduce the sampling rate.

7.4 Statistics Client Out of Memory

When running in graphical mode, the statistics client will, by default, store 6 hours of statistical data. If there is a large amount of data, or if the statistics client is set to gather statistics for an extended period of time, it is possible for the statistics client to fail with an Out of Memory exception.

7.4.1 Symptoms

The statistics client will fail with an Out of Memory Exception.


7.4.2 Workaround or Resolution

We recommend using the -k option of the statistics client when running in graphical mode to limit the number of hours of statistics kept. If statistics must be kept for a longer period, run the statistics client in command-line mode and pipe the output to a text file for later analysis.
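For example (illustrative invocations only; the -g and -k argument forms shown here are assumptions, so confirm them against the usage text mentioned below):

# Graphical mode, keeping roughly two hours of data
./client/bin/rhino-stats -g -k 2

# Command-line mode, piping output to a file for later analysis
./client/bin/rhino-stats -m MemDB-Local.LocalMemoryDatabase > memdb-stats.txt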

For more information, run client/bin/rhino-stats without any parameters. This will print a detailed usage description of the statistics client.

7.5 Creating a SyslogAppender gives an AccessControlException

Creating a SyslogAppender using the following entry in logging.xml will not work, as the appender does not perform its operations using the proper security privileges:

<appender appender-class="org.apache.log4j.net.SyslogAppender" name="SyslogLog">
    <property name="SyslogHost" value="localhost"/>
    <property name="Facility" value="user"/>
</appender>

7.5.1 Symptoms

The following error would appear in Rhino’s logs:

2006-10-19 15:16:02.311 ERROR [simvanna.threadedcluster] <ThreadedClusterDeliveryThread> Exception thrown in delivery thread
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:514 connect,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java:427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at java.lang.SecurityManager.checkConnect(SecurityManager.java:1034)
at java.net.DatagramSocket.send(DatagramSocket.java:591)
at org.apache.log4j.helpers.SyslogWriter.write(SyslogWriter.java:69)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:39)
at org.apache.log4j.helpers.SyslogQuietWriter.write(SyslogQuietWriter.java:45)
at org.apache.log4j.net.SyslogAppender.append(SyslogAppender.java:245)


7.5.2 Workaround or Resolution

For the case where a SyslogAppender is required, the createsyslogappender command of the rhino-console provides a much easier interface to achieve this task.

Replacing the entry org.apache.log4j.net.SyslogAppender above with com.opencloud.rhino.logging.RhinoSyslogAppender will also fix this problem. The Open Cloud version of the SyslogAppender is a simple wrapper around the Log4J version which wraps the append(LoggingEvent event) method in a “doPrivileged” block. For custom appenders not provided with Rhino, the same method can be used:

public void append(final LoggingEvent event) {
    AccessController.doPrivileged(new PrivilegedAction() {
        public Object run() {
            RhinoSyslogAppender.super.append(event);
            return null;
        }
    });
}

7.6 Platform Alarms

Rhino raises alarms in various situations, some of which are discussed in this section for troubleshooting purposes. The full list of current Rhino core alarms is available using the alarmcatalog command in the rhino-console.

7.6.1 Symptoms

• Alarm notification messages in Rhino logs

• Alarms appearing in network management systems

• Entries present in rhino-console command listactivealarms

7.6.2 Diagnostic steps

The presence of active alarms may be viewed in the output of the following command:

./client/bin/rhino-console listactivealarms


1. Upon the loss of a node from the cluster, an alarm with an alarm type of rhino.node-failure and an alarm source of ClusterStateListener is raised. This alarm is cleared either by the administrator or when the node rejoins the cluster. This alarm is not raised for quorum nodes.

2. If a user rate limiter capacity is exceeded, an alarm with an alarm source of ThresholdAlarms is raised. This alarm is cleared when the event rate drops below the limiter’s configured capacity.

3. If a JMX mlet cannot be started successfully, an alarm with an alarm source of MLetStarter is raised. These alarms must be cleared manually.

4. If the rule for any user-defined threshold-based alarm is met, an alarm with a user-defined alarm type and alarm source is raised. These alarms are cleared when the rule condition is no longer met or if the administrator clears them.

5. The licenses installed on the platform are insufficient for the deployed configuration. The alarm type is “rhino.license” and the alarm source is “LicenseManager”. This alarm is raised when:

   a. A license has expired.

   b. A license is due to expire in the next seven days.

   c. License units are being processed for a currently unlicensed function.

   d. The consumption rate for a particular license is greater than the consumption rate which the license allows.

The alarm type used in notifications that report an alarm has cleared is the original alarm type plus .clear, for example rhino.node-failure.clear.

7.6.3 Workaround or Resolution

Alarms with an alarm source of ThresholdAlarms indicate that the system is receiving more input than it has been configured to receive. Alarms with an alarm source of LicenseManager indicate that a Rhino installation is not licensed appropriately. Other alarms are either user-defined, or defined by an Application or Resource Adaptor.

Alarms with an alarm source of MLetStarter or some other non-obvious key usually indicate a software issue, such as a misconfiguration of the installed cluster.

In most of these cases, the remedy is to contact your solution support provider for a new license or for instructions on how to resolve the situation.


7.7 DeploymentException when trying to deploy a component

7.7.1 Symptoms

DeploymentException when trying to deploy a component.

7.7.2 Diagnostic steps

Native Library XXXXLib.so already loaded in another classloader

Each time you deploy the RA, it happens in a new classloader (because the code may have changed). If no class GC has happened, or if something is holding a reference to the old classloader and keeping it alive, the old library will still be loaded as well. See http://java.sun.com/docs/books/jni/html/design.html#8628

A related failure can occur if the deployable unit jar has been repackaged with an archiver that uses backslashes as path separators, for example after modifying the deployable-unit.xml file within it, and the output of jar -tvf looks similar to the below:

$ jar -tvf service.jar
 1987 Wed Jun 13 09:34:02 NZST 2007 events.jar
76358 Wed Jun 13 09:34:02 NZST 2007 sbb.jar
  331 Wed Jun 13 09:34:02 NZST 2007 META-INF\deployable-unit.xml
  106 Wed Jun 13 09:34:02 NZST 2007 META-INF\MANIFEST.MF
  693 Wed Jun 13 09:34:02 NZST 2007 service-jar.xml

7.7.3 Workaround or Resolution

• Restart Rhino nodes before redeployment.

• Force a full GC manually before redeployment (requires Rhino to be configured with -XX:-DisableExplicitGC).

• Change the JNI library name whenever redeploying.

• Ensure the classes that use JNI are loaded by a higher-level classloader, e.g. the Rhino system classloader or a library (of course, that also means you can’t deploy new versions of those classes at runtime).

Jars always use forward slashes ("/") as a path separator. Repackage the DU jar with a different file archiver, preferably the jar tool.


7.8 Deploying to multiple nodes in parallel fails

7.8.1 Symptoms

You are deploying Rhino using a script that creates and deploys components to multiple nodes asynchronously. The deployment fails with one of the following exceptions on each node. When deploying to the nodes serially, one after the other, no exceptions are reported.

WARN [rhino.management.deployment]
Installation of deployable unit failed:
javax.slee.management.AlreadyDeployedException: URL already installed: file:/opt/rhino/apps/sessionconductor/rhino/dist/is41-ra-type_1.2-du.jar
at com.opencloud.rhino.management.deployment.Deployment.install(4276)

[WARN, rhino.management.resource, RMI TCP Connection(4)-192.168.84.173] -->
Resource adaptor entity creation failed:
java.lang.IllegalStateException: Not in primary component
at com.opencloud.ob.Rhino.runtime.agt.release(4276)
...

7.8.2 Diagnostic steps

Rhino provides a single system image for management. You do not need to deploy a DU on each node in a cluster. Installing a deployable unit on any node in a Rhino cluster propagates that DU to all nodes in the cluster, so if the DU has already been deployed via node 102, it can’t also be deployed via node 101.

In addition, if a new node is created and joins a running cluster, it will be automatically synchronised with the active cluster members (i.e. DUs installed, service states, log levels, trace levels, alarms, etc.).

A Rhino cluster will only allow one management operation that modifies internal state to be executed at any one time, so you can’t, for example, install a DU on node 101 and a DU on node 102 at the same time. One of the install operations will block until the other has finished. You can run multiple read-only operations simultaneously, though.

7.8.3 Workaround or Resolution

Create all nodes from the same base install, optionally starting the nodes in parallel. Wait for the nodes to start, then run the component installation script against only one node.


7.9 Management of multiple Rhino instances

7.9.1 Symptoms

You are trying to use rhino-console to talk to multiple Rhino instances, but it will not connect to the second instance.

7.9.2 Workaround or Resolution

Unfortunately it is not possible to store keys for multiple Rhino instances in the client’s keystore; they are stored using fixed aliases. With the current implementation, there are two ways to connect to multiple Rhino instances from a single management client:

1. Copy the rhino-private.keystore to all the Rhino home directories so that all instances have the same private key on the server. This may be adequate for test environments.

2. Create a copy of client.properties that points to a different client keystore, and tweak the scripts to parameterise the client.properties Java system property. Example:

OPTIONS="$OPTIONS -Dclient.properties=file:$CLIENT_HOME/etc/${RMISSL_PROPERTIES:client.properties}"

If doing this you may also want to parameterise the keystore password to restrict access to authorised users.
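With the scripts parameterised as above, selecting an instance might then look like this (the properties file name is hypothetical):

RMISSL_PROPERTIES=rhino2-client.properties ./client/bin/rhino-console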

7.10 Deployment problem on exceeding DB size

7.10.1 Symptoms

Deployment fails with Unable to prepare due to size limits of the DB

7.10.2 Diagnostic steps

See Memory Database Full on page 66 for how to diagnose and resolve problems with the size of the Rhino in-memory databases, including the management database.


7.12 BUILD FAILED when installing an OpenCloud product

7.12.1 Symptoms

Installation fails with an error like:

$:/opt/RhinoSDK/cgin-connectivity-trial-1.5.2.19 # ant -f deploy.xml
Buildfile: deploy.xml

management-init:
     [echo] Open Cloud Rhino SLEE Management tasks defined

login:

BUILD FAILED
/opt/RhinoSDK/client/etc/common.xml:102: The following error occurred while executing this line:
/opt/RhinoSDK/client/etc/common.xml:74: No supported regular expression matcher found: java.lang.ClassNotFoundException: org.apache.tools.ant.util.regexp.Jdk14RegexpRegexp

Total time: 0 seconds

7.12.2 Diagnostic steps

Run Ant with debugging output to check which version of Ant is being used and the classpath:

ant -d -f deploy.xml > output.txt

7.12.3 Workaround or Resolution

Add the missing libraries to your Ant lib directory.

We recommend using the Ant version shipped with Rhino to avoid this problem, e.g.:

/opt/RhinoSDK/client/bin/ant -f deploy.xml


7.13 REM connection failure during management operations

7.13.1 Symptoms

Performing a management operation, e.g. activating an RA entity, fails with the following error:

Could not acquire exclusive access to Rhino server

7.13.2 Diagnostic steps

The message is sometimes seen when Rhino is under load and JMX operations are slow to return. Check the CPU load on the Rhino servers. REM exceptions of this type can also occur when stopping or starting the whole cluster.

When the REM auto-refresh interval is set to a low value (the default is 30 seconds) there is a high likelihood of a lock collision occurring. With higher auto-refresh intervals the likelihood drops. With the auto-refresh interval set to "Off" the exception may not occur at all.

If the rem.interceptor.connection log key is set to DEBUG in REM’s log4j.properties, the logs will show which operations could not acquire the JMX connection lock.

7.13.3 Workaround or Resolution

If the CPU load on the Rhino server is high then follow the resolution advice in Operating environment issues on page 28.

If the auto-refresh interval is low then increase it until the problem stops.

For further diagnostic and resolution assistance contact Open Cloud or your solution provider, providing the REM logs.

7.14 Export error: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported

7.14.1 Symptoms

Trying to export Rhino configuration with rhino-export fails with an error like:


com.opencloud.ui.snapshot.SnapshotClientException: Multiple Profile Snapshot for profiles residing in seperate memdb instances is unsupported
at com.opencloud.ob.client.be.a(80947:202)
at com.opencloud.ui.snapshot.SnapshotClient.performProfileTableSnapshot(80947:294)
at com.opencloud.rhino.management.exporter.Exporter.b(80947:382)
at com.opencloud.rhino.management.exporter.Exporter.a(80947:350)
at com.opencloud.rhino.management.exporter.Exporter.run(80947:291)
at com.opencloud.rhino.management.exporter.Exporter.main(80947:201)
Press any key to continue...

7.14.2 Workaround or Resolution

Run rhino-export with the -J option to use JMX for exporting the profile data. This is slightly less efficient but can handle multiple profile storage locations.
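For example (a sketch only; rhino-export also takes connection options, and the output directory name shown here is hypothetical):

./client/bin/rhino-export -J /tmp/rhino-export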

7.15 Unused log keys configured in Rhino

7.15.1 Symptoms

After installing multiple versions of a service on Rhino, listlogkeys reports a number of obsolete keys. How can these be removed?

7.15.2 Workaround or Resolution

The unused log keys are marked for removal; they will disappear when the Rhino JVM restarts. Until then they are harmless.

7.16 Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT

7.16.1 Symptoms

Any management operation done on the cluster fails with:

2014-06-12 11:40:39.156 WARN [rhino.management.trace] <RMI TCP Connection(79)-10.240.83.131> Error setting trace level of root tracer for SbbNotification[service=ServiceID[name=TST,vendor=XXX,version=1.2.0],sbb=SbbID[name=SDMSbb,vendor=XXX,version=1.2.0]]:
com.opencloud.savanna2.framework.lock2.LockUnavailableException: Timeout waiting for distributed lock acquisition: lock=LOCK_MANAGEMENT, current owners=[TransactionId:[101:153930147780781]]
at com.opencloud.ob.Rhino.aR.a(2.3-1.12-72630:662)
at com.opencloud.ob.Rhino.aR.acquireExclusive(2.3-1.12-72630:65)
at com.opencloud.rhino.management.trace.Trace.setTraceLevel(2.3-1.12-72630:256)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

7.16.2 Diagnostic steps

The management lock is acquired and held for the duration of all management operations to prevent concurrent modification of Rhino state and on-disk data. To find out more information about the transaction holding the lock, use the gettransactioninfo console command. To find out what method is blocking release of the lock, use jstack $(cat node-???/work/rhino.pid) or kill -QUIT $(cat node-???/work/rhino.pid) to dump the current thread state to the Rhino console log.
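For readability, the two commands are shown below (node-??? is the placeholder used above for the node directory):

# Non-destructive thread dump, printed to stdout
jstack $(cat node-???/work/rhino.pid)

# Asks the JVM to dump thread state to the Rhino console log; does not stop the process
kill -QUIT $(cat node-???/work/rhino.pid)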

7.16.3 Workaround or Resolution

Contact your solution provider with the Rhino logs showing the problem and a list of the management operations that were performed immediately prior to the one that timed out. If the management operation is permanently blocked, e.g. by an infinite loop in the raStopping() callback of an RA, the cluster will need to be restarted to interrupt the stuck operation. If it is not permanently blocked, you must wait until the operation has finished.

7.17 Log level for trace appender not logging

7.17.1 Symptoms

Setting the log level for a logger trace.??? does not change the level of information logged under this key. The log level was set using the setloglevel command.

7.17.2 Workaround or Resolution

The log keys trace.??? are special: they belong to SLEE tracers, which have their own level configuration that they feed into the logging subsystem. Use the settracerlevel command to set tracer levels for RA entities and SBBs.
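For example, the first command below shows the authoritative argument syntax on your installation; the second is an illustrative invocation whose argument layout and RA entity name are assumptions:

./client/bin/rhino-console help settracerlevel
./client/bin/rhino-console settracerlevel resourceadaptorentity myraentity root Finest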


7.18 Access to REM fails with Command CHECK_CONNECTION invoked without connection ID

7.18.1 Symptoms

After updating or reinstalling REM, access fails with an error in the REM log similar to:

7.18.2 Workaround or Resolution

The most common cause for these errors is that the browser-hosted part of REM does not match the server code. Refresh the browser tab to reload the client code. You may need to clear the browser cache.


8 Database

Below are troubleshooting steps -- symptoms, diagnostic steps, and workarounds or resolutions -- for the Rhino management database persistent store.

8.1 Management Database Server Failure

Rhino uses either the PostgreSQL or Oracle Database Server as a non-volatile storage mechanism. It is possible for the database server to terminate due to administrative action or software failure. Failure to communicate with the database server may also occur due to a network fault. This is a problem related to the database only and is not indicative of a problem in Rhino.

A Rhino cluster gracefully handles the failure of a management database server. However, if all nodes in the cluster fail and are restarted before the database server is restarted, then the cluster will pause in its booting process until a server is available. If all nodes in the cluster fail during the period the database server is unavailable, then any configuration or profile changes made after the database failure will be lost.

8.1.1 Symptoms

Warning messages in the system console from the log key memdb.storer

Messages similar to the following indicate that a PostgreSQL Database Server has terminated either due to administrative action or failure.

...
WARN [memdb.storer] <SQLStableStorage Storer thread for rhino_profiles> [ProfileDatabase/rhino_profiles] Connection failed
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
...
at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
... 8 more
WARN [memdb.storer] <SQLStableStorage Storer thread for rhino_management> [ManagementDatabase/rhino_management] Connection failed
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:201)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:388)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:329)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
...
at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.EOFException
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:243)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1122)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
... 8 more
...

8.1.2 Resolution and Mitigation

The management database server should be restarted as soon as possible. Once the database has been restarted and the Rhino cluster has reconnected, messages similar to the following should appear in the Rhino logs.

...
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_profiles> [ProfileDatabase/rhino_profiles] Connected to database
INFO [memdb.storer] <SQLStableStorage Storer thread for rhino_management> [ManagementDatabase/rhino_management] Connected to database
...

To reduce the risk of this causing data loss, Rhino can be configured to replicate persisted data across multiple PostgreSQL instances. When using Oracle as the persistence database, replication is provided by the Oracle RAC clustering mechanism.


9 Signalware

Below are troubleshooting steps — symptoms, diagnostic steps, and workarounds or resolutions — for using Signalware with Rhino’s CGIN Resource Adaptor.

9.1 CGIN RA to Signalware Backend Connection Errors

Rhino’s CGIN resource adaptor for the SS7 protocol family requires a native backend process, known as a Signalware backend, to communicate with the Signalware SS7 stack. The Signalware backend process must run on the Signalware host systems, and the resource adaptor communicates with it via TCP. Generally, the cause of connection errors is configuration related or a version mismatch between the resource adaptor and the backend.

9.1.1 Symptoms

• Resource Adaptor cannot connect to backend.

• Resource Adaptor drops connection to backend.

9.1.2 Diagnostic steps

If, when activating the resource adaptor (or starting the Rhino SLEE with the resource adaptor active), the following message immediately appears in the Rhino logs and an alarm is raised stating that a backend connection has been lost, then the backend process is not reachable by the resource adaptor. Generally this means it is either not running or, due to resource adaptor or network misconfiguration, the resource adaptor is unable to establish a TCP connection to it.

2013-07-18 14:54:24.907 Major [rhino.facility.alarm.manager] <New I/O worker #47> Alarm 101:140617544395779 [RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102] was raised at 2013-07-18 14:54:24.907 to level Major
    Lost connection to backend localhost:10102
2013-07-18 14:54:24.907 INFO [rhino.facility.alarm.csv] <New I/O worker #47> 2013-07-18 14:54:24.907,raised,2013-07-18 14:54:24.907,101:140617544395779,RAEntityNotification[entity=insis-ptc-external],noconnection,localhost:10102,Major,Lost connection to backend localhost:10102,

94

Page 95: Rhino Troubleshooting Guide - Metaswitch · 2018. 9. 17. · Rhino Troubleshooting Guide (V2.5.0)2 Common Symptoms Below are troubleshooting steps — symptoms, diagnostic steps,

Rhino Troubleshooting Guide (V2.5.0)

If the resource adaptor successfully established a connection to the backend but later one or both of the following messages appear in the logs, this indicates that the connection to the backend process was set up successfully but lost due to backend or network failure.

2011-03-29 10:20:30.942 Warning [trace.cginra.backend.[sedition:24146]] <RAContextTimer-cginra> sedition:24146#4: connection heartbeat lost. Last sent: 1301347220, last received: 1301347210

2016-11-17 16:20:26.070 Major [rhino.facility.alarm.manager] <cginra-thread-1> Alarm 101:194478654849023 [RAEntityNotification[entity=cginra],noconnection,autotests-signalware:10101] was raised at 2016-11-17 16:20:26.069 to level Major
    Lost connection to backend autotests-signalware:10101
2016-11-17 16:20:26.071 Warning [trace.cginra.backend.[autotests-signalware:10101]] <cginra-thread-1> Connection autotests-signalware:10101#1 lost: Remote host closed connection

9.1.3 Workaround or Resolution

First determine whether the backend process is running. If the backend process is not running then either it or Signalware may be incorrectly configured.

If the backend process is running, check that the host and port it is listening on are correctly configured in the resource adaptor entity configuration properties. Also verify that the host running the backend is reachable from the Rhino host and that the port is not blocked by a firewall on either host.

If a connection is established but gets dropped, this indicates either a software failure in the backend or a network-level failure. Check that the backend process is running and, if not, restart it. Any error messages from a failed backend should be sent to your solution provider to determine the cause of failure. If the backend process is still running, check for any network connectivity problems between the Rhino host and the host running the backend processes.

9.2 CGIN RA Cannot Create Outgoing Dialogs

If one of Rhino’s Signalware Resource Adaptors has more than the configured maximum number of active dialogs with a single backend process, it will fail to allocate new dialogs.

9.2.1 Symptoms

• Applications fail to create outgoing dialogs using the CGIN Resource Adaptor


9.2.2 Diagnostic steps

If the backend process has reached its maximum number of active dialogs, a warning similar to the following will be logged by the Resource Adaptor:

2016-11-17 14:49:42.410 Warning [trace.cginra.backend.[signalware:10100]] <cginra-thread-2> Unable to allocate a dialog handle: Out of dialog ids

Additionally an error similar to the following will be visible in the output of the backend process:

2016-11-17 11:45:28.823003 rhino_252_1: Failed to allocate a new dialog ID. errno=7

9.2.3 Workaround or Resolution

By default each backend process supports up to 32,000 simultaneous dialogs. If the number of simultaneous dialogs exceeds this limit the above errors will occur. There are two approaches to working around this issue:

• Increase the number of dialogs per backend process using the -maxdialogs N parameter (where N is the number of dialogs; up to 32,000 concurrent dialogs are supported).

• If the backend process is already configured for the maximum 32,000 dialogs, add additional backend processes and spread the load across them to support more active calls. See https://developer.opencloud.com/devportal/display/CGIN1v5/Using+the+Ulticom+Signalware+TCAP+Stack for details.

9.3 CGIN RA Cannot Receive Incoming Dialogs

9.3.1 Symptoms

• Incoming dialogs do not arrive at the CGIN RA.

9.3.2 Diagnostic steps

If the backend process has reached its maximum number of active dialogs, it will log something similar to the following:


2016-11-17 13:12:39.035658 rhino_252_1: cTCAPTakeMsg() failed: (ctcap_errno=12), Out of Dialog entries

9.3.3 Workaround or Resolution

The root cause and solution for this are identical to those for CGIN RA Cannot Create Outgoing Dialogs on page 95. Please see that section for details.

9.4 Problems with Signalware not involving the CGIN backends

Please refer to the Signalware Installation Manual for more information about installing and configuring Signalware. If you do not find a solution for your problem, contact your solution provider and provide the system report created by the dumpce command.


10 Exit Codes

Below is a list of the possible exit codes of the Rhino process.

10.1 Rhino Exit Codes

Exit Code | Error                                              | Action
1         | Unspecified error                                  | See the Rhino and console logs for details.
2         | Misconfiguration found while reading a config file | See the Rhino and console logs for details. Check the config files and fix any problems reported in the logs.
3         | Watchdog shutdown                                  | Check the Rhino log for the cause. Most frequently this is caused by system overload or a faulty service.
4         | Timeout waiting for normal shutdown to complete    | No action required.
5         | Restart of node requested by management operation  | No action required.

10.2 JVM Exit Codes

The JVM reports a number of exit codes, and new codes may be added from version to version. At the time of writing they include the following:


10.2.1 Internal JVM Exit Codes

No known exit codes other than 0 (Normal exit)

10.2.2 Signal Handler Exit Codes

Exit codes above 128 occur as a result of the JVM receiving a signal from the OS; by shell convention the exit code is 128 plus the signal number, for example 137 = 128 + 9 (KILL). These are always a result of either OS process management or a JVM bug that has triggered an OS-level protection mechanism. Where the table shows an exit code of 134 for a signal other than ABRT, the JVM's own signal handler caught the signal and then aborted the process (SIGABRT, signal 6). In many cases the console log will have information on the cause. There may also be information in the system log. On Linux the signals and exit codes are as follows; other OSes differ:

Signal Name | Signal Number        | Exit Code | Common causes
HUP         | 1                    | 129       | User closed the terminal running Rhino; operator sent signal to the Rhino process.
INT         | 2                    | 130       | Ctrl-C in the terminal running Rhino; operator sent signal to the Rhino process.
QUIT        | 3                    | N/A       | JVM does not exit. Operator triggered JVM thread stack dump.
ILL         | 4                    | 134       | JVM bug
TRAP        | 5                    | 133       | Bug in a debugger attached to the Rhino JVM
ABRT/IOT    | 6                    | 134       | JVM bug
BUS         | 7 (x86), 10 (SPARC)  | 135       | JVM bug
FPE         | 8                    | 134       | JVM bug
KILL        | 9                    | 137       | Operator or OS killed the Rhino process (out of memory or system shutdown)
USR1        | 10 (x86), 30 (SPARC) | 138       | Operator sent a signal to the Rhino process
SEGV        | 11                   | 134       | JVM bug
USR2        | 12 (x86), 31 (SPARC) | 134       | Operator sent a signal to the Rhino process
ALRM        | 14                   | 142       | Native library set an OS alarm timer
TERM        | 15                   | 143       | Operator or OS killed the Rhino process (system shutdown)
STKFLT      | 16 (x86 only)        | 144       | JVM bug
XCPU        | 24 (x86), 30 (SPARC) | 152       | Ulimit on CPU time
VTALRM      | 26 (x86), 28 (SPARC) | 154       | Native library set an OS alarm timer
PROF        | 27 (x86), 29 (SPARC) | 155       | Bug in a profiler attached to the Rhino JVM
IO          | 29 (x86), 22 (SPARC) | 157       | Bug in a native library performing asynchronous IO
SYS         | 31 (x86), 12 (SPARC) | 140       | Bug in JVM, native library or kernel

10.3 Other exit codes

Some libraries may call System.exit() with a code not in this list. We are not aware of any libraries used by Rhino that do this, and we recommend avoiding any that do when developing services.
