Oracle SOA Suite 11g Troubleshooting Methodology
Compiled by: Amit Deo, Oracle FMW SME Consultant
Note: The middleware universe is full of workarounds :)
Slide 2 of 64 © |
1. Introduction
2. The Problem
3. The Basics of Troubleshooting: Where Do You Start?
4. Infrastructure Issues
5. Performance Issues
6. Deployment Issues
7. Summary
Agenda
Slide 3 of 64 © |
INTRODUCTION
Slide 5 of 64 © |
THE PROBLEM
Slide 6 of 64 © |
T-Mobile's support team had an exceedingly difficult time
pinpointing the specific cause of the problem.
Not only did the team involve representatives for each IT
functional area, they had no way to troubleshoot from the
source and no one team had visibility of the complete
picture.
In general, resolving problems took T-Mobile's melded support team multiple days.
How Every Large Company Troubleshoots
Slide 8 of 64 © |
In the past, App and network admins were to blame for
everything.
Problem With Troubleshooting Integrations
Slide 9 of 64 © |
In the FMW Universe, the integration folks are the new target.
Problem With Troubleshooting Integrations
Slide 10 of 64 © |
Numerous touch points
Numerous SOA technologies
Focus of this document is on Oracle SOA Suite 11g
Problem With Troubleshooting Integrations
[Diagram: integration touch points numbered 1 to 4 across Web Application, OEG, OSB, SOA Suite, OSB, and ODI/OAM/OIM]
Slide 11 of 64 © |
We created a WLST wrapper script that loops through all managed servers
and performs garbage collection on each
OSB relentlessly fails over HTTPS or for other connectivity reasons
Always getting OutOfMemoryError: PermGen space
after new installs/deployments
Weird… but at least consistent
Real World Scenario – Bizarre Behaviour
Slide 12 of 64 © |
Real World Scenario – Convoluted & Unclear
The infamous and ever misleading “Unable to access the
following endpoints” error
Slide 13 of 64 © |
Could be:
Caused by: java.net.SocketTimeoutException: Read timed out
Message send failed: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Real World Scenario – Convoluted & Unclear
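The two underlying causes listed above can be told apart mechanically once the full stack trace is in hand. A minimal sketch in Python, assuming the trace is available as plain text; the function name and the category labels are hypothetical and not part of any Oracle tooling:

```python
def classify_endpoint_failure(stack_trace: str) -> str:
    """Map the text found under an 'Unable to access the following
    endpoints' error to a likely root cause (labels are illustrative)."""
    if "PKIX path building failed" in stack_trace:
        # The remote certificate chain is not trusted by the JVM trust store.
        return "untrusted-certificate"
    if "SocketTimeoutException" in stack_trace:
        # A connection was made but the endpoint did not answer in time.
        return "read-timeout"
    return "unknown"


if __name__ == "__main__":
    sample = "Caused by: java.net.SocketTimeoutException: Read timed out"
    print(classify_endpoint_failure(sample))
```

The point of the sketch is only that the top-level message says little; the "Caused by" text further down is what separates a network problem from a certificate problem.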
Slide 14 of 64 © |
THE BASIC PRINCIPLES OF TROUBLESHOOTING: WHERE DO YOU START?
Slide 15 of 64 © |
Part skill
Some people have a natural tendency to pinpoint problem areas
Can be learned; usually involves a methodical approach and logic
Part knowledge
Without understanding the product, it doesn’t matter how smart you are :)
Most frustrating when it’s related to an area we don’t know
What is Troubleshooting?
Slide 16 of 64 © |
Co-Workers
Internet searches
OTN discussion forums: http://support.oracle.com
My Oracle Support: http://support.oracle.com
Oracle Troubleshooting Guide: http://docs.oracle.com/cd/E15586_01/fusionapps.1111/e14496/soa_trouble.htm
Oracle SOA Suite 11g Administrator’s Handbook: http://www.packtpub.com/oracle-soa-suite-11g-administrators-handbook/book
Existing Resources
Slide 17 of 64 © |
Start Somewhere – Narrow Down Problem Area
Issues
  Performance
    Server-wide
    Service-specific
  Runtime
    Composite
    Infrastructure
  Deployment
Slide 18 of 64 © |
INFRASTRUCTURE ISSUES
Slide 19 of 64 © |
Could be a server issue
Could be a coding issue
Could be a business fault that should be handled by the
code; contact the dev teams
Must be able to differentiate between infrastructure errors
and composite instance errors
Troubleshooting the Infrastructure
Slide 20 of 64 © |
1. Use logs
2. Use thread dumps
Troubleshooting the Infrastructure
Slide 21 of 64 © |
The soa_server1.out log file contains most runtime
issues. For all other issues, refer to the servername.log file.
Must differentiate between infrastructure errors and
composite instance errors
1. Using Logs
Slide 22 of 64 © |
Random crashes immediately after go-live
Only happened in Production
No warning signs
Error does not appear on the EM console
Example: Infrastructure Error
<Aug 5, 2013 12:00:02 AM EDT> <Error><oracle.soa.bpel.engine.dispatch> <BEA-000000>
<failed to handle message
javax.ejb.EJBException: EJB Exception:
java.lang.StackOverflowError...
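Entries like the one above share a fixed header, so they can be split into fields before deciding what kind of error they describe. A minimal sketch in Python, assuming the angle-bracket layout shown in these slides (timestamp, severity, subsystem, message ID); the function names and the short list of JVM errors are hypothetical and only a heuristic:

```python
import re
from typing import Optional

# Header layout as it appears in the slide excerpts:
# <timestamp> <severity><subsystem> <message-id>
HEADER = re.compile(
    r"^<(?P<timestamp>[^>]+)>\s*<(?P<severity>[^>]+)>\s*"
    r"<(?P<subsystem>[^>]+)>\s*<(?P<message_id>[^>]+)>"
)

# JVM-level errors mentioned in this deck; treat the list as a starting point.
JVM_ERRORS = ("java.lang.StackOverflowError", "java.lang.OutOfMemoryError")


def parse_header(line: str) -> Optional[dict]:
    """Return the four header fields, or None if the line has no header."""
    match = HEADER.match(line.strip())
    return match.groupdict() if match else None


def looks_like_infrastructure_error(entry_text: str) -> bool:
    """Heuristic: a JVM-level error points at the server, not the composite."""
    return any(name in entry_text for name in JVM_ERRORS)
```

A SOAPFaultException carrying a business message would not match this heuristic, which fits the distinction the slides draw between infrastructure errors and faults returned by a service.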
Slide 23 of 64 © |
Often easy to distinguish
Should be handled by the code
Shows as a faulted instance on the EM console
Example: Business Fault
<Aug 6, 2013 10:10:33 AM EDT> <Error><oracle.soa.mediator.serviceEngine> <BEA-000000>
<Got an exception:
oracle.fabric.common.FabricInvocationException:
javax.xml.ws.soap.SOAPFaultException:
Message: Organization 129024 not found. Stack trace: at
Core.WebServices.Message.MessageWebService.SaveNotification(Organization organization, Notification notification) in
c:\Data\1.0\Core\Message\MessageWebService.svc.cs:line 100,
detail=javax.xml.ws.soap.SOAPFaultException:
Slide 24 of 64 © |
Thrown by external system
Shows as a faulted instance on the EM console
No action needed on the SOA side; follow up with the target system
Example: System Fault (but not your fault!)
<Aug 6 , 2013 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000>
<Got an exception:
oracle.fabric.common.FabricInvocationException:
javax.xml.ws.soap.SOAPFaultException:
CreateCustomer failed with Message: Cannot insert the value
NULL into column 'CustomerID', table '@Customers'; column
does not allow nulls. INSERT fails.
Slide 25 of 64 © |
The infamous and ever misleading “Unable to access the
following endpoints” error
Example: System Fault
Slide 26 of 64 © |
In this case, due to:
Message send failed: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Example: System Fault
Slide 27 of 64 © |
Just an infrastructure warning
Threads would eventually clear themselves up
Does not show on the EM console
Due to failed transaction that continues to retry
Example: Coding or Infrastructure Problem?
<Sep 30, 2013 11:30:04 PM EDT> <Warning><oracle.integration.platform.instance.store.async> <BEA-000000>
<Unable to allocate additional threads,
as all the threads [10] are in use.
Threads distribution :
Fabric Instance Activity = 1,Fabric-Instance-Manager = 9,>
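The warning above states both the pool size and which consumers hold the threads, so it is easy to check which one dominates. A minimal sketch in Python, assuming the message wording shown in this slide; the function name is hypothetical:

```python
import re


def parse_thread_warning(text: str):
    """Return (pool_size, usage_by_consumer) from the warning text.

    pool_size is None when the message does not state it.
    """
    flat = " ".join(text.split())  # undo line wrapping
    pool = re.search(r"all the threads \[(\d+)\] are in use", flat)
    _, _, tail = flat.partition("Threads distribution :")
    usage = {
        name.strip(): int(count)
        for name, count in re.findall(r"([A-Za-z][A-Za-z \-]*?)\s*=\s*(\d+)", tail)
    }
    return (int(pool.group(1)) if pool else None), usage
```

For the sample message, nine of the ten threads belong to Fabric-Instance-Manager, which is consistent with the slide's explanation of a failed transaction that keeps retrying.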
Slide 28 of 64 © |
A lot more information is logged in the soa_server1-diagnostic.log file
Modifying Logger Levels
Slide 29 of 64 © |
A lot more information is logged in the soa_server1-diagnostic.log file
Modifying Logger Levels
[2012-01-01T22:35:56.144-05:00] [soa_server1] [TRACE] [] [oracle.soa.adapter]
[ecid: cb680017c6a0acfe:-3f1527ec:13487d1ea4c:-8000-0000000000000fe1,0:2]
JmsProducer_execute:[default destination = jndi/CustomerJMSQueue]:
Successfully produced message.
[2012-01-01T22:35:56.256-05:00] [soa_server1] [NOTIFICATION] [] [oracle.soa.adapter]
[ecid: cb680017c6a0acfe:-5675273b:1348cccad75:-8000-0000000000055743,0]
JMSAdapter JMSConsumer JMSMessageConsumer_consume: Got message with ID
ID:<458362.1325475356144.0> from destination jndi/CustomerJMSQueue
[2012-01-01T22:35:56.261-05:00] [soa_server1] [TRACE] [] [oracle.soa.adapter]
[ecid: cb680017c6a0acfe:-5675273b:1348cccad75:-8000-0000000000055743,0]
JMS Adapter JMSProducer:CustomerJMS [ CustomerProduce_ptt::CustomerProduce(body)
] XMLHelper_convertJmsMessageHeadersAndPropertiesToXML:
<JMSInboundHeadersAndProperties xmlns="http://xmlns.oracle.com/pcbpel/
adapter/jms/">[[
<JMSInboundHeaders>
<JMSMessageID>ID:<458362.1325475356144.0></JMSMessageID>
<JMSTimestamp>1325475356144</JMSTimestamp>
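Each diagnostic log entry above carries an ECID, and entries that share one belong to the same flow. A minimal sketch in Python that counts entries per ECID, assuming the "[ecid: ...]" field layout shown in the excerpt; the function name is hypothetical:

```python
import re
from collections import Counter

# The ECID runs up to the first comma or closing bracket.
ECID = re.compile(r"\[ecid: (?P<ecid>[^,\]\s]+)")


def count_entries_per_ecid(log_text: str) -> Counter:
    """Count how many log entries mention each ECID."""
    return Counter(match.group("ecid") for match in ECID.finditer(log_text))
```

In the excerpt, the JMS consume entry and the header conversion entry share one ECID, while the produce entry has a different one.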
Slide 30 of 64 © |
When a managed server goes into warning state, what are
you supposed to do?
2. Using Thread Dumps
Slide 31 of 64 © |
Navigate to Servers > (managed server) > Monitoring >
Threads
Understanding Stuck Threads
Slide 32 of 64 © |
Understanding Stuck Threads
AdminServer.log:
####<Dec 23, 2011 6:03:49 PM EST> <Error> <WebLogicServer> <soahost1> <AdminServer> <BEA-000337> <[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "658" seconds
bam_server1.log:
####<Dec 23, 2011 5:53:36 PM EST> <Error> <JMX> <soahost1> <bam_server1> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1324680816405> <BEA-149500> <An exception occurred while registering the MBean com.bea:Name=AdminServer,Type=WebServiceRequestBufferingQueue,WebServiceBuffering=AdminServer,Server=AdminServer,WebService=AdminServer. java.lang.OutOfMemoryError: PermGen space
Slide 33 of 64 © |
1. We found AdminServer to be in the “Warning” state, due to a stuck thread.
2. We confirmed that there was indeed a stuck “ExecuteThread”, as shown on both the Oracle WebLogic Administration Console and in the AdminServer.log file.
3. By reviewing the soa_server1.log and bam_server1.log files, we found startup errors in the BAM server log.
4. The BAM server was unable to register an AdminServer MBean due to the java.lang.OutOfMemoryError exception that was thrown.
Understanding Stuck Threads
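The first step of the investigation above, confirming stuck threads in a server log, can be scripted. A minimal sketch in Python, assuming the "[STUCK] ExecuteThread" message wording shown in the AdminServer.log excerpt; the function name is hypothetical, and the pattern tolerates line wrapping:

```python
import re

STUCK = re.compile(
    r"\[STUCK\]\s+ExecuteThread:\s+'(?P<thread>\d+)'\s+for\s+queue:\s+"
    r"'(?P<queue>[^']+)'\s+has\s+been\s+busy\s+for\s+\"(?P<seconds>\d+)\"\s+seconds"
)


def find_stuck_threads(log_text: str):
    """Return (thread id, queue name, busy seconds) for each stuck thread."""
    return [
        (m.group("thread"), " ".join(m.group("queue").split()), int(m.group("seconds")))
        for m in STUCK.finditer(log_text)
    ]
```

This only confirms the symptom. As the steps above show, the cause may sit in another server's log.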
Slide 34 of 64 © |
PERFORMANCE ISSUES
Slide 35 of 64 © |
Is logging in to Oracle Enterprise Manager Fusion
Middleware Control extremely slow?
Are all composite instances taking unusually long
to complete?
Are the logs or your dehydration database growing
unusually quickly?
Are you seeing an exceptionally high number of errors in
the logs?
Server Wide Performance Issues
Slide 36 of 64 © |
root@soahost1:/root> df -m
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda8 996 451 494 48% /
/dev/sda9 815881 697454 76314 91% /u01
/dev/sda7 996 36 909 4% /home
/dev/sda5 1984 138 1744 8% /tmp
/dev/sda3 1984 283 1598 16% /var
/dev/sda2 5950 3842 1802 69% /usr
/dev/sda1 99 12 83 13% /boot
tmpfs 8023 0 8023 0% /dev/shm
Check available disk space
Often an overlooked area
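Because disk space is easy to overlook, the check above is worth automating. A minimal sketch in Python that reads df output and flags nearly full filesystems; the function name and the 90% threshold are hypothetical choices, and mount points containing spaces are not handled:

```python
def filesystems_over(df_output: str, threshold_pct: int = 90):
    """Return (mount point, use%) for filesystems at or above the threshold."""
    flagged = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[-2].endswith("%"):
            continue  # not a data row
        used_pct = int(fields[-2].rstrip("%"))
        if used_pct >= threshold_pct:
            flagged.append((fields[-1], used_pct))
    return flagged
```

Applied to the sample output, only /u01 at 91% is flagged.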
Slide 37 of 64 © |
The vmstat or top command outputs CPU,
memory, and I/O statistics
Do not rely on Linux’s reporting of free memory;
look at swap usage instead
Why does Linux report nearly 100% memory usage all the time? The kernel uses otherwise idle RAM for buffers and page cache and releases it when applications need it
Check CPU, RAM, and I/O
root@soahost1:/root> vmstat -S m
procs -------memory--------- --swap-- ---io-- --system-- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 59 402 15055 0 0 2 16 0 0 2 2 96 1 0
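The vmstat sample above shows why the free column alone misleads: very little is free, yet nothing is being swapped. A minimal sketch in Python that reads one vmstat sample, assuming the column header and value lines are passed in as text; the function names are hypothetical:

```python
def parse_vmstat(header_line: str, value_line: str) -> dict:
    """Pair each vmstat column name with its integer value."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    return dict(zip(names, values))


def is_swapping(sample: dict) -> bool:
    """Non-zero swap-in or swap-out means real memory pressure."""
    return sample["si"] > 0 or sample["so"] > 0


def free_plus_cache(sample: dict) -> int:
    """Free memory plus buffers and cache, in the units vmstat was run with.

    Only a rough upper bound on what the kernel could hand back.
    """
    return sample["free"] + sample["buff"] + sample["cache"]
```

For the sample, si and so are both 0, so the host is not swapping even though the free column is small.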
Slide 38 of 64 © |
System log files can reveal resource issues:
root@soahost1:/root> cat /var/log/messages
Aug 31 20:53:22 uslx286 sshd[22480]: fatal: setresuid 10000: Resource temporarily unavailable
Too many running processes can exhaust system resources:
root@soahost1:/root> ps -A | wc -l
297
Too many open files can exhaust system resources:
root@soahost1:/root> lsof | wc -l
6064
Check OS Resources
Slide 39 of 64 © |
For performance, consider the following:
Switching from Sun JDK to JRockit JDK
Optimizing JVM settings
Additional JVM performance tuning documentation from
Oracle can be found at:
http://docs.oracle.com/cd/E23943_01/web.1111/e13814.pdf
http://docs.oracle.com/cd/E15289_01/doc.40/e15060.pdf
JVM Performance Tuning
Slide 40 of 64 © |
Add this to the PORT_MEM_ARGS argument in the setSOADomainEnv.sh (.cmd) script:
-XX:+HeapDumpOnOutOfMemoryError
Although this is not a performance setting, I recommend setting it so that the heap is dumped to an hprof file when
java.lang.OutOfMemoryError exceptions are thrown
This is useful for later analysis and troubleshooting
JVM Logging
Slide 41 of 64 © |
Ensure that the heap allocated to the JVM is appropriately
sized (that is, compare heap versus non-heap usage)
Ensure that there is no excessive garbage collection
Monitor JVM thread performance
Check JVM
Slide 42 of 64 © |
Data source errors are usually easy to identify – when
exhausted, errors show up everywhere
Check Data Sources
Slide 43 of 64 © |
Involve a DBA who is familiar with the platform.
Check Database Performance
Slide 44 of 64 © |
Navigate to Monitoring > Performance Summary
Can choose metrics to display for any composite
Viewing Performance Summary Graphs
Slide 45 of 64 © |
Right-click on Monitoring > Request Processing
Utilizing SQL queries is so much better
Viewing Request Processing Metrics
Slide 46 of 64 © |
Remember the SQL output from the previous slide?
Let’s also get the invoke durations
Composite Instance Performance
SELECT
composite_instance_id,
composite_creation_date,
component_name,
action,
component_state,
TO_CHAR((TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),12,2))*60*60) +
(TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),15,2))*60) +
TO_NUMBER(SUBSTR(TO_CHAR(updated_time-created_time),18,4)),'999990.000') duration
FROM
mediator_instance
WHERE
component_name = 'Order.Create'
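The duration expression in the query works by slicing the text form of the interval updated_time - created_time. A minimal sketch in Python of the same arithmetic, assuming the interval renders as text like "+000000000 00:01:02.345678" (sign, nine day digits, a space, then HH:MI:SS.fraction); that format and the function name are assumptions:

```python
def duration_seconds(interval_text: str) -> float:
    """Mirror the SUBSTR arithmetic from the query.

    SQL SUBSTR is 1-indexed, so SUBSTR(s, 12, 2) corresponds to s[11:13].
    """
    hours = int(interval_text[11:13])        # SUBSTR(..., 12, 2)
    minutes = int(interval_text[14:16])      # SUBSTR(..., 15, 2)
    seconds = float(interval_text[17:21])    # SUBSTR(..., 18, 4), e.g. "02.3"
    return hours * 3600 + minutes * 60 + seconds
```

Under that assumption, two things follow from the slicing: the four-character seconds slice keeps only one decimal digit, and the day portion of the interval is ignored, so an instance running longer than 24 hours would be under-reported.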
Slide 47 of 64 © |
DEPLOYMENT ISSUES
Slide 48 of 64 © |
Involves:
1. Compilation
ant -f ant-sca-package.xml package \
  -DcompositeDir=$CODE/HelloWorld \
  -DcompositeName=HelloWorld -Drevision=1.0
2. Deployment
ant -f ant-sca-deploy.xml deploy \
  -DserverURL=$SOAURL/soa-infra/deployer \
  -Duser=$USERNAME -Dpassword=$PASSWORD \
  -DsarLocation=$CODE/HelloWorld/deploy/sca_HelloWorld_rev1.0.jar \
  -Dpartition=default -Doverwrite=true -DforceDefault=true
Understanding the Ant Deployment Process
(We are not using Ant, but having this information won't hurt.)
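When the deployment step is scripted, building the argument list in code avoids quoting mistakes. A minimal sketch in Python that assembles the deploy command shown above; the property names come from the slide, while the function name and the sample values in the usage note are hypothetical:

```python
def build_deploy_command(server_url, user, password, sar_location,
                         partition="default", overwrite=True,
                         force_default=True):
    """Return the ant deploy invocation as an argument list.

    The list can be passed to subprocess.run() without a shell.
    """
    return [
        "ant", "-f", "ant-sca-deploy.xml", "deploy",
        f"-DserverURL={server_url}/soa-infra/deployer",
        f"-Duser={user}",
        f"-Dpassword={password}",
        f"-DsarLocation={sar_location}",
        f"-Dpartition={partition}",
        f"-Doverwrite={str(overwrite).lower()}",
        f"-DforceDefault={str(force_default).lower()}",
    ]
```

For example, build_deploy_command("http://soahost1:8001", "weblogic", "secret", "/u01/svn/HelloWorld/deploy/sca_HelloWorld_rev1.0.jar") yields the same properties as the command on the slide.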
Slide 49 of 64 © |
Compilation is done via the package target in ant-sca-package.xml
The package target calls other targets to perform:
1. Cleanup
2. Validation
3. Compilation
Understanding the Ant Compilation Process
Slide 50 of 64 © |
Removes any existing SAR files
Compilation: The clean Target
clean:
[echo] deleting
/u01/svn/HelloWorld/deploy/sca_HelloWorld_rev1.0.jar
Slide 51 of 64 © |
Sets environment variables and validates all resources
within the code
Compilation: The scac-validate Target
scac-validate:
[echo] Running scac-validate in
/u01/svn/HelloWorld/composite.xml
[echo] oracle.home =
/u01/app/oracle/middleware/Oracle_SOA1/bin/..
[input] skipping input as property compositeDir has already
been set.
[input] skipping input as property compositeName has already
been set.
[input] skipping input as property revision has already been
set.
Slide 52 of 64 © |
Compiles the code
Compilation: The scac Target
scac:
[scac] Validating composite "/u01/svn/HelloWorld/composite.xml"
[scac] error: location
.
Load of wsdl "HelloWorldWebService.wsdl with Message part
element undefined in wsdl [file:/u01/svn/HelloWorld/
.
[echo]
[echo] ERROR IN TRYCATCH BLOCK:
[echo] /u01/scripts/build.soa.xml:112: The following
error occurred while executing this line:
.
[echo] /u01/app/oracle/middleware/Oracle_SOA1/bin/ant-sca-compile.xml:269: Java returned: 1 Check log file : /tmp/out.err for errors
Slide 53 of 64 © |
Understand that ant runs on the client machine, not the SOA server:
[echo] /u01/app/oracle/middleware/Oracle_SOA1/bin/ant-sca-deploy.xml:188: java.lang.OutOfMemoryError: PermGen space
For compilation errors, check out.err and understand adf-config.xml:
oracle.fabric.common.wsdl.SchemaBuilder.loadEmbeddedSchemas(SchemaBuilder.java:492) Caused by: java.io.IOException: oracle.mds.exception.MDSException: MDS-00054: The file to be loaded oramds:/apps/Common/HelloWorld.xsd does not exist.
Deployment errors are usually straightforward:
[deployComposite] INFO: Creating HTTP connection to host:soahost1, port:8001
[deployComposite] java.net.UnknownHostException: soahost1
Types of Errors
Slide 54 of 64 © |
Located in Unix/Linux:
/tmp/out.err
Located in Microsoft Windows:
C:\Users\[user]\AppData\Local\Temp\out.err
Location of out.err
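Both locations above are the platform's temporary directory, so a script can find out.err without hard-coding the path. A minimal sketch in Python, assuming the JVM that ran ant used the same temporary directory that Python reports (this can differ, for example when TMPDIR is set); the function names are hypothetical:

```python
import os
import tempfile


def out_err_path() -> str:
    """Best guess at where out.err was written on this machine."""
    return os.path.join(tempfile.gettempdir(), "out.err")


def tail_out_err(max_lines: int = 20):
    """Return the last lines of out.err, or an empty list if it is absent."""
    path = out_err_path()
    if not os.path.exists(path):
        return []
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()[-max_lines:]
```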
Slide 55 of 64 © |
OTHER STUFF
Slide 56 of 64 © |
DMS Spy Servlet displays instant Dynamic Monitoring
Service (DMS) related metrics
Navigate to http://<host>:<soaport>/dms/Spy
http://docs.oracle.com/cd/E15586_01/core.1111/e10108/monitor.htm#CFAHIAIB
The DMS Spy Servlet
Slide 57 of 64 © |
The EDN Database Debug Log can be accessed at:
http://<host>:<soaport>/soa-infra/events/edn-db-log
Setting the oracle.integration.platform.blocks.event.saq
logger to TRACE:32 makes the body of the event
message available in the EDN trace
Check Event Delivery Network (EDN)
Slide 58 of 64 © |
SUMMARY
Slide 59 of 64 © |
Troubleshooting is part politics, part product knowledge
Oracle SOA Suite 11g errors can mostly be classified into:
Runtime (or infrastructure) errors
Performance issues/errors
Deployment errors
Summary
Slide 60 of 64 © |
For infrastructure errors:
Identify whether it is a composite or an infrastructure error
Consider increasing logger levels
Identifying the root cause of stuck threads may require some
drill-down investigation
Summary
Slide 61 of 64 © |
For performance issues:
Identify whether it is a server-wide performance issue, or
specific to a single composite
Check overall system health, even the obvious areas
Obtaining composite instance performance metrics is easily
done through SQL; for OSB/Paris, run SoapUI unit tests.
Summary
Slide 62 of 64 © |
For deployment errors:
Understand the ant compilation (i.e., packaging) and
deployment processes
Understand adf-config.xml
Summary
Slide 63 of 64 © |
Oracle SOA Suite 11g Administrator’s Handbook
http://www.packtpub.com/oracle-soa-suite-11g-administrators-handbook/book
Chapter 6: Troubleshooting the Oracle SOA Suite 11g Infrastructure
Highly Recommended Book
Slide 64 of 64 |
Amit Deo, Senior Consultant
Contact Information