Sunday, July 19, 2009

Performance Optimization of Oracle App server 10gR3 with SOA Suite (BPEL Engine) on Sun’s CMT architecture

1. Summary

Average response times under a 100-user stress test were reduced by a little over 20% to more than 50%, depending on the transaction, through several changes at the Solaris OS, Oracle RAC DB and Sun JVM levels. The methodology used and the details of the tuning applied to achieve this result are explained below.

2. Architecture information

§ Oracle SOA suite with Oracle BPEL

§ Custom developed telecom application written in Java and deployed on Oracle Application server 10gR3 on the Sun JVM

§ The application tier is running on several Sunfire T5120 servers

§ The backend database is a two-node Oracle RAC running on M5000 servers

§ All servers running Solaris 10 OS

§ Web Server: Apache

3. Problem definition/Assessment

The application running in the JVM under Oracle Application Server is performing poorly. The poor performance is quantified as individual business transaction response times being higher than acceptable. There are four business transactions of interest here:

  1. Contract Pay send Action
  2. Query
  3. Subno search
  4. SUB pay send Action

The response times measured from the end-user perspective for the load test were said to be in the range of 15 seconds when running the application tier on the Sunfire T5120 server. The customer expects this to be in the range of about 7 to 9 seconds.

The response times, in seconds, measured from the end-user perspective during the stress test are shown in the table below. The "Ave." column shows the average response times for the transactions, measured during steady state with 100 users running concurrently.

| Measurement | Min. | Ave. | Max. | SD |
|---|---|---|---|---|
| CNTRCT PAY Send Action | 9.287 | 44.398 | 152.873 | 25.791 |
| Query | 60.365 | 144.306 | 305.685 | 48.307 |
| Subno search | 12.671 | 54.118 | 165.503 | 26.178 |
| SUBS PAY Send Action | 6.484 | 34.489 | 91.365 | 17.847 |

A detailed assessment of the current architecture and configuration was conducted, and the following reports were collected from the test systems and analyzed:

§ Solaris system kernel parameters on app tier and db tier

§ Oracle app server config ( opmn logs)

§ JVM config and options

§ Sun Explorer output

§ Oracle Remote Diagnostic Agent (RDA) output

§ Loadrunner output from load test and stress test

§ Application deployment logs showing GC

§ jconsole logs

§ opmn.xml

§ vmstat

§ prstat

§ ndd network parameters

§ apache settings

4. Performance Tuning

After assessing the problem and the environment, the 100-user stress test was chosen as the metric for the performance tuning engagement on the T-series servers from Sun Microsystems. The reasons for choosing it over the load test were:

1. T-series servers are built on the CMT architecture, which is designed to give the best performance to multithreaded, highly concurrent workloads. The 100-user stress test, simulated with the LoadRunner tool, created such a test case.

2. Additionally, the telco application would be serving thousands of concurrent users once deployed in production.

3. Performance testing is about how much work is done in a given amount of time; it is a throughput measurement. Atomic transactions do not represent production-type scenarios.

A baseline was created with the current settings using the 100-user stress test. Various types of tuning were then applied to the environment, with statistics collected for each change made. Below is a rundown of the tuning done at the server, network, Java and OS levels, which reduced average response times by a little over 20% to more than 50%, depending on the transaction.

Table 1: Baseline, 100-user stress test on T5120 (response times in seconds)

| Measurement | Min. | Ave. | Max. | SD |
|---|---|---|---|---|
| CNTRCT PAY Send Action | 9.287 | 44.398 | 152.873 | 25.791 |
| Query | 60.365 | 144.306 | 305.685 | 48.307 |
| Subno search | 12.671 | 54.118 | 165.503 | 26.178 |
| SUBS PAY Send Action | 6.484 | 34.489 | 91.365 | 17.847 |

Current settings

java-options" value="-server -Xms2048m -Xmx2048m -XX:PermSize=256m
-XX:MaxPermSize=256m -XX:NewSize=512m -XX:MaxNewSize=512m
-Dcom.sun.management.jmxremote
-Djava.security.policy=$ORACLE_HOME/j2ee/FE_TRN/config/java2.policy
-Djava.awt.headless=true -Dhttp.webdir.enable=false

-XX:+DisableExplicitGC
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

-XX:+UseTLAB
-XX:LargePageSizeInBytes=4m -XX:AppendRatio=3

-Djava.net.preferIPv4Stack=true
-Dajp.keepalive=true

-Doracle.oc4j.trace.fine=com.evermind.server.http
-Doracle.dms.transtrace.ecidenabled=true -verbose:gc

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:fe_trn_gcdetails.log
-Dsun.rmi.dgc.server.gcInterval=7200000

Recommended settings

Options: -server -Xms3400M -Xmx3400M -Xmn2000m -Xss128k -XX:+UseLargePages
-XX:LargePageSizeInBytes=4m -XX:+AggressiveHeap -XX:+UseParallelGC
-XX:ParallelGCThreads=32 -verbose:gc -XX:+PrintGCDetails

-XX:+PrintGCTimeStamps -XX:-TraceClassUnloading

-XX:+UseParallelOldGC
-Doracle.dms.sensors=none -Doc4j.jms.implementation=oc4j.j2ee.jms
-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.DevPollSelectorProvider

System Tuning in /etc/system:

set kernel_cage_enable = 0
set ip:ip_soft_rings_cnt = 16 (this is currently 8)

Network Tuning:


ndd -set /dev/tcp tcp_conn_req_max_q 16384
ndd -set /dev/tcp tcp_conn_req_max_q0 16384
ndd -set /dev/tcp tcp_xmit_hiwat 131072

ndd -set /dev/tcp tcp_recv_hiwat 131072
ndd -set /dev/tcp tcp_naglim_def 1
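
Values set with ndd do not survive a reboot, so they are usually re-applied from a startup script. The following is a minimal sketch of such a script for the five TCP tunables above. The script layout and the APPLY switch are assumptions made for illustration and were not part of the original engagement; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: re-apply the TCP tunables listed above.
# APPLY is a hypothetical switch: when unset (default) the commands are
# only printed; with APPLY=1 they are executed (requires root on Solaris).

apply_tcp_tunables() {
    while read name value; do
        if [ "${APPLY:-0}" = "1" ]; then
            ndd -set /dev/tcp "$name" "$value"
        else
            echo "ndd -set /dev/tcp $name $value"
        fi
    done <<EOF
tcp_conn_req_max_q 16384
tcp_conn_req_max_q0 16384
tcp_xmit_hiwat 131072
tcp_recv_hiwat 131072
tcp_naglim_def 1
EOF
}

apply_tcp_tunables
```

After running it with APPLY=1, a setting can be confirmed with, for example, ndd -get /dev/tcp tcp_xmit_hiwat.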

Table 2: Results from the 100-user stress test after the above changes (response times in seconds)

| Transaction Name | Minimum | Average | Maximum | Std. Deviation | 90th Percentile |
|---|---|---|---|---|---|
| CNTRCT PAY Send Action | 8.903 | 41.651 | 346.288 | 46.598 | 94.784 |
| Query | 34.665 | 135.605 | 583.347 | 98.061 | 199.261 |
| Subno search | 6.666 | 55.752 | 561.145 | 89.413 | 95.247 |
| SUBS PAY Send Action | 8.166 | 29.013 | 106.795 | 17.925 | 43.028 |

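Only a minor improvement in average response times was observed compared to the baseline: three of the four transactions were slightly faster, Subno search was marginally slower, and maximum response times increased.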

After analyzing this result of 100 user stress test, the following observations were made:

1) The majority of the response time is spent in the J2EE application.

2) Individual threads of the Java application are consuming a lot of CPU: out of about 200 threads in each JVM instance, about 2 are saturating a CPU, as the per-thread output below shows.

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
18845 oratabs 84 1.8 0.0 0.0 0.0 14 0.0 0.1 581 103 4K 0 java/214
18845 oratabs 81 1.0 0.0 0.0 0.0 17 0.5 0.1 398 106 2K 0 java/208

3) Heavy locking and CPU spins are occurring within the Java threads as evidenced by the number of calls to lwp_mutex_wakeup and lwp_mutex_lock in the stack trace collected.
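
The per-thread rows in observation 2 have the layout of prstat microstate output (prstat -mL). As a rough sketch of how such hot threads can be picked out automatically, the hypothetical helper hot_lwps below filters that output for threads whose USR percentage is at or above a threshold. It assumes a POSIX awk (on Solaris 10, nawk or /usr/xpg4/bin/awk rather than the default awk), and the sample input is simply the two rows quoted above.

```shell
#!/bin/sh
# hot_lwps THRESHOLD: reads `prstat -mL` output on stdin and prints the
# PROCESS/LWPID and USR% of every thread whose USR% is >= THRESHOLD.
# Header and summary lines are skipped because their first field is not a PID.
hot_lwps() {
    awk -v min="$1" '$1 ~ /^[0-9]+$/ && $3 + 0 >= min + 0 { print $NF, $3 }'
}

# Example run on the rows quoted in observation 2.
hot_lwps 80 <<EOF
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 18845 oratabs   84 1.8 0.0 0.0 0.0  14 0.0 0.1 581 103  4K   0 java/214
 18845 oratabs   81 1.0 0.0 0.0 0.0  17 0.5 0.1 398 106  2K   0 java/208
EOF
```

On a live system the same filter could be fed from something like prstat -mL -p <pid> 5 1, and the stack of a reported thread can then be inspected with pstack <pid>/<lwpid> to look for the mutex calls mentioned in observation 3.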

To alleviate the lock contention, and thereby reduce the CPU pressure caused by a single thread within a process, the following changes were recommended:

1) Increase the number of JVM instances from 3 to 8, with a 2.6 GB heap on each. Run the same 100-user stress test and collect the usual performance metrics for analysis.

2) If the test using recommendation (1) is successful, the next optimization would be to use Solaris mtmalloc, an alternate memory allocator that ships with the Solaris OS. Multithreaded code can perform and scale better on a multiprocessor system (like the T-series) when it uses a multithreaded memory allocator.

Table 3: Results from the 100-user stress test after increasing the number of JVM instances from 3 to 8 and reducing the heap size from 3.4 GB to 2.6 GB (response times in seconds)

| Transaction Name | Minimum | Average | Maximum | Std. Deviation | 90th Percentile |
|---|---|---|---|---|---|
| CNTRCT PAY Send Action | 9.142 | 28.828 | 102.244 | 12.888 | 44.772 |
| Query | 32.824 | 106.413 | 207.711 | 36.032 | 160.511 |
| Subno search | 5.248 | 26.046 | 117.339 | 18.507 | 49.692 |
| SUBS PAY Send Action | 6.753 | 26.161 | 100.532 | 15.15 | 38.917 |

A substantial decrease in average response times (the Average column in the table above) was observed with these changes compared to the baseline numbers.

The next test was done with mtmalloc enabled on all of the JVM processes. The procedure followed to enable it is detailed below:

  1. Find the script that starts the Java processes.
  2. Edit the script.
  3. Add the line LD_PRELOAD=/usr/lib/libmtmalloc.so.1 and export the variable so that the Java processes inherit it.
  4. Save the file and restart the application services.
  5. After the application restarts, verify that mtmalloc is enabled by running the following command against the process ID of each application service:

$ pldd <pid> | grep -i mtmalloc


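Below is an example as a Korn shell wrapper script:

#!/bin/ksh
# Preload the multithreaded allocator for the command that follows
LD_PRELOAD=/usr/lib/libmtmalloc.so.1
export LD_PRELOAD

# Replace this shell with the wrapped command, preserving its arguments
exec "$@"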
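
The verification step can also be scripted so that every Java process is checked in one pass. The following is a minimal sketch under stated assumptions: the helper names has_mtmalloc and check_java_procs are hypothetical, and the processes are assumed to be named "java". It only defines the functions and does not run anything by itself.

```shell
#!/bin/sh
# has_mtmalloc: reads the output of `pldd <pid>` on stdin and succeeds
# if libmtmalloc appears among the loaded objects.
has_mtmalloc() {
    grep -i 'libmtmalloc' >/dev/null
}

# check_java_procs: hypothetical helper that reports, for every process
# named "java", whether the mtmalloc library is loaded.
check_java_procs() {
    for pid in `pgrep -x java`; do
        if pldd "$pid" 2>/dev/null | has_mtmalloc; then
            echo "$pid: mtmalloc loaded"
        else
            echo "$pid: mtmalloc NOT loaded"
        fi
    done
}
```

Calling check_java_procs after the application services have restarted, as a user allowed to inspect those processes, lists any JVM that was started without the preload.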

Table 4: Results from the 100-user stress test after enabling mtmalloc (response times in seconds)

| Transaction Name | Minimum | Average | Maximum | Std. Deviation | 90th Percentile |
|---|---|---|---|---|---|
| CNTRCT PAY Send Action | 8.739 | 27.283 | 108.846 | 14.514 | 45.502 |
| Query | 36.215 | 98.267 | 263.413 | 37.959 | 151.543 |
| Subno search | 6.36 | 23.639 | 115.511 | 17.301 | 44.259 |
| SUBS PAY Send Action | 7.431 | 26.969 | 113.977 | 17.249 | 39.792 |

Compared to the baseline, a significant decrease in average response times was observed at this stage of the performance testing. Table 5 below shows the side-by-side comparison; in the best case, Subno search, the average response time dropped by more than half.

Table 5: Comparison of average response times (seconds) between the baseline and Table 4

| Transaction Name | Avg response time before tuning | Avg response time after tuning | % decrease in response time |
|---|---|---|---|
| CNTRCT PAY Send Action | 44.398 | 27.283 | 38.5% |
| Query | 144.306 | 98.267 | 31.9% |
| Subno search | 54.118 | 23.639 | 56.3% |
| SUBS PAY Send Action | 34.489 | 26.969 | 21.8% |
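
The percentage column in Table 5 is the relative drop in average response time, (before - after) / before × 100. As a small sketch for recomputing it from the two averages, the hypothetical helper pct_decrease below does the arithmetic with awk (a POSIX awk is assumed; on Solaris 10 that means nawk or /usr/xpg4/bin/awk).

```shell
#!/bin/sh
# pct_decrease BEFORE AFTER: prints the percentage decrease from BEFORE
# to AFTER, rounded to one decimal place.
pct_decrease() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }' </dev/null
}

# Baseline average vs. average after all tuning (Table 1 and Table 4).
echo "CNTRCT PAY Send Action: `pct_decrease 44.398 27.283`%"
echo "Query: `pct_decrease 144.306 98.267`%"
echo "Subno search: `pct_decrease 54.118 23.639`%"
echo "SUBS PAY Send Action: `pct_decrease 34.489 26.969`%"
```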

5. Performance comparison of the application on Sunfire T5120 versus M5000 hardware

To measure the difference in application performance between the T5120 and the M5000, two 100-user stress tests were performed. The same Oracle RAC database was used in the backend; the only difference was that the application tier machine was switched from a T5120 to an M5000.

The average response times were found to be better on the M5000 than on the T5120 for three of the four business transactions, while Subno search was essentially the same on both. See the results of testing in Tables 9 and 10 below.

Table 9: 100-user stress test results on T5120 after all of the tuning was applied (response times in seconds)

| Transaction Name | Minimum | Average | Maximum | Std. Deviation | 90th Percentile |
|---|---|---|---|---|---|
| CNTRCT PAY Send Action | 8.739 | 27.283 | 108.846 | 14.514 | 45.502 |
| Query | 36.215 | 98.267 | 263.413 | 37.959 | 151.543 |
| Subno search | 6.36 | 23.639 | 115.511 | 17.301 | 44.259 |
| SUBS PAY Send Action | 7.431 | 26.969 | 113.977 | 17.249 | 39.792 |

Table 10: 100-user stress test results on M5000 (response times in seconds)

| Transaction Name | Minimum | Average | Maximum | Std. Deviation | 90th Percentile |
|---|---|---|---|---|---|
| CNTRCT PAY SEND ACTION | 1.772 | 12.721 | 156.069 | 14.583 | 25.244 |
| DESKTOP SUBNO SEARCH | 0.883 | 23.743 | 191.891 | 27.233 | 52.43 |
| QUERY BUTTON | 4.389 | 33.887 | 184.868 | 34.58 | 80.255 |
| SUBS PAY SEND ACTION | 1.856 | 10.245 | 83.143 | 11.218 | 21.005 |

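Table 11: Side-by-side comparison of average response times (seconds) between the T5120 and the M5000, with the M5000 values matched by transaction to the rows of Table 10

| Transaction Name | Avg response time T5120 | Avg response time M5000 |
|---|---|---|
| CNTRCT PAY Send Action | 27.28 | 12.72 |
| Query | 98.26 | 33.88 |
| Subno search | 23.63 | 23.74 |
| SUBS PAY Send Action | 26.96 | 10.24 |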


The table below shows the settings that were in place for the comparison testing between T5120 and M5000

7. Conclusion

After assessment, tuning, testing and analysis of the application from a platform perspective, the average response times during a 100-user stress test on the T5120 were reduced by a little over 20% to more than 50%, depending on the transaction. However, these numbers are still generally inferior to those from the same test done on an M5000. The table below provides a side-by-side comparison of the key results.

| Transaction Name | Avg response time T5120 (baseline) | Avg response time T5120 after tuning | Avg response time M5000 | % decrease in response time due to tuning T5120 |
|---|---|---|---|---|
| CNTRCT PAY Send Action | 44.398 | 27.283 | 12.72 | 38.5% |
| Query | 144.306 | 98.267 | 33.88 | 31.9% |
| Subno search | 54.118 | 23.639 | 23.74 | 56.3% |
| SUBS PAY Send Action | 34.489 | 26.969 | 10.24 | 21.8% |

The application performs faster on the M5000 than on the T5120. It can be said that all tuning has been completed at the OS (Solaris) and platform (Sun CMT architecture) levels. To further improve the performance of this application on the T5120, tuning at the application code level (Java + SQL) has to be done. This is normally done by attaching profiling tools to the code and rewriting some pieces of it so that its parallelism can be increased, making it run faster on the T5120 type of architecture. The application-level tuning will also involve changes to the SQL queries executed from the application against the database.

Additional points to be considered

The T5120 being used is Sun’s first generation CMT (chip multithreading) server. Sun has since released a much improved third generation CMT server (T5440).

The T5120 should be counted as 1 CPU @ 1.2 GHz when compared with the M5000 with 8 CPUs @ 2.2 GHz.

The M5000 is better suited for long-running single-threaded applications, while the T5120 (CMT) is better for highly concurrent short transactions.

The T5120 is 1/8th the cost of the M5000. Hence, even if the application performs better on the M5000, it would cost 8 times more to procure and maintain an M5000 than a T5120.
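
The cost point can be weighed against the measured averages. As a rough sketch, the following divides the tuned T5120 averages from Table 9 by the M5000 averages from Table 10 and prints each ratio next to the 8x cost figure quoted above. The ratio helper is hypothetical and a POSIX awk is assumed; a ratio of 2.00 would mean the M5000 answered in half the time.

```shell
#!/bin/sh
# ratio A B: prints A / B with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }' </dev/null
}

COST_RATIO=8   # M5000 cost relative to the T5120, as stated in the text

# Columns: tuned T5120 average (Table 9), M5000 average (Table 10), transaction
while read t5120 m5000 name; do
    r=`ratio "$t5120" "$m5000"`
    echo "$name: T5120/M5000 response-time ratio $r, cost ratio $COST_RATIO"
done <<EOF
27.283 12.721 CNTRCT PAY Send Action
98.267 33.887 Query
23.639 23.743 Subno search
26.969 10.245 SUBS PAY Send Action
EOF
```

This compares average response times only; it does not account for throughput at higher user counts, which is where the CMT servers are expected to do better.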

Another test to be done is to run the same 100-user stress test on a Sunfire T5440 server. This server is part of the same product family as the T5120; however, it is the latest model and has some key features that could yield better application performance. It uses the UltraSPARC T2+ processor and has a higher clock frequency of 1.4 GHz. The next section in this document provides details on the Sun servers and their differences.

3 comments:

Unknown said...

Khader;
Very interesting and thorough information on performance optimization. I don't know if you have noticed but the last table 11 with final settings, doesn't show up for some reason.
Thanks
Abbasi

Unknown said...

this exactly what we saw in our environment with Websphere

Unknown said...

this is the same behaviour what we saw in our environment with Websphere