
Profiler Survey - JMC, Solaris Studio, FlameGraph and VTune

Introduction:

In this post, I compare commonly used Java profilers (Java Flight Recorder, Oracle Solaris Studio, FlameGraph, and VTune) in terms of usage, overhead, and the data presented. The observations are based on default collections; if more data are collected, the observations may change.

I run SPECjbb2015 at a fixed injection rate (IR) for 5 minutes. During these 5 minutes, I collect system-level statistics and the SPECjbb2015 console output, and record profiles for 60 seconds (except for VTune).

Software and Hardware:

JDK8u144
Ubuntu 16.04.1 LTS
Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz, 2 Socket, 8 cores per Socket


Java Flight Recorder (JFR):

JFR gives a Java-level profile and some JVM internal statistics.
There are two ways to start JFR: from the Java command line, or with jcmd.

Command Line: 

From the command line: -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=parameter1=value1[,parameter2=value2]

  Parameters for -XX:StartFlightRecording (refer to the Oracle documentation):


    • compress=true|false: Specifies whether to compress the Flight Recorder recording log file (the .jfr file) on disk using the gzip file compression utility. This parameter is valid only if the filename parameter is specified. Default: false.
    • defaultrecording=true|false: Specifies whether the recording is continuous or runs for a limited time. Default: false.
    • delay=time: Specifies the time to elapse during run time before starting the recording. Default: 0.
    • duration=time: Specifies the duration of the recording. Default: unlimited.
    • filename=name: Specifies the name of the Flight Recorder recording log file (the .jfr file). Default: none.
    • name=identifier: Specifies an identifier for the Flight Recorder recording. Default: Recording x (for example, Recording 1, Recording 2).
    • maxage=time: Specifies the maximum age of disk data for the default recording. Default: 15 minutes.
    • maxsize=size: Specifies the maximum disk space for the recording. This parameter is valid only for size-bound recordings. Default: unbound.
    • settings=eventfile: Specifies the event settings file to use for the recording. This parameter can be used multiple times in one command. Default: jre/lib/jfr/default.jfc.

Example:
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:FlightRecorderOptions=disk=true,maxchunksize=10M -XX:StartFlightRecording=filename=test.jfr myApp

JCMD:

Start the JVM with -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
Find the Java pid (for example, with jps or ps -ef | grep java)
jcmd <pid> JFR.start parameter1=value1 parameter2=value2...
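
For example, a minimal sketch (the pid, recording name, and file path are placeholders; with duration set, the recording stops and is written to the file automatically):
jcmd <pid> JFR.start name=rec1 duration=60s filename=/tmp/rec1.jfr
jcmd <pid> JFR.check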

Data Presented:

JFR gives a Java-level profile and some JVM internal statistics. The data collected can be changed by editing the JFR configuration file. Under the JDK there are two configuration files; for example, for JDK8:
jdk1.8.0_112/jre/lib/jfr/default.jfc and jdk1.8.0_112/jre/lib/jfr/profile.jfc.
By default, default.jfc is used. With settings=<path_to_other_file>, you can point JFR to a customized configuration file.
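
For example, to record with the more detailed profile.jfc instead of the default (a minimal sketch; myApp is a placeholder for your application):
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=settings=profile,filename=profile_rec.jfr myApp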

With JDK8, memory usage (GC, allocation), the CPU profile (hot methods), and contention (Threads -> Contention and Latencies) are presented for investigation. Under the Events tab, each event is plotted per thread.

JDK9 exposes many more JVM internal statistics. This definitely helps performance analysis, but may add more overhead.



Solaris Studio:

Oracle Solaris Studio works best on Solaris. It can do some hardware profiling as well.


Download

  1. Be sure to download the latest stable version.
  2. You can download Studio from http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html.
  3. Follow the instructions on the download page to unpack the package to where you want Solaris Studio installed.
  4. Add <path_to_sunstudio>/bin to $PATH (see the sketch below).
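
A minimal sketch (the install path is a placeholder):
export PATH=<path_to_sunstudio>/bin:$PATH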

How to use:

  1. Download the JDK you want to profile. You need to download both the JDK and the debuginfo bundles so that Solaris Studio can find both.
    1. As an example: http://jre.us.oracle.com/java/re/jdk/9/promoted/ea/b44/bundles/linux-x64/, download jdk-8u40-ea-linux-x64.tar.gz and jdk-8u40-ea-bin-b20-linux-x64-debuginfo-31_dec_2014.zip.   jdk-8u40-ea-linux-x64.tar.gz is the product bundle.   jdk-8u40-ea-bin-b20-linux-x64-debuginfo-31_dec_2014.zip is the debuginfo bundle.
  2. Run your application and collect the profile.  One example for collecting the profile:
    • collect -p hi -t 3600-4800s -j on -A off -o /tmp/test.er ${JDK} -d64 -cp ${JAR_NAME} HelloWorld
      • -p option: collect clock-based profiling data. The option can be off | on (10 ms) | lo (100 ms) | hi (1 ms) | n (n can be an integer or floating-point number, with a suffix of u for values in microseconds or m for values in milliseconds; with no suffix, the value is in ms).
      • -t duration: collect data for the specified duration. The duration can be a single number, followed by either m (minutes) or s (seconds, the default), or two such numbers separated by a minus sign. If one number is given, data is collected from the start of the run until the given time; if two numbers are given, data is collected from the first time to the second. If the second time is zero, data is collected until the end of the run. If two non-zero numbers are given, the first must be less than the second.
      • -j on: when the target is a JVM, turn Java profiling on. The default is on.
      • -A option: control whether the load objects used by the target process should be archived into the recorded experiment. The option can be off (do not archive), on (copy load objects into the experiment), src (in addition to load objects, also copy all source files and .anc files), or used (in addition to load objects, also copy all source files and .anc files that are referenced in the recorded data and can be found). To minimize the overhead we usually use -A off, and later archive the experiment with sources.
        Archiving during profiling can generate significant overhead if there are many processes: each of them is archived separately at the end of its execution, and this takes some time. Also, if any of them crashes, its subexperiment will not be archived at all.
      • -o experiment_name: use this name as the experiment name
      • -d directory_name: place the experiment in the directory
For more information on collect, see 'man collect'. If collect ends normally, the experiment is archived when collect finishes.
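
To browse the result, open the experiment in the analyzer GUI or print the hottest functions from the command line. A minimal sketch, assuming the experiment name from the example above:
analyzer /tmp/test.er
er_print -limit 20 -functions /tmp/test.er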

Data Presented:

Usually I check the results at the Expert level. This gives both Java and JVM internal profiles, and is very close to the JFR report in terms of hot spots. If you configure the path to the source code, you can see the source code and disassembly.

Later versions provide a flame chart, which gives a better big picture.


Q&A:

  1. How to connect the experiment and the hotspot source code? 
    1. When taking the profile, please make sure libjvm.debuginfo is under the same directory as libjvm.so.  libjvm.debuginfo can be downloaded from the build bundles.
    2. If you have the hotspot source code, you can map the source code and archive it. For example, assuming the hotspot source is under
/export/home/nikm/Yu.Zhang/jdk/jdk8u40b20/hotspot, while in the experiment it looks for /mypool/HUDSON/workspace/8-2-build-solaris-sparcv9/jdk8u40/2522/hotspot. Under your home directory, create .er.rc and add:
pathmap /mypool/HUDSON/workspace/8-2-build-solaris-sparcv9/jdk8u40/2522/hotspot /export/home/nikm/Yu.Zhang/jdk/jdk8u40b20/hotspot
Then you can archive the experiment with the source: er_archive -s all jz.1.er > ea.log 2>&1
  2. How to trace synchronization without too much overhead?
There are two types of synchronization: Java synchronization (via JVMTI) and native synchronization (mutex, semaphore). For the Java type, the profiler installs a callback function, which can be expensive. For native calls, the profiler uses interposition (it goes to the library and replaces the matching function with a special function); this may add a call frame, and the timestamping part can be expensive.
On Solaris, there is no need to turn synchronization tracing on. On Linux, if we are not interested in Java-level synchronization, we can turn Java collection off and do the tracing.
Watch out for spin locks, as they take CPU cycles.
Identifying C/C++ lock time:
  Solaris: collect -p [on|hi|lo]
      Note: Look for OS Lock time.  However, be aware that spin-locks will
      show up as User CPU Time, not Lock Time.  If such cases are a problem, try
      the Linux suggestion below.
  Linux:  collect -j off -s on
      Note: for c/c++ profiling, we'll turn off Java profiling (-j off) to
      avoid the significant overhead associated with tracing Java-synchronization primitives.
  3. How to identify Total CPU time:
     <any OS>: collect -p [on|hi|lo]
  4. How to identify System Time:
     Solaris: collect -p [on|hi|lo]
     Linux: collect -h cycles,[on|hi|lo] -h cycles~system=1~user=0,[on|hi|lo]
     The hardware counters will identify User vs System time as recorded by the CPU.
  5. Is -p on or -p lo enough to catch GC thread work imbalance? The -p lo sampling interval is 100,000 microseconds (100 ms), while the GC events we want to catch are shorter than 100 ms, so we cannot catch them. For example, log_144, timestamps 4038.029 to 4038.094.
  6. How to narrow down to the GC threads?
    • G1ParallelGCThread: under Functions, look for G1ParTask::work and add a filter to include only stacks containing the selected function. After that, go back to the Timeline; only the G1 parallel GC threads should show up. If there is a function you are interested in, you can click it, go to the call stack function color menu, and change the default color for that function so it stands out.
    • ConcurrentG1RefineThread:
    • CMConcurrentMarkingTask::work
  7. On Linux, does the hotspot point to the wrong place? Looking at source/disassembly for add_reference:
Does it mean the method returned at line 540 has an inclusive total CPU of 185.100? Why is it 195.700 at the source level? How do we dig into that method?
If we follow SparsePRT::add_card (line 465) calling SparsePRT::add_card->RSHashTable::entry_for_region_ind_create, the source is unknown.
If we look at HeapWord* HeapRegion::oops_on_card_seq_iterate_careful: from the source, there are two hotspots, line 455 and others; from the disassembly, line 720 and line 71; but from the logic, we know it should be line 504.
  8. When to turn -h (hardware counter profiling) on? The clock-based sampling should be good enough, so usually we leave this off, unless we need to investigate CPU stalls.
         For example, if you are interested in TLB-miss estimated stalls, L3-miss estimated stalls, CPU user time, and CPU system time (which includes DTLB trap time), you might use:
             collect -h dtlbm_stall -h l3m_stall -h cycles -h cycles~system=1 <executable name>
  9. Can we catch when the application throughput drops? Can we identify that disk activity is high? No.

Flame Graph:

FlameGraph gives a whole-stack picture, covering Java and the lower levels (C++, OS).

To profile a Java application, you need perf-map-agent.

Command Line: 

When starting the JVM, -XX:+PreserveFramePointer has to be used so that frame pointers are preserved (https://medium.com/netflix-techblog/java-in-flames-e763b3d32166).
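
For example, a minimal sketch (myApp.jar is a placeholder for your application):
java -XX:+PreserveFramePointer -jar myApp.jar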

How to use: 

Download and install FlameGraph from https://github.com/brendangregg/FlameGraph
Download and install Perf-Map-Agent from https://github.com/jvm-profiling-tools/perf-map-agent

Set some environment variables for Perf-Map-Agent, for example:
export FLAMEGRAPH_DIR=<path_to_FlameGraph>
export PERF_RECORD_SECONDS=60
Start the Java application and find the pid of the Java process.

perf-java-flames <pid> <perf-record-options> 

For Example: perf-java-flames <pid> -a -g
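
Roughly speaking, perf-java-flames records with perf, generates the JIT symbol map via perf-map-agent, and renders the SVG with the FlameGraph scripts. If you want to run the steps manually, a sketch under those assumptions (the paths, output name, and 99 Hz sampling rate are illustrative):
perf record -F 99 -g -p <pid> -- sleep 60
<path_to_perf-map-agent>/bin/create-java-perf-map.sh <pid>
perf script | <path_to_FlameGraph>/stackcollapse-perf.pl | <path_to_FlameGraph>/flamegraph.pl > java-flames.svg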

Data Presented: 

Comparing FlameGraph to the flame chart generated by Solaris Studio: the Studio flame chart includes only Java-level information, while FlameGraph adds JVM-internal and OS-level information.

VTune:

VTune is expensive, both money-wise and overhead-wise, but it provides profile information from the Java level down to the hardware level. It can also link hotspots with source or disassembly very easily.

To check the kernel config:
cat "/boot/config-`uname -r`" | less
To relax the settings needed for profiling:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
Skipping VTune results, since most people do not have access to VTune.

Overhead Comparison:

This chart shows the SPECjbb2015 Process Rate during the 5-minute steady state. It is for reference only, due to run-to-run variation. JFR has the lowest overhead, followed by FlameGraph. JFR is mostly Java-level only, so it makes sense that it has the lowest overhead.
This table lists the profile size (in bytes) for a 60-second recording:
  JFR: 8,577,320
  Solaris Studio: 11,468,800
  FlameGraph: 1,985,601

Conclusion:

In this blog, I compared Java profilers (JFR, Solaris Studio, FlameGraph, and VTune) in terms of usage, overhead, and data collected.
JFR is a Java-level profiler with the least overhead. It reveals some JVM internal statistics.
Solaris Studio has relatively higher overhead. It connects hotspots and source code well, and works best on Solaris.
FlameGraph gives a picture of the whole stack, from the OS to Java. Its overhead is higher than JFR's.
VTune collects a lot of information, especially hardware related.
