


Osvaldo Pinali Doederlein

Osvaldo Pinali Doederlein's Blog

Mustang's HotSpot Client gets 58% faster!

Posted by opinali on November 10, 2005 at 09:13 AM | Comments (17)

[FIXED: I originally posted wrong data for the JSE 6.0 b59 Client VM. The average score was correct, so it stands that the Client VM became 58% faster, but the scores for the individual tests were identical to another VM's because I copied and pasted the HTML without changing the values. Sorry for the confusion. (I re-checked all the other numbers.) I also deleted my comments about an apparently big impact on array-intensive code, because with the new detailed data this trend doesn't look so big anymore: the FFT test improves more than matmult and LU although it isn't array-intensive, while SOR is array-intensive but doesn't improve substantially.]

Mustang is adding new optimizations to HotSpot. That's not news; we expect the JVM to learn new tricks with every release. But the great news is that Sun just added a major performance enhancement to HotSpot Client. Yep, the other HotSpot... the one that never appears in benchmarks aimed at making Java or particular JVMs look good. Unfortunately, you can't realistically use the Server VM in most client apps, from simple database front-ends to action games, because it loads more slowly, eats more RAM, and isn't included in the more ubiquitous JRE package.

Well, good news: Build 59 includes an improvement recorded as Bug 6320351: new register allocator for C1. ("C1" means HotSpot Client; C2 is Server.) The bug description is short, so here it is in full: "The existing register allocator for C1 is very primitive and doesn't take very good advantage of the registers available. It also doesn't allow values to live across blocks. Along with the high level IR models locals in such an explicit manner that the code quality for inlining is fairly bad." A good register allocator is even more important on the Intel x86 platform, which has so few architectural registers.
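To make the bug description more concrete, here is a minimal sketch of the kind of code it is talking about. It is an illustration only, not code taken from HotSpot or SciMark2, and the names (RegisterPressureSketch, lerp, smooth) are hypothetical. The loop keeps several values live across iterations and across a branch (that is, across basic blocks), and it calls a tiny helper whose locals join the caller's once the JIT inlines it. Whether a given VM actually inlines lerp or keeps everything in registers is an assumption; the code itself cannot show that.

```java
final class RegisterPressureSketch {
    // Tiny helper that a JIT compiler would typically inline. After inlining,
    // its locals (a, b, t, diff) compete for registers with the caller's.
    static double lerp(double a, double b, double t) {
        double diff = b - a;
        return a + diff * t;
    }

    // Hot loop: sum, prev, t and i stay live across every iteration, and the
    // if/else splits the loop body into several basic blocks.
    static double smooth(double[] data, double t) {
        double sum = 0.0;
        double prev = data[0];
        for (int i = 1; i < data.length; i++) {
            double cur = data[i];
            double mixed = lerp(prev, cur, t);
            if (mixed > 0.0) {
                sum += mixed;
            } else {
                sum -= mixed;
            }
            prev = cur;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 3.0, -5.0, 7.0};
        System.out.println(smooth(data, 0.5)); // prints 4.0
    }
}
```

An allocator that cannot keep values in registers across blocks would have to spill sum and prev to the stack around the branch on every iteration, which is exactly the cost the new allocator is meant to avoid.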

How cool is this new optimization? I tested with SciMark2, a good benchmark for this kind of thing because it runs tight loops of complex arithmetic expressions, array accesses and similar operations that benefit enormously from good register allocation. Results (on Windows / Pentium 4, 2.25 GHz) are:


HotSpot Client

Test              J2SE 5.0 Update 5   JSE 6.0 build 58   JSE 6.0 build 59
Average score     168                 169                265
FFT               87                  87                 190 (+118%)
SOR               308                 308                354 (+15%)
MonteCarlo        24                  40                 50 (+25%)
Sparse matmult    116                 119                250 (+110%)
LU                303                 291                480 (+65%)

HotSpot Server

Test              J2SE 5.0 Update 5   JSE 6.0 build 58   JSE 6.0 build 59
Average score     410                 431                432
FFT               232                 274                272
SOR               563                 560                560
MonteCarlo        70                  126                126
Sparse matmult    333                 289                289
LU                853                 905                910

Notice the remarkable improvement in HotSpot Client: a 58% boost over 5.0 (the best stable HotSpot) or Mustang b58 (the previous build, without the new optimization). The Server VM is also improving in Mustang, but only by a modest 5% over 5.0. Server 5.0 is already good enough to produce near-ideal code here, and SciMark2 doesn't need most of the new tricks in Server 6.0, from lock coarsening to the [still upcoming] benefits of escape analysis.
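These percentages are plain ratios of SciMark2 scores, so they can be re-derived from the results. A minimal sketch follows; the class and method names are hypothetical. The assumption that the per-test percentages for Client build 59 are measured against build 58 is an inference from the numbers (it is the baseline that reproduces all five figures), not something the post states.

```java
final class SpeedupSketch {
    // Percentage gain of score over baseline, rounded to the nearest integer.
    static long percentGain(double baseline, double score) {
        return Math.round((score / baseline - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        String[] tests = {"FFT", "SOR", "MonteCarlo", "Sparse matmult", "LU"};
        // HotSpot Client scores copied from the results above.
        double[] clientB58 = {87, 308, 40, 119, 291};
        double[] clientB59 = {190, 354, 50, 250, 480};

        // Prints +118%, +15%, +25%, +110% and +65% for the five tests.
        for (int i = 0; i < tests.length; i++) {
            System.out.println(tests[i] + ": +"
                    + percentGain(clientB58[i], clientB59[i]) + "%");
        }

        // Average scores, build 59 against J2SE 5.0 Update 5.
        System.out.println("Client average: +" + percentGain(168, 265) + "%"); // +58%
        System.out.println("Server average: +" + percentGain(410, 432) + "%"); // +5%
    }
}
```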

This all spells good news for Java clients. The improved code generation is good not just for action games with sophisticated geometry calculations. Complex algorithms like Java2D's blit loops, Swing's painting and layout management code, chart renderers, XML parsers, middleware stacks and others lie beneath our vanilla CRUD & reporting applications. HotSpot Client is doomed to always lag behind Server, but Moore's Law makes it possible to keep packing more tricks even into the JVM for low-end machines and user-time sensitive apps.

Comments
Comments are listed in date ascending order (oldest first)

  • One interesting thing to note is that the client JVM is actually faster than the server JVM on the "Sparse matmult" benchmark. Any ideas why?

    Posted by: mayhem on November 10, 2005 at 11:18 PM

  • You say that escape analysis is "still upcoming" but the link to escape analysis (http://www-128.ibm.com/developerworks/java/library/j-jtp09275.html) says "the current builds of Mustang (Java SE 6) can do escape analysis"...?

    Posted by: marc_ on November 11, 2005 at 01:17 AM

  • I believe escape analysis is performed in the latest HotSpot builds, but no escape analysis related optimizations are done. Some benchmarks will see a huge improvement (10x) when these optimizations are implemented.

    Posted by: mayhem on November 11, 2005 at 03:56 AM

  • Absolutely awesome!

    Congratulations to all involved. May many more 58%'s be coming soon :)

    When is the beta due out?

    Posted by: profiler on November 11, 2005 at 04:05 AM

  • Compare the data "Hotspot Client - JSE 6.0 build 59" and "HotSpot Server - J2SE 5.0 Update 5": the detailed data are identical, while the "total performance index" is different.

    How should I read these numbers?

    Posted by: insac on November 11, 2005 at 05:52 AM

  • insac: the LU results differ (910 != 905), but it's marginal.

    Posted by: danielmd on November 11, 2005 at 06:08 AM

  • insac: sorry, my mistake. insac is right, the tables are equal. Is it a typo?

    Posted by: danielmd on November 11, 2005 at 06:10 AM

  • Hi insac and all: I fixed the table, see the comment at the top. Sorry about that, I must have been sleepy. But the overall conclusions still hold, as the most important number, the average score, was correct.

    Posted by: opinali on November 11, 2005 at 08:58 AM

  • this is great, apart from some typos maybe?

    i ran a quantum mechanical calculation using https://quantumj.dev.java.net/, and it seems to be really fantastic:
    HF/sto3g, water, 7 contractions, single point energy
    5.0 : 2129 ms
    6.0b59: 1057 ms

    so i see a good possibility of using more java in numerical algos.

    Posted by: myjinic on November 11, 2005 at 09:56 PM

  • that's cool.

    i ran a quantum chemical code https://quantumj.dev.java.net/
    with HF/sto3g, for water, single point energy:
    5.0: 1475 ms
    6.0b59: 930ms

    so this appears to be promising.

    Posted by: myjinic on November 11, 2005 at 11:23 PM

  • sorry, the earlier posting was the first-time run; the second is the average time of 10 runs.

    Posted by: myjinic on November 11, 2005 at 11:24 PM

  • Both the 1.5.0_05 Server and the 1.6b59 "unoptimize" the Monte Carlo code. It starts out at a value in the 90s, then falls off by 50% as the tests are repeated in a loop.

    Monte Carlo : 98.72
    Monte Carlo : 90.41
    Monte Carlo : 60.92

    Any idea why this happens?

    Nice speed improvement otherwise!

    Posted by: schnied on November 12, 2005 at 03:15 PM

  • schnied: I'm now on b60, but I also tested with 5.0u5, and in both releases I don't see this degradation of MonteCarlo. If I change the benchmark code to run kernel.measureMonteCarlo() in a loop, the scores are very stable. This is the single individual test where Mustang's Server improves over Tiger (+80%), so it may depend on some recent optimization that may still be in the works, buggy or badly tuned, but even this wouldn't explain this behavior for 5.0u5. What you see is quite odd, because HotSpot Server doesn't recompile anything after the first run of MonteCarlo, at least here and with default options.

    Posted by: opinali on November 14, 2005 at 05:18 AM

  • It means that optimizations for P4/SSE2 were implemented in HotSpot Client / Java SE 6.0 as well (before, they lived only in HotSpot Server).

    It greatly affects floating-point computations, but other code does not benefit from them.

    The hardware must have support for SSE2.

    Posted by: doublebass on November 14, 2005 at 09:59 PM

  • doublebass: Good observation, I failed to remember that HSC didn't use SSE previously. This is certainly a big factor for the benchmarks that do FP... and this is 100% of the tests in SciMark2. But I guess that the new register allocator should benefit non-FP code too; see Sun's description of the problems in the old allocator, like poor interaction with inlining. Any routine that has many local variables (including those that "steal" locals from inlined methods) should have some benefit, but we'd need more targeted benchmarks to see this.

    Posted by: opinali on November 15, 2005 at 07:48 AM

  • I'm curious as to how the five values shown in the Java 6 build 59 client column are EXACTLY the same as the five values shown in the Java 5 server column. Yet the overall values given (267 client, 410 server) are different. Could there have been a transcription error?

    Posted by: jxt on November 18, 2005 at 06:07 AM

  • Oh well, sorry about that previous post. I would retract it if I could. I got to the original blog entry from a link, and it showed only the old numbers. After I posted my comment, I was taken to the updated entry. Weird. But the previous post should be disregarded as it has already been pointed out and corrected.

    Posted by: jxt on November 18, 2005 at 06:13 AM
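
The experiment discussed in the comments above, running one kernel repeatedly and watching whether its score stays stable, can be sketched as follows. This is a stand-in, not SciMark2's code: it does not call kernel.measureMonteCarlo(), it uses java.util.Random rather than SciMark2's own generator, and the class name, sample count and seed are arbitrary choices for illustration.

```java
import java.util.Random;

final class RepeatedRunSketch {
    // Estimates pi by sampling points in the unit square and counting
    // how many fall inside the quarter circle of radius 1.
    static double estimatePi(int samples, Random rng) {
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        final int samples = 2000000;
        // Reports which VM is running, e.g. a Client or Server HotSpot.
        System.out.println("VM: " + System.getProperty("java.vm.name"));
        for (int run = 1; run <= 10; run++) {
            Random rng = new Random(113L); // fixed seed: every run does identical work
            long start = System.nanoTime();
            double pi = estimatePi(samples, rng);
            double seconds = (System.nanoTime() - start) / 1e9;
            // Throughput per run; a drop in later runs would mirror the
            // degradation reported in the comments.
            System.out.printf("run %2d: %.1f Msamples/s (pi ~ %.4f)%n",
                    run, samples / seconds / 1e6, pi);
        }
    }
}
```

Because every run uses the same seed, any variation between runs comes from the VM (compilation, recompilation, GC) and the machine, not from the workload.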




