Graphics Acceleration Geeks: Rejoice!
If you are interested in hardware acceleration for Java2D on Windows, check
out the latest bits on the Mustang site (http://mustang.dev.java.net).
Dmitri Trembovetski has been working tirelessly to
implement functionality similar to what Chris Campbell did with our OpenGL
rendering pipeline, and it's pretty stunning. There is now (as of build 33) acceleration for
everything from the standard image copies to translucent image operations
to lines to transforms to complex clips to text (AA and non-AA).
Note: This rendering pipeline is disabled by default for now; there
are various issues we are working through to make this renderer as good
in quality as the default renderer. That quote from Spider-Man comes to
mind: "With great power comes great responsibility." Except in our case,
the quote runs more like this: "With great power comes great driver quality
issues"; as we enable more features in Direct3D, we expose more quality and
robustness issues in graphics hardware and drivers that we need to work
around. This driver quality issue is a ripe topic for another article or more; suffice it to say that the hardware and driver manufacturers tend to have a lower bar for "quality" than people tend to expect for Java. To enable the Direct3D pipeline in the current Mustang builds, use the
I thought it might help to dive into one area of acceleration, to explain
why we're doing this, and what benefits you might expect to see.
In both the OpenGL and Direct3D pipelines, we accelerate text by caching
individual characters (glyphs) as tiles in a texture. The first time
you render a glyph (à la drawString()) to an accelerated
destination (such as the Swing back buffer), we will rasterize that glyph
into the texture and then execute a texture-mapped quad operation
to get that glyph into the right place. The next time you draw that
same glyph, we already have it cached and can simply perform the texturing
operation.
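To make the cache-then-texture idea concrete, here is a toy sketch in Java. It is not the actual pipeline code: the class GlyphTileCache, its tile layout, and the counters are all hypothetical names made up for illustration, and the "rasterize" and "draw quad" steps are just counters standing in for the real work.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a glyph cache; names and layout are illustrative, not the real pipeline. */
final class GlyphTileCache {
    /** Top-left corner, in texels, of a cached glyph's tile in the hypothetical atlas. */
    static final class Tile {
        final int x;
        final int y;
        Tile(int x, int y) { this.x = x; this.y = y; }
    }

    private final Map<Character, Tile> tiles = new HashMap<>();
    private final int tileSize;
    private final int tilesPerRow;
    private int rasterizations;   // how many times we paid the "first time" cost
    private int quadsDrawn;       // how many texture-mapped quads we issued

    GlyphTileCache(int atlasWidth, int tileSize) {
        this.tileSize = tileSize;
        this.tilesPerRow = atlasWidth / tileSize;
    }

    /** Returns the glyph's tile, claiming the next free tile on a cache miss. */
    Tile tileFor(char glyph) {
        Tile tile = tiles.get(glyph);
        if (tile == null) {
            int index = tiles.size();   // next free slot; a full atlas is not handled here
            tile = new Tile((index % tilesPerRow) * tileSize,
                            (index / tilesPerRow) * tileSize);
            tiles.put(glyph, tile);
            rasterizations++;           // stands in for rasterizing into the texture
        }
        return tile;
    }

    /** "Draws" a string: one cache lookup and one quad per character. */
    void drawString(String text) {
        for (int i = 0; i < text.length(); i++) {
            tileFor(text.charAt(i));
            quadsDrawn++;               // stands in for the texture-mapped quad operation
        }
    }

    int rasterizations() { return rasterizations; }
    int quadsDrawn() { return quadsDrawn; }

    public static void main(String[] args) {
        GlyphTileCache cache = new GlyphTileCache(256, 16);
        cache.drawString("hello");
        cache.drawString("hello");
        System.out.println(cache.rasterizations() + " rasterizations, "
                + cache.quadsDrawn() + " quads");
    }
}
```

Drawing "hello" twice issues ten quads but rasterizes only four glyphs (h, e, l, o); everything after the first appearance of each character is a cache hit.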
It might not be obvious why this is a Good Thing; after all, doesn't it sound
like a lot more effort to do a full-on 3D texture-mapped quad operation than
to simply draw a few pixels for a character into a buffer? Yes ... and no.
In terms of raw instructions executed, that's probably correct; rasterizing
a glyph is a pretty simple operation. And we already have a software cache
for glyphs, so all we really do on repeat operations is to copy the pixels down
from that cache into the destination. Meanwhile, a texture-map operation
requires possible setup of the rendering destination in Direct3D, possible
transformation setup, creation of appropriate vertex and texture-coordinate
information for the glyph quad, passing down the call to Direct3D, then the
stuff that the Direct3D driver does before handing it off to hardware, which
then rasterizes the textured quad. This definitely sounds like a whole pile
of work.
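For a feel of the "vertex and texture-coordinate information" step, here is a small Java sketch that builds the four corners of a quad for one cached glyph. The class GlyphQuad and its parameters are hypothetical, it assumes square tiles in a square atlas, and it ignores details such as texel-to-pixel alignment that a real renderer has to care about.

```java
/** Builds the four corners of a textured quad for one cached glyph tile (illustrative only). */
final class GlyphQuad {
    /**
     * Returns 4 vertices as {x, y, u, v}, ordered top-left, top-right,
     * bottom-right, bottom-left. x and y are destination pixels; u and v are
     * texture coordinates normalized to the range 0..1.
     */
    static float[][] build(int destX, int destY,
                           int tileX, int tileY,
                           int tileSize, int atlasSize) {
        // Where the glyph lands in the destination (for example, the back buffer).
        float x0 = destX;
        float y0 = destY;
        float x1 = destX + tileSize;
        float y1 = destY + tileSize;

        // Where the glyph lives in the atlas texture, as fractions of the atlas size.
        float u0 = tileX / (float) atlasSize;
        float v0 = tileY / (float) atlasSize;
        float u1 = (tileX + tileSize) / (float) atlasSize;
        float v1 = (tileY + tileSize) / (float) atlasSize;

        return new float[][] {
            {x0, y0, u0, v0},
            {x1, y0, u1, v0},
            {x1, y1, u1, v1},
            {x0, y1, u0, v1},
        };
    }
}
```

Calling GlyphQuad.build(100, 50, 16, 32, 16, 256) places a 16x16 tile taken from atlas position (16, 32) at destination pixel (100, 50). This is the part the CPU still has to do; the per-pixel work happens later on the graphics chip.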
But there are two keys here that make the performance win more understandable:
VRAM and parallel processing.
VRAM: Using video memory is all a matter of getting better performance
by locality of memory. Basically, things happen faster if they are located
more closely together.
Let me try a sports analogy. This is a first for me; anyone who knows me
would be shocked that I'd try this. Sports is one of those things that never
really "took" with me. I'm apt to start talking about runs and goals and
tackles in the same metaphor and the whole analogy would fall flat. But I
like to try new things, so here goes:
Imagine a play in baseball (that's the one with hits and runs and outs, right?).
Let's say that the batter hits a grounder that the fielders need to get to
quickly to try to throw the player out at first. If one of the infielders can
manage to get to the ball before it passes out of the infield, then they
can wing it over to first base and have a hope of throwing the person out.
But if the ball goes into the outfield, then whoever gets the ball has to
throw the ball farther, and thus has less chance of throwing
out the batter at first. Here we see the dynamic of locality; if the play
can be kept completely within the infield, then there is a greater chance of
making the out because the ball can travel much quicker to first base.
Whew! Okay, that was a (7th inning?) stretch, but I made it out the other
side at least. Let's take this back into more familiar territory of
The screen exists in video memory (that's where the data lives that the
monitor inputs read from). The Swing back buffer (as of J2SE 1.4) also
lives in video memory (I'm talking about Windows here, since this
article is about our Direct3D pipeline; other platforms have different
screen/buffer/rendering dynamics). This means fast copies from the back
buffer to the screen; if they are both in VRAM, then the operation is
going to happen faster. This is partly because the bits don't have to travel as
far, but mostly because there is a faster data path from VRAM to VRAM
than there is from system memory to VRAM; pixels don't need to go through the
CPU or over the PCI/AGP/PCI-Express bus, they just go through the faster/
wider video card bus.
(Note: The observant reader may notice that my baseball analogy breaks down
here somewhat. VRAM operations are not faster just because of locality, but
also because there is a faster path for local data. If I were to overload
the analogy to account for this, it would be as if the infield players
were the really good players on the team that could throw a whole lot faster
than the outfielders. This is maybe not too far off-base; when
I played little league it was certainly the case that the person playing
right field (that'd be me) was far slower and less capable than the people closer to
The dynamic between the back buffer and the screen also applies to operations
going to the back buffer itself; anything that can happen from VRAM to that
back buffer has the advantages of locality and a faster/wider data path.
In the case of texture-mapping operations, it may be that there is more
happening to copy each individual pixel into place, but these pixels are
being copied from a better location (VRAM) to the back buffer than
the previous approach of rasterizing or copying from system memory to the back
buffer.
Parallel Processing: Another important factor here that makes all of
this possible is that the graphics chip is a completely separate processor.
So when we're talking about the work involved in rasterizing a texture-mapped
quad, this is all happening on the GPU, not the CPU with the rest of the Java
software stack. In addition to being parallel, the GPU is also highly tuned
for doing these sorts of operations, so it can probably do a much better/faster
job of them than the CPU could.
I could try to overextend the strained baseball analogy here, where the fielders operate
asynchronously to the pitcher, but that would probably result in the
next play starting while the current play was still happening. Baseball
is confusing enough without throwing multi-threading into the mix.
Between these two factors, using data in VRAM and using the capabilities of
the GPU, it is no longer the case that more complicated operations necessarily
result in slower performance.
Another side benefit of this approach is that more interesting text approaches,
such as anti-aliasing, can be supported with basically no additional performance
hit. Typically, in a software rendering solution, text anti-aliasing causes
a significant performance hit. This is because of the increased amount of
stuff happening to rasterize these characters; there is now a read from the
destination pixel and a blending operation to get the smooth edges of each
glyph. Beyond the extra calculations involved here, that simple read can be
quite expensive, especially when the destination is in VRAM. Graphics chips
are really good at doing things in VRAM. They are pretty good at doing things
from the CPU down into VRAM. But they really stink at doing things from VRAM
to the CPU; the read speed of VRAM is really abysmal. So if a software
rasterizer must read from VRAM in order to draw an anti-aliased glyph,
performance will usually suffer.
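Here is a rough Java sketch of what that software path looks like per pixel, just to show where the destination read sneaks in. The class CoverageBlend, the 0..255 coverage values, and the packed 0xRRGGBB pixel layout are assumptions for illustration, not the actual rasterizer.

```java
/** Sketch of software anti-aliased glyph blending; layout and names are illustrative. */
final class CoverageBlend {
    /** Blends one 8-bit channel: coverage 0 keeps dst, coverage 255 yields src. */
    static int blendChannel(int src, int dst, int coverage) {
        return (src * coverage + dst * (255 - coverage) + 127) / 255;
    }

    /** Blends a packed 0xRRGGBB text color over a packed destination pixel. */
    static int blend(int srcRgb, int dstRgb, int coverage) {
        int r = blendChannel((srcRgb >> 16) & 0xFF, (dstRgb >> 16) & 0xFF, coverage);
        int g = blendChannel((srcRgb >> 8) & 0xFF, (dstRgb >> 8) & 0xFF, coverage);
        int b = blendChannel(srcRgb & 0xFF, dstRgb & 0xFF, coverage);
        return (r << 16) | (g << 8) | b;
    }

    /** Draws a glyph given as a row-major coverage mask into a row-major destination. */
    static void drawGlyph(int[] dest, int destWidth,
                          int[] coverage, int glyphWidth,
                          int x0, int y0, int textRgb) {
        int glyphHeight = coverage.length / glyphWidth;
        for (int gy = 0; gy < glyphHeight; gy++) {
            for (int gx = 0; gx < glyphWidth; gx++) {
                int c = coverage[gy * glyphWidth + gx];
                if (c == 0) {
                    continue;               // glyph does not touch this pixel
                }
                int i = (y0 + gy) * destWidth + (x0 + gx);
                int dstRgb = dest[i];       // the read that hurts when dest lives in VRAM
                dest[i] = blend(textRgb, dstRgb, c);
            }
        }
    }
}
```

A non-anti-aliased glyph only ever writes pixels. The anti-aliased one has to read every touched pixel back before writing it, which is exactly the VRAM-to-CPU direction that performs so poorly.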
But with the texture-mapped quad approach to text rendering, there is basically
no extra work going on when the glyphs are translucent. The same operations
occur under the hood, but now they are all happening on the GPU and in
VRAM, which have all the benefits so eloquently and inappropriately laid
out in the baseball analogy above.
So enough about the low-level details. Download the bits, try them out, let
us know what you find. We are continuing to work on it (various performance,
quality, and robustness issues) and will enable Direct3D rendering by default
when we are confident that this rendering pipeline is at least as good
as the default one. In the meantime, you can force it on by using the
Dmitri has just informed me of three bugs that are currently being fixed on our side (not driver issues, actual implementation bugs if you can believe it):
- 6255408: PIT: D3D: Animation freezes when pushing the console to FS mode and restoring it, WinXP
- 6255346: PIT: D3D: VolatileImage is distorted when lowering the color depth at runtime
- 6255836: PIT: ClassCastException thrown when ALT+TABing a FullScreen-page flipping app, Win32
In addition, the pipeline may not get enabled (even when you force it on) in 16-bit color depth; some graphics chips (such as the GeForce MX products from nVidia) have hardware limitations that force us to back off of acceleration in that depth.
If you do see any "issues" on your system, let us know. Be sure to tell us your platform details (especially your OS, graphics chip, resolution, bit depth, and driver version) so that we can chase down and fix the problems.
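If it helps, here is a small Java sketch for collecting some of those details to paste into a report. The class PlatformReport is a made-up helper, not something that ships with the JDK, and it only covers what the standard APIs expose: OS, Java version, resolution, and bit depth. The graphics chip and driver version still have to be looked up by hand.

```java
import java.awt.DisplayMode;
import java.awt.GraphicsEnvironment;

/** Hypothetical helper that gathers basic platform details for a bug report. */
final class PlatformReport {
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("OS: ").append(System.getProperty("os.name"))
          .append(' ').append(System.getProperty("os.version")).append('\n');
        sb.append("Java: ").append(System.getProperty("java.version")).append('\n');

        if (GraphicsEnvironment.isHeadless()) {
            // No display attached, so there is no resolution or depth to report.
            sb.append("Display: none (headless)\n");
        } else {
            DisplayMode mode = GraphicsEnvironment.getLocalGraphicsEnvironment()
                    .getDefaultScreenDevice().getDisplayMode();
            sb.append("Display: ").append(mode.getWidth())
              .append('x').append(mode.getHeight()).append(", ");
            int depth = mode.getBitDepth();
            if (depth == DisplayMode.BIT_DEPTH_MULTI) {
                sb.append("multiple bit depths\n");
            } else {
                sb.append(depth).append("-bit\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```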