
Threadaches

Posted by chet on August 19, 2004 at 3:42 PM PDT


Metaphorical Introduction

I find myself trying to multithread my life constantly. I've got so many things to
do; surely there's a way I can multiplex the chores to get all of them done
faster, right?

For example, I'll be brushing my teeth and realize I also need to comb my hair.
I'm only using one hand to hold the toothbrush, so I reach for the comb with
the other hand. Then I'll start combing
my hair, at which point the other hand with the toothbrush stops moving, or (even
worse) keeps moving, but in such a way that toothpaste starts running all over the place.

Now I've gone from two actions that would have been simple to perform serially
to one interleaved complex action that's resulted
in a toothpaste-and-spittle mess and an unruly mop on my head.

The problem here is that, despite the speed of our processors and the simplicity of
our actions, some things in life simply cannot be multithreaded.

This is not to say that multithreading itself is unachievable or not worthwhile for
some operations. As an example, consider breathing or blinking; if I had to stop any other
action to make my eyes blink or to breathe, I'd never get anything done. Consider
multiplexing actions while driving; if we couldn't do many things at once in this
environment, we'd all be dead. (Of course, many people assume that this multiplexing-while-
driving capability is universal and extends to complex conversations on phones while
driving at 80 on the highway; I believe these folks will eventually be evolved out
of our society, although they may take a few of us with them along the way).

So there are clearly some operations which are better done on separate threads; the trick
is figuring out which ones are which if you don't want to be wiping toothpaste off your
clothes all the time. Or, in the case of your applications, if you don't want to be debugging
thread deadlocks or performance problems due to thread abuse.

That reminds me; I was going to write about software in this article, not toothpaste and driving.
I knew there was a reason I was posting this to a developer site...


Now, to get Technical

I've spent the better part of the last year thinking about multithreading problems, and I could
probably spend the rest of my career doing the same (although it probably wouldn't be a long
career since my head would pop off before too long). There are so many issues in this
ugly space that to write about all of them would take too long. So for the purposes of a
focused blog (although it's probably too late for that goal), let's just look at one
area of trouble for Java client developers: multi-threaded graphics.

When developers first discover the joys of multithreaded programming, it's like opening a
new present on Christmas. The power of Java threading is that it is so easy to
create and use new threads that you can start taking advantage of multithreaded processing
far more readily than you could in previous programming languages.
And the knowledge that your application can perform multiple tasks simultaneously, especially
on systems with multiple processors or hyper-threaded cores, is huge. You no longer need
to block on IO, waiting for the system to open a file. You no longer need to hang in your
display code waiting for an image to be downloaded over the network. You no longer need to
hang the GUI while calculating a complex set of equations. In all of these cases, it is
trivial to spawn a new thread to perform the bottlenecking operation, and use the result
of that operation later as appropriate.
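
For example, here is a minimal sketch of pushing a blocking file read onto its own thread;
the BackgroundLoader class and its methods are hypothetical names I'm using for illustration,
not library calls:

import java.io.*;

// Hypothetical helper: loads a file on its own thread so the caller never
// blocks, and publishes the bytes when the read completes.
public class BackgroundLoader {
    private volatile byte[] data;    // null until the background read finishes

    public void startLoading(final File file) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    InputStream in = new FileInputStream(file);
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    in.close();
                    data = out.toByteArray();    // publish the result
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();    // startLoading() returns immediately
    }

    public byte[] getDataIfReady() {
        return data;    // use the result later, whenever it shows up
    }
}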

In fact, Java builds in much of this functionality for you, so you don't even need to
do the work of multithreading some operations that might otherwise block. For example,
the old Toolkit image-loading methods automatically load images on a separate thread. The
following code:


Image img = Toolkit.getDefaultToolkit().getImage("duke.gif");

would return without having loaded the image. Instead, the method puts the request to
load that image on a separate image-loading thread. Later, when you need to use img,
you may need to make sure it's loaded first
(see my blog on Image APIs
for more information on this asynchronous behavior).
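
If you need the image to be fully loaded before you use it, one common approach is
MediaTracker. A minimal sketch, continuing from the img above (myComponent is just a
placeholder for any Component you have handy):

MediaTracker tracker = new MediaTracker(myComponent);
tracker.addImage(img, 0);
try {
    tracker.waitForID(0);    // blocks until the image has fully loaded (or errored)
} catch (InterruptedException e) {
    // handle the interruption as appropriate
}
// img can now be drawn at its full size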

"Well heck," the Happy Developer says. "If two threads make my application that much
faster, imagine how fast it'll be with ten!"

For example, suppose a developer finds that one of the more costly operations in their
application is some rendering operation, like drawing some complex Shape.
Every frame they have to draw numShapes of these shapes, like so:


// Assume myShapes[] and numShapes are initialized appropriately elsewhere...
public void paintComponent(Graphics g) {
    Graphics2D g2 = (Graphics2D) g;    // Shape drawing requires Graphics2D
    for (int i = 0; i < numShapes; ++i) {
        g2.draw(myShapes[i]);
    }
}

"Golly!", the Happy Developer says, "What if I use that cool
Thread mechanism to speed this up?! Then it'll go way faster."
They might write something like the following:


class ShapeRenderer implements Runnable {
    Graphics2D graphics;
    Shape shape;
    ShapeRenderer(Graphics2D graphics, Shape shape) {
        this.graphics = graphics;
        this.shape = shape;
    }
    public void run() {
        graphics.draw(shape);
    }
}

public void paintComponent(Graphics g) {
    Graphics2D g2 = (Graphics2D) g;
    for (int i = 0; i < numShapes; ++i) {
        Thread t = new Thread(new ShapeRenderer(g2, myShapes[i]));
        t.start();
    }
}

They compile the code, run it hopefully ...
and discover that it didn't fix their performance problem. In fact, they discover that
the app is actually much slower than the original code.


What gives?

There are actually several different factors that can contribute to the performance of
this particular approach, ranging from things that add no benefit
to things that actually make multithreading this application slower.
Let's go through some of these factors, one by one:


1) Thread/object creation overhead



Okay, so we've all heard the mantra that object creation and destruction (and the associated
garbage collection process) is actually pretty fast. Memory allocation and destruction
on current garbage collection systems is almost free, in fact; it costs only the instructions
involved in moving a pointer around. Memory is assigned by reserving space on a pre-allocated
heap, and is reclaimed by another pointer move. (Of course, there are more details here
about how garbage collectors work, but suffice it to say that simple creation/destruction
of temporary objects is very, very cheap.)

However, that's not to say that creating temporary objects is actually free. In particular,
it doesn't mean that you want to create and initialize objects in your inner loop if you
don't have to. In the above example, we are not only asking for temporary memory for the
Thread and ShapeRenderer objects (a cheap operation),
but we are also asking that those objects get created
and initialized, which may not be so cheap, depending on the complexity of the objects
involved and whatever initialization process they need to go through. In this case, the creation
of a thread will probably involve a fair amount of processing at the Java and/or native
levels in order to create the underlying thread object.

The Happy Developer, realizing this, will of course take steps to minimize the temporary
object creation. In this situation, they may realize that since the same thing happens every
time through the loop, there is no reason the application needs to create the Thread
and ShapeRenderer objects every
time through; they can just incur the overhead of creation one time and then reuse these
objects whenever they are needed.

I won't write out a full example here; just picture a variation of the above where the Thread
and ShapeRenderer objects are created only once. Then, inside the
paintComponent() method, they need only update the Graphics object of each ShapeRenderer and
tell each thread to do its thing. (A rough sketch of one way to structure this follows.)
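
One wrinkle is that a java.lang.Thread cannot be restarted once its run() method returns,
so "reusing" the thread really means keeping a long-lived worker that waits for work. A
minimal sketch of that idea; it glosses over details such as the lifetime of the Graphics
object and knowing when drawing has finished:

class ShapeRenderer implements Runnable {
    private final Shape shape;
    private Graphics2D graphics;      // handed in for each repaint; guarded by "this"

    ShapeRenderer(Shape shape) {
        this.shape = shape;
    }

    // Called from paintComponent() to hand this worker the current Graphics.
    synchronized void render(Graphics2D g) {
        graphics = g;
        notify();                     // wake the worker thread
    }

    public void run() {
        while (true) {
            Graphics2D g;
            synchronized (this) {
                while (graphics == null) {
                    try {
                        wait();       // sleep until render() supplies a Graphics
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                g = graphics;
                graphics = null;      // ready for the next repaint
            }
            g.draw(shape);            // draw outside the lock
        }
    }
}

Each ShapeRenderer would be wrapped in a Thread that is created and started once (say, when
the component is constructed), and paintComponent() would simply call render((Graphics2D) g)
on each one.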

Once again, the Happy Developer (this time with a slightly less huge smile of
anticipation on their face) awaits the stunning results ... but discovers that this new
variation is still worse than the original approach.

Things may be a bit better here than before; at least the application is not going through
the contortions of creating and initializing the Thread and ShapeRenderer objects every
time through the painting loop. But there's still something amiss in the messy threading
details.


2) Thread Swapping

One of the hidden details of multithreaded programming is that the operating system has to
go through a fair amount of work every time it switches to a different thread. This is not much in
the whole scheme of things (less than a millisecond, certainly), but it can add up when there
are several threads involved.

For example, if you have ten threads all trying to do similar tasks at the same time and at
the same priority, then the system will keep swapping the threads in and out trying to get the
work done. This may not be as bad as swapping out each thread after just a couple of instructions
in some round-robin fashion; we may get a fair chunk of work done in any given thread before
we are swapped out. But the amount of work accomplished on that thread must be weighed against
the work done to swap threads to know whether it was worthwhile having multiple threads to
accomplish the task.

I wrote a test to see what the thread-swapping overhead was. I ran two threads in a tight loop, calling
wait/notify to swap them back and forth. For 100,000 swaps, it took 1.3 seconds
on that particular test system. This doesn't sound like much, but if you imagine each
thread trying to perform something simple like drawing a single line, the fact that
we could only do 100,000 of these operations in 1.3 seconds makes the thread-swap
overhead seem pretty significant. (For comparison purposes, I also timed calling a function
100,000 times on the same system, which took only about 10 ms.)
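
The original test code isn't shown here, but a rough reconstruction of that kind of
wait/notify "ping-pong" looks something like this (the class name and iteration count are
just placeholders):

public class ThreadSwapTest {
    static final Object lock = new Object();
    static boolean pingTurn = true;
    static final int SWAPS = 100000;

    public static void main(String[] args) throws InterruptedException {
        Thread pong = new Thread(new Runnable() {
            public void run() {
                synchronized (lock) {
                    for (int i = 0; i < SWAPS; ++i) {
                        while (pingTurn) {
                            try {
                                lock.wait();
                            } catch (InterruptedException e) {
                                return;
                            }
                        }
                        pingTurn = true;    // hand control back to the main thread
                        lock.notify();
                    }
                }
            }
        });
        pong.start();

        long start = System.currentTimeMillis();
        synchronized (lock) {
            for (int i = 0; i < SWAPS; ++i) {
                while (!pingTurn) {
                    lock.wait();
                }
                pingTurn = false;           // hand control to the pong thread
                lock.notify();
            }
        }
        pong.join();
        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
    }
}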

The cost of thread-swapping overhead comes into play especially on systems where there are
fewer computing resources available than there are threads that want those resources. This
is an excellent, if obvious, segue into my next point...


3) Limited Thread Resources

The ideal case for multithreaded systems is having one processor per thread, or at least
one processor available whenever a new thread needs processing power. For example, if you have a
four CPU system and there are four threads all trying to run at the same time, they can
each have a processor to themselves and get along swimmingly. There need be no thread-swapping
overhead, as mentioned in point #2 above, because the threads do not have to be swapped
out; they have full control over their CPU (at least while the process is running).

There are now hybrid systems where a single chip has multiple sets of resources available for
threads, such as today's Hyper-Threaded CPUs. While these systems cannot
dedicate an entire CPU to a particular thread, each CPU has enough duplicated resources
to service more than one thread at a time. There is still some overhead of
thread swapping and contention here, but it is at least better than the single-CPU model.

The problem with the sample application above is that it does not take into account anything
about the system when it creates a thread per Shape. Unless the application is running
on a system that has as many CPUs (or at least as many thread resources, in the hyper-threading
case) as there are threads, there is going to be contention in thread processing
and thus overhead for swapping threads in and out.

Our Happy Developer might realize that the available thread processing resources could be a bottleneck.
Suppose they know that,
in general, their application will run on systems with at least 2 processors or one hyper-threaded
processor, and they would still like to take advantage of that capability in multithreading their
application. They may then change their app to use a model of exactly two rendering threads
instead of the one-per-Shape model above. Now, instead of sending each Shape rendering operation
through its own dedicated thread, it will queue up these operations on two separate threads.
The code will be more complex for this approach, but still fairly straightforward. Rather
than a full example, a rough sketch of this thread-sharing approach follows.
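
The sketch below is only illustrative (the RenderQueue class is hypothetical): a single
queue of Shapes drained by exactly two long-lived rendering threads. It glosses over
knowing when all of the shapes for a frame have been drawn, and over the lifetime of
the Graphics object.

import java.awt.*;
import java.util.LinkedList;

class RenderQueue implements Runnable {
    private final LinkedList queue = new LinkedList();   // holds Shape objects
    private Graphics2D graphics;                          // set once per frame

    RenderQueue() {
        new Thread(this).start();
        new Thread(this).start();     // exactly two rendering threads
    }

    synchronized void submit(Graphics2D g, Shape s) {
        graphics = g;
        queue.addLast(s);
        notifyAll();                  // wake a waiting render thread
    }

    public void run() {
        while (true) {
            Shape s;
            Graphics2D g;
            synchronized (this) {
                while (queue.isEmpty()) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                s = (Shape) queue.removeFirst();
                g = graphics;
            }
            g.draw(s);                // the actual rendering happens outside the lock
        }
    }
}

Instead of spawning threads, paintComponent() would simply call
queue.submit((Graphics2D) g, myShapes[i]) for each shape.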

Again, the Happy Developer patiently awaits the results they know will stun the world. There
is a slight faltering of their smile this time; they have met defeat too many times in the past.
Still, they look forward to ... more failure.

Once again, the application fails to improve upon the original single-threaded approach.
This time, they have eliminated
much of the overhead in dealing with threads. And when running on a system that can
process multiple threads simultaneously, they may even have eliminated thread-swapping overhead
(or at least reduced it significantly).

Given the power of doing multiple tasks simultaneously, and the ability of the system to handle
this simultaneity, what happened?


4) Graphics is Inherently Single-Threaded

Here's the sad reality of today's computing platforms; graphics hardware is
inherently single-threaded.

This single-threaded approach is so ingrained in today's platforms that it is an
assumption at the hardware, the driver, and even the API level (although some APIs are
written to handle multithreaded programming, they do not do a good job of
compensating for the limitations below them at the hardware and driver level).

While the hardware architectures, processors, operating systems, and languages have evolved
to allow and encourage multi-threaded programming, the underlying graphics engines simply
cannot do it.

This means that you may be able to easily write an application that performs graphics operations
in multiple threads. And the system you are using (e.g., Java) may turn around and issue those
graphics calls in separate threads. And the underlying systems that Java depends upon
(e.g., X, DirectX, GDI, whatever) may be able to receive those calls from multiple threads.
But when it finally gets down to the hardware, it has all been funneled through one pipe
and there is no way to get any advantage by trying to use multiple threads at a higher level.

Just like the chain that is only as strong as its weakest link, an application is only as
multithreaded as the systems it depends upon; in this case, if the graphics subsystem is
single-threaded, then there is nothing you can do at the upper layers to make that
system more multi-threaded-friendly.

Let's take a look at an example. I was working on this one just this week, playing around with
various options in double-buffering. I wanted to try out the idea of writing to a buffer on
one thread and copying that buffer to the onscreen window on a totally different thread. There were various
reasons for this (and various possible gotchas), but it was at least worth an experiment.

I wrote my rendering loop to fill the buffer as fast as it could, with simple
calls that mimicked a scrolling operation (copy part of the buffer to itself in a different
location, fill in the rest with some color). After each operation, it would
update a flag that told the system that the buffer contents were new (and should thus
be copied to the onscreen window at some time).

I wrote my buffer-copying thread to occasionally wake up (every few milliseconds) and copy the
buffer to the screen if the buffer had been changed since the last copy.

There are a few implementation details here (such as synchronizing on the update flag variable),
but I have described the essential bits.
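
A stripped-down sketch of the shape of that experiment (not the original code; the class
name, scroll amount, and sleep interval are arbitrary, and a plain BufferedImage stands in
for whatever buffer type the real test used):

import java.awt.*;
import java.awt.image.BufferedImage;

class BufferFlipper {
    private final BufferedImage buffer;
    private final Component target;       // the on-screen window to copy into
    private boolean dirty;                // guarded by "this"

    BufferFlipper(Component target, int w, int h) {
        this.target = target;
        this.buffer = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    }

    // Rendering thread: scroll-and-fill into the buffer as fast as possible.
    void renderLoop() {
        Graphics2D g = buffer.createGraphics();
        while (true) {
            g.copyArea(0, 10, buffer.getWidth(), buffer.getHeight() - 10, 0, -10);
            g.setColor(Color.blue);
            g.fillRect(0, buffer.getHeight() - 10, buffer.getWidth(), 10);
            synchronized (this) {
                dirty = true;             // tell the copier there is new content
            }
        }
    }

    // Copier thread: wake up every few milliseconds and copy if needed.
    void copyLoop() {
        while (true) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                return;
            }
            boolean doCopy;
            synchronized (this) {
                doCopy = dirty;
                dirty = false;
            }
            if (doCopy) {
                Graphics gScreen = target.getGraphics();
                if (gScreen != null) {
                    gScreen.drawImage(buffer, 0, 0, null);
                    gScreen.dispose();
                }
            }
        }
    }
}

The render loop and the copy loop each run on their own thread.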

What I saw confused me at first. It ran pretty well on my development system. So I had someone
else run it on their system. In that new environment, it basically froze the window for several
seconds. It looked like we were not even getting to our screen-updating loop, as if the system
was not even kicking off the timer I had set.

In digging into it further, I found that what was actually happening was more disturbing. We were being woken
up correctly based on the timer, and were then attempting to update the screen. But the buffer-rendering loop
had so completely filled the graphics pipeline with scroll/fill calls that we basically
froze the system while those calls were being worked through by the underlying driver and hardware.

Here was a situation where:

  • the language supported the threading approach
  • the underlying subsystem (in this case, DirectDraw) supported rendering to/from multiple
    threads
  • the operating system supported multiple threads


BUT

  • the graphics rendering system was created as inherently single-threaded, so the fact that
    one thread could so completely fill the graphics pipeline meant that future threads that
    the operating system wanted to swap in would simply have to wait for the original thread.

The upshot of this whole diatribe on the graphics subsystem is that there is basically no
gain to be had in changing the original sample application in the way that we did; since the
underlying system is inherently single-threaded, we have nothing to gain and everything to
lose by introducing potential thread overhead when everything will just end up in a single
thread in the final rendering step in the hardware.


Conclusions, Thoughts, Powertool Accidents

The point of this article was to raise some issues to be aware of in multithreaded
programming. Note that I am specifically not saying "Don't Do It!". There
are lots of advantages to multithreaded approaches, some of them mentioned in the
introduction above. For example, it is still a huge win to do time-consuming non-graphics
tasks in a separate thread (such as IO or image loading) if you do not want to block
the main thread. The big example in my desktop client world is the canonical "don't
block the GUI thread" rule: all blocking operations should happen elsewhere so that
the user still sees a snappy GUI.
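
As a minimal sketch of that rule (the class, label, and loadText() method here are
hypothetical placeholders for real work): do the slow work on a background thread and push
the result back to the event thread with SwingUtilities.invokeLater().

import javax.swing.*;

public class NonBlockingLoad {
    static void loadInBackground(final JLabel label) {
        new Thread(new Runnable() {
            public void run() {
                final String result = loadText();       // slow, blocking work
                SwingUtilities.invokeLater(new Runnable() {
                    public void run() {
                        label.setText(result);          // GUI update on the event thread
                    }
                });
            }
        }).start();
    }

    static String loadText() {
        try {
            Thread.sleep(2000);    // stand-in for a real IO or network call
        } catch (InterruptedException e) {
        }
        return "done";
    }
}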

And there are even cases where graphics operations can be effectively multithreaded.
For example, when all of the rendering is done in software (that is, to rendering
destinations that are not hardware-accelerated, such as BufferedImage objects) on a
system that supports multiple threads, there could be a huge win in having separate
rendering threads. Imagine a multi-chip server producing separate images
in parallel, rendering each with software loops and running each of those loops
on a separate processor; this is a major win for parallelism.
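
A sketch of that kind of work (the class and method names are hypothetical); the key point
is that each thread draws into its own software image, so no rendering destination is
shared between threads:

import java.awt.*;
import java.awt.image.BufferedImage;

public class ParallelImages {
    public static BufferedImage[] renderAll(final Shape[] shapes, final int w, final int h)
            throws InterruptedException {
        final BufferedImage[] images = new BufferedImage[shapes.length];
        Thread[] workers = new Thread[shapes.length];
        for (int i = 0; i < shapes.length; ++i) {
            final int index = i;
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
                    Graphics2D g = img.createGraphics();
                    g.setColor(Color.white);
                    g.fillRect(0, 0, w, h);
                    g.setColor(Color.black);
                    g.draw(shapes[index]);     // software rendering, private target
                    g.dispose();
                    images[index] = img;
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < workers.length; ++i) {
            workers[i].join();                 // wait for every image to finish
        }
        return images;
    }
}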

But what I am saying is: be aware of the issues in the platform and underlying systems
when taking a multithreaded approach. And always test your application to see
if you actually got the speedup you were anticipating.

Think of multithreaded programming kind of like a chainsaw. It can be an incredibly
powerful tool that can dramatically reduce the time needed to perform some tasks.
Or it can chop your hand off and cause a huge mess that's impossible to recover from.
You need to know how to use it effectively to determine which one it will do for you.
