
Visual comparison in GUI testing, and a recent "horrible" regression

Posted by robogeek on October 13, 2006 at 4:00 PM PDT

Saw "Is Sun's Bug Fixing Policy a Failure or Success?", which refers to "Horrible JComboBox regression in b99 with WindowsXP L&F" ... There's a whole lot to consider in this discussion. What I want to talk about is the difficulty of finding bugs in rendering graphics (like a GUI). I've written on this topic before: "Automated visual verification is hard".

This regression is an example of the difficulty. In a certain circumstance the JComboBox will render only "..." in the elements, making it rather difficult to distinguish elements from each other. Obviously this is bad for the user of the application.

What makes verifying GUI rendering so hard? People do image comparison all day long, every day. That's how we find our way through the world: the brain is continually analyzing the images of what we see and, through that analysis, detecting the objects around us. If it's something our brains can do so quickly, then why is it so hard to do in software testing?

The short answer is that our computers are not our brains .. our brains have a lot of specialized computation circuits for image and pattern recognition. Our brains have over a billion years of R&D behind them (well, that is if you believe the evolutionists and their theories ... maybe God really did all this as a quick hack one week, and we're all living with the result).

One problem is the sheer number of things you can do with Java 2D. The universe of possible renderings is so mind-bogglingly large that it makes one feel small in comparison. One could focus on some bounded number of representative graphics renderings, and that would make the problem more manageable .. but ...

When you get down to the bottom level, where the pixels meet the screen, there's this interesting question. What makes one image okay, and another one not okay? What makes two images the same?

For example .. how would one verify that the text displayed in a Component looked right? Some things you might consider are: did the colors come out right; did the text get onto the screen or were some glyphs not rendered; did the text rendering screw up in some other way; did the text render to the correct coordinates; did some layer substitute the "..." correctly for long strings; is the text rendering similar enough to native rendering to make the public happy; etc.

You might think .. that's just an image comparison. Write a program to compare pixel by pixel and you're done, and it would capture all of those questions. Well, the answer is that a simple image comparison is not enough. It was a tantalizing possibility, and I spent a while building a test framework to try it. For some of our graphics tests we do use this framework, and we do some amount of graphics validation through image comparison.
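To make the "compare pixel by pixel" idea concrete, here is a minimal sketch in Java. It is not the framework mentioned above; the class name ExactImageCompare and the render helper are hypothetical, and the filled rectangle merely stands in for whatever a real test case would draw.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only; not the test framework from the post.
final class ExactImageCompare {

    /** True only if both images have the same size and the same ARGB value at every pixel. */
    static boolean identical(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false; // a single differing pixel fails the comparison
                }
            }
        }
        return true;
    }

    /** Draws a filled rectangle on a white background, standing in for a test case's rendering. */
    static BufferedImage render(Color fill) {
        BufferedImage img = new BufferedImage(20, 10, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 20, 10);
        g.setColor(fill);
        g.fillRect(5, 2, 10, 6);
        g.dispose();
        return img;
    }
}
```

A test would render the component off screen, load the stored reference image, and fail if identical(...) returns false. The strictness is the point and also the weakness: any change at all, valid or not, shows up as a failure.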

The problem came with maintenance of the "golden image" database.

One of the approaches we took was to define a database schema and store the valid rendering for each test case (on each platform) in the database. That would let us easily check, on each weekly build, whether the graphics renderings matched the golden images or not. Conceptually this is very simple .. you save in the database the correct rendering for each test case, and with image comparison you can determine whether it's still rendering that way.

But what happens when the 2D or Swing engineers make a valid change that alters the result of your test case? Well, the golden image is no longer golden, but what you as the quality engineer see is that the test case failed. Maybe you can see that the test case failure is due to a valid change that improves the system, or maybe instead you file a bug. In either case you now have to spend some time replacing the image in the golden database with the new correct/golden image.

In other words, the rate of change in the graphics rendering creates an overhead for the quality engineers, who have to maintain their database of golden images. Further, in some cases the quality engineer will file a bug when it's inappropriate to do so.

At times the 2D engineers are changing things at a rapid pace. They'll try it one way, the next week try it another way, the week after try it yet another way, etc, until they run out of time and we are told to freeze the code so we can ship. That means the quality engineers would be constantly fiddling with the golden image database. As a result we don't use the golden image approach very much, because the quality team decided to limit the overhead of maintaining the database.

Rendering text and comparing text renderings is a tricky topic of its own. It turns out that pixel-by-pixel comparison doesn't work because text rendering algorithms have a little bit of nondeterminism in them. Especially around the edges of glyphs you will often have shadings to avoid jaggy edges, and the shading details vary pretty widely. There's a lot of acceptable variability which humans can put up with, after all. The nondeterminism makes pixel-by-pixel comparisons show a difference when a human would say it's perfectly okay.

This is another sort of overhead .. the times an unimportant rendering difference comes up. You might think it would be enough for the comparison algorithm to have some fuzz .. e.g. if fewer than 5 pixels are different then don't show it as a difference. But I've seen bugs filed where a single misplaced pixel was the justification for the bug. And it's not only a question of how many pixels differ, but of where they are.
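The "fuzz" idea can be sketched roughly as follows, again in Java. The class FuzzyImageCompare, the per-channel tolerance, and the pixel budget are all assumptions made for illustration, not something our framework is known to do this way. Small per-channel differences (the kind antialiasing shading produces) are ignored, and the images count as equal if the number of remaining differing pixels stays within a budget.

```java
import java.awt.image.BufferedImage;

// Hypothetical sketch of a tolerance-based comparison; all names and thresholds are illustrative.
final class FuzzyImageCompare {

    /** Counts pixels where any of the A, R, G, B channels differs by more than channelTolerance. */
    static int countDifferingPixels(BufferedImage a, BufferedImage b, int channelTolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("image sizes differ");
        }
        int count = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (channelsDiffer(a.getRGB(x, y), b.getRGB(x, y), channelTolerance)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Compares two packed ARGB values one 8-bit channel at a time. */
    private static boolean channelsDiffer(int p, int q, int tolerance) {
        for (int shift = 0; shift <= 24; shift += 8) {
            int delta = Math.abs(((p >>> shift) & 0xFF) - ((q >>> shift) & 0xFF));
            if (delta > tolerance) {
                return true;
            }
        }
        return false;
    }

    /** True if at most maxDifferingPixels pixels differ beyond the channel tolerance. */
    static boolean roughlyEqual(BufferedImage a, BufferedImage b,
                                int channelTolerance, int maxDifferingPixels) {
        return countDifferingPixels(a, b, channelTolerance) <= maxDifferingPixels;
    }
}
```

With a budget of 4, this matches the "fewer than 5 pixels" rule, and it shows the weakness right away: an image with one badly misplaced pixel passes, because the count says nothing about where the difference is or whether a human would notice it.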

This whole problem is begging for some image analysis expertise.
