


David Herron

David Herron's Blog

Visual comparison in GUI testing, and a recent "horrible" regression

Posted by robogeek on October 13, 2006 at 04:00 PM | Comments (21)

I saw "Is Sun's Bug Fixing Policy a Failure or Success?", which refers to "Horrible JComboBox regression in b99 with WindowsXP L&F". There's a whole lot in this discussion to consider. What I want to talk about is the difficulty of finding bugs in rendered graphics (like a GUI). I've written on this topic before: "Automated visual verification is hard".

This regression is an example of the difficulty. In a certain circumstance the JComboBox will render only "..." in the elements, making it rather difficult to distinguish elements from each other. Obviously this is bad for the user of the application.

What makes verifying GUI rendering so hard? People do image comparison all day long, every day. That's how we find our way through the world: the brain is continually analyzing the images of what we see and detecting the objects in them. If it's something our brains can do so quickly, then why is it so hard to do in software testing?

The short answer is that our computers are not our brains .. our brains have a lot of specialized computation circuits for image and pattern recognition. Our brains have over a billion years of R&D behind them (well, that is if you believe the evolutionists and their theories ... maybe God really did all this as a quick hack one week, and we're all living with the result).

One problem is the sheer number of things you can do with Java 2D. The vast universe of possible renderings is so mind bogglingly large that it makes one feel small in comparison. But actually one could focus on some bounded number of representative graphics renderings and that would make the problem more manageable .. but ...

When you get down to the bottom level, where the pixels meet the screen, there's this interesting question. What makes one image okay, and another one not okay? What makes two images the same?

For example .. how would one verify that the text displayed in a Component looked right? Some things you might consider are: did the colors come out right; did the text get onto the screen or were some glyphs not rendered; did the text rendering screw up in some other way; did the text render to the correct coordinates; did some layer substitute the "..." correctly for long strings; is the text rendering similar enough to native rendering to make the public happy; etc.

You might think .. that's just an image comparison. Write a program to compare pixel by pixel and you're done, and it would capture all of those questions. Well, the answer is that a simple image comparison is not enough. That was a tantalizing possibility, and I spent a while building a test framework to try it. For some of our graphics tests we do use this framework, and we do some amount of graphics validation through image comparison.
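
For reference, the naive pixel-by-pixel comparison described above might look like the sketch below. It is only an illustration and not the framework mentioned in this post; the class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Naive exact comparison: a single differing pixel counts as a mismatch. (Hypothetical helper.) */
final class ExactImageCompare {

    /** Returns true only if both images have the same size and identical ARGB values everywhere. */
    static boolean identical(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return false;
        }
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A check this strict reports a failure for any change at all, which is exactly why it turns out to be insufficient on its own.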

The problem came with maintenance of the "golden image" database.

One of the approaches we took was to define a database schema and store the valid rendering for each test case (on each platform) into the database. That would let us easily check, on each weekly build, whether the graphics renderings matched the golden image or not. Conceptually this is very simple .. you save in the database the correct rendering for each test case, and with image comparison you can determine whether it's still rendering that way.

But what happens when the 2D or Swing engineers make a valid change that legitimately changes the result of your test case? Well, the golden image is no longer golden, but what you as the quality engineer see is that the test case failed. Maybe you can see that the test case failure is due to a valid change that improves the system, or maybe instead you file a bug. In either case you now have to spend some time replacing the image in the golden database with the new correct/golden image.
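
The check-then-replace cycle can be sketched as follows. This is a minimal illustration under stated assumptions: it stores one PNG per test case and platform in a directory instead of the database schema described above, and the class name, method names, and file layout are all hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical golden-image store: <root>/<platform>/<testCase>.png (a directory, not a database). */
final class GoldenImageStore {
    private final File root;

    GoldenImageStore(File root) {
        this.root = root;
    }

    private File fileFor(String testCase, String platform) {
        return new File(new File(root, platform), testCase + ".png");
    }

    /** Returns true if a golden image exists and matches the new rendering exactly. */
    boolean matches(String testCase, String platform, BufferedImage actual) {
        File file = fileFor(testCase, platform);
        if (!file.isFile()) {
            return false; // no golden image recorded yet
        }
        BufferedImage golden;
        try {
            golden = ImageIO.read(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (golden == null || golden.getWidth() != actual.getWidth()
                || golden.getHeight() != actual.getHeight()) {
            return false;
        }
        for (int y = 0; y < golden.getHeight(); y++) {
            for (int x = 0; x < golden.getWidth(); x++) {
                if (golden.getRGB(x, y) != actual.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Replaces the golden image once a person has decided the new rendering is correct. */
    void accept(String testCase, String platform, BufferedImage approved) {
        File file = fileFor(testCase, platform);
        File dir = file.getParentFile();
        try {
            if (!dir.isDirectory() && !dir.mkdirs()) {
                throw new IOException("cannot create " + dir);
            }
            if (!ImageIO.write(approved, "png", file)) {
                throw new IOException("no PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Every call to `accept` stands for a manual decision by the quality engineer, and that is where the maintenance cost accumulates.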

In other words, the rate of change in the graphics rendering creates an overhead for the quality engineer to maintain their database of golden images. Further, in some cases the quality engineer will file a bug when it's inappropriate to do so.

At times the 2D engineers are changing things at a rapid pace. They'll try it one way, the next week try it another way, the week after try it yet another way, etc, until they run out of time and we are told to freeze the code so we can ship. That means the quality engineers would be constantly fiddling with the golden image database. And what that meant as a result is we don't use the golden image approach very much, because the quality team decided to limit the overhead we see in maintaining the database.

Rendering text and comparing text renderings is a tricky topic of its own. It turns out that pixel-by-pixel comparison doesn't work because text rendering algorithms have a little bit of nondeterminism in them. Especially around the edges of glyphs you will often have shadings to avoid jaggy edges, and the shading details vary pretty widely. There's a lot of acceptable variability which humans can put up with, after all. The nondeterminism makes pixel-by-pixel comparisons show a difference when the human will say it's perfectly okay.

This is another sort of overhead .. the times an unimportant rendering difference comes up. You might think it would be enough for the comparison algorithm to have some fuzz .. e.g. if less than 5 pixels are different then don't show it as a difference. But I've seen bugs filed where a single misplaced pixel was the justification for the bug. And also it's not a question of the number of different pixels, but where they are.
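
A comparison with some fuzz could be sketched like this. It is an assumption-laden illustration, not the comparison used in any Sun framework: the class name, the per-channel tolerance, and the idea of reporting a bounding rectangle are all hypothetical choices. It reports both how many pixels differ and where they are, since the count alone is not enough.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical fuzzy comparison: ignores small per-channel differences, reports count and location. */
final class FuzzyImageCompare {

    /** How many pixels differ, and the smallest rectangle containing them (null if none differ). */
    static final class Diff {
        final int count;
        final Rectangle bounds;

        Diff(int count, Rectangle bounds) {
            this.count = count;
            this.bounds = bounds;
        }
    }

    /** Compares two images of equal size; a pixel differs if any channel is off by more than the tolerance. */
    static Diff compare(BufferedImage expected, BufferedImage actual, int channelTolerance) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("images must have the same size");
        }
        int count = 0;
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE, maxX = -1, maxY = -1;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (!closeEnough(expected.getRGB(x, y), actual.getRGB(x, y), channelTolerance)) {
                    count++;
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        Rectangle bounds = (count == 0)
                ? null
                : new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
        return new Diff(count, bounds);
    }

    /** True if the alpha, red, green, and blue channels each differ by at most the tolerance. */
    static boolean closeEnough(int argb1, int argb2, int tolerance) {
        for (int shift = 0; shift <= 24; shift += 8) {
            int c1 = (argb1 >>> shift) & 0xFF;
            int c2 = (argb2 >>> shift) & 0xFF;
            if (Math.abs(c1 - c2) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

Even with the location reported, someone still has to decide which regions matter for a given test case; the sketch only makes that information available.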

This whole problem is begging for some image analysis expertise.


Comments
Comments are listed in date ascending order (oldest first)

  • Finding the bug is hard, but that's why we are here. Choosing not to fix it is really what this is about. It's a known bug.

    Posted by: mikaelgrev on October 13, 2006 at 04:19 PM

  • The first link has an extra ? attached to it.

    Posted by: kirillcool on October 13, 2006 at 05:42 PM

  • How about some OCR techniques at least for text verification? These have existed for a very long time and are very good, especially when the input is clear and not "distorted" by printing artifacts.

    Posted by: kirillcool on October 13, 2006 at 05:48 PM

  • Mike - my job is to think about better ways for the quality team to find bugs. That's what my posting is about. That it's in the context of discussion about a specific point is merely the starting point for my posting.

    Kirill - I've looked a little into OCR and not been happy so far. What I've seen so far has only supported reading English text and generally also suited to reading text that's in horizontal lines. I agree that a rendering out of Java would be a kind of ideal case for OCR, as opposed to e.g. a FAX that's been xerox'd through several generations. But another big aspect of our problem is support for non-English text.

    Posted by: robogeek on October 13, 2006 at 09:02 PM

  • Google Scholar returns about 4220 results on "text recognition". Arabic is on the first page, Devanagari on the third and I'm sure other languages are up there. Sure, this may not be free, but what's an extra few hundred (or even thousand) dollars for a commercial license that can spare such an embarrassment?

    Posted by: kirillcool on October 13, 2006 at 09:09 PM

  • Hi David--I've been thinking about this very same issue as regards the Flying Saucer project, where we take XML/XHTML and apply CSS to render to either a Java2D or a PDF canvas. Our tests are mostly visual, e.g. we have a test base of pages we run through and verify visually. We have some model-oriented tests, but at a much lower level, for CSS interpretation (cascade, inheritance). I've been wondering how to improve this to catch errors earlier.

    One idea is that there is a "model problem" and a rendering problem. The "model problem" is about testing things like layout attributes and rendering characteristics (color, fonts, etc.). You can test the model, in principle, mostly independently of the rendering target. For Swing, this would be a rule-based set of checks verifying that component layout was accurate, for example (to a certain degree of accuracy). In our case, that would be like checking our layout model, which is used in the render cycle--e.g. that a div assigned a CSS width and height of 100 ended up having a width and height of 100 in the layout model.

    The second problem is the rendering problem, where you need to check if the rendered output is as expected. With respect to rendering, I have two ideas. One is that some simple rules will always apply, for example--render to a canvas, extract the image, and verify that the component border lines are within some range on the rendered image.
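
A rule of that kind might be sketched as below. This is only an illustration of the "border line within some range" idea and not code from Flying Saucer or Swing; the class name, method names, and the exact-color match on the border are assumptions.

```java
import java.awt.image.BufferedImage;

/** Hypothetical rule-based check: is the left border line where the layout says it should be? */
final class BorderLineCheck {

    /** Returns the x of the first pixel in row y with the given RGB color (alpha ignored), or -1. */
    static int firstColumnWithColor(BufferedImage image, int y, int borderRgb) {
        for (int x = 0; x < image.getWidth(); x++) {
            if ((image.getRGB(x, y) & 0xFFFFFF) == (borderRgb & 0xFFFFFF)) {
                return x;
            }
        }
        return -1;
    }

    /** True if the border color first appears within the tolerance of the expected x position. */
    static boolean leftBorderWithin(BufferedImage image, int y, int borderRgb,
                                    int expectedX, int tolerance) {
        int x = firstColumnWithColor(image, y, borderRgb);
        return x >= 0 && Math.abs(x - expectedX) <= tolerance;
    }
}
```

The image passed in would be whatever the component or page was rendered to, for example a `BufferedImage` obtained by painting onto its `Graphics2D`.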

    The second set of rendering tests requires a person to visually compare images. What I think you need in that case is a testing process where the tester can "flip through" a number of panels, with each panel representing a limited number of checks. For the combobox example, the panel might show a small set of comboboxes stacked vertically or aligned horizontally, a single jbutton in the corner of the panel to expand the comboboxes, and a fixed, known data set in the combobox model. Across the top of the panel (as with all the test panels) is an instruction (in large print) on what to look for. To make it easier, some big red arrow glyphs or highlights could be added to a glasspane to draw attention visually to the area where one needs to look. In Flying Saucer, a rough analogy is the CSS rendering tests provided by the W3C, a large set of XHTML/CSS pages, each of which tests one CSS rule (or just a couple), allowing you to flip through the pages and quickly spot where you have problems.

    With such a "flip through" system you can take a test suite and work through it pretty quickly, maybe even entering notes as you go on the bottom of each panel, capturing screenshots automatically, etc.

    Just a thought. It's certainly an issue for us, and I can see how it's a big problem on your end.

    Best regards (and good luck!), Patrick

    Posted by: pdoubleya on October 14, 2006 at 01:36 AM

  • Patrick, this "model problem" approach is what we've called "layered testing". There's a lot of internal checks of the sort you mention that can be done. But at the end the stuff gets rendered to the screen and needs to be checked visually. The "flip through" method is interesting .. but don't you think the tester will get eye fatigue after awhile?

    Posted by: robogeek on October 14, 2006 at 07:53 AM

  • Kirill, oh, cool, I didn't know about this Google Scholar service.

    Posted by: robogeek on October 14, 2006 at 07:56 AM

  • David, one of the more popular and comprehensive sites before Google Scholar came to play used to be CiteSeer. It contains links to almost all the featured articles, and while some of them build upon previous work, and most of them don't feature complete code, you don't have to go that route. Surely there are a lot of industrial strength commercial libraries that you can buy for internal use.

    Posted by: kirillcool on October 14, 2006 at 08:06 AM

  • So the new JComboBox has a regression to the previous one, which worked correctly? Hmm.... I suspect I am not the only one who really knows the correct word: it's a degradation. Doubtless the proper term is not politically palatable to management; but isn't it irresponsible, as there are harms created as a result of these introduced errors, to use a completely antonymous euphemism? If we, as software professionals, cannot insist on correct terminology, then perhaps we should appropriately use my two-year-old nephew's favourite, and more accurate, term: "whoopsie-doodle."

    Posted by: cajo on October 14, 2006 at 11:02 AM

  • This fix was tested and did work. The new bug was actually caused by a merge error. I was working on several combobox fixes at the same time and screwed up the merge when doing conflict resolution. So it is entirely my fault. The original fix worked and was tested. It was my merge error that caused this. We are trying to get this fixed and I will have an update blog later this week. Thank you for your patience.

    Posted by: joshy on October 14, 2006 at 01:01 PM

  • Regression is the word we use at Sun ... as in Mustang Regressions Contest.

    Do you call the stuff that cars spew into the atmosphere pollution or the more correct term poison?

    That's the point you're making, right? That calling something by a weaker word makes it easier to brush under the rug? Is that the allusion you're making? In the Java team we have a policy around regressions (our word) and I don't know how the policy applies at the stage of the release cycle we're in. I understand that the decision is between fixing it now or fixing it in update 1.

    Posted by: robogeek on October 14, 2006 at 01:03 PM

  • Hi David--as regards eye fatigue, obviously you want to build the test suite such that they are in fact practical to carry out; if people got tired running them and then skip tests it would of course be of less value. My guess would be that,
    a) you can batch tests up into small sets, typically requiring 15 minutes review, with breaks between sets (where the tester can take notes, log bugs, etc)
    b) you can work on the layout of the forms to try and make the visual comparison as easy as possible--for example, overlay "pay attention to this" red circles or arrows to draw the eye to the part of the component that one needs to focus on
    c) reduce the amount of visual information on each test panel to a minimum--each panel tests a very limited number of features, and follows some standard layout guidelines to reduce visual clutter and overhead of processing new (and complex) layouts visually
    I think it's something to play with. The W3C CSS 2.1 test suite is a set of XHTML pages which test specific features of CSS. They simplify review by having for example just two lines of text saying "This line should be green/this line should be red", where the lines are styled using some cascade or inherit. You have two lines, one green, one red--check that and move on. In some cases you could go a step further and have a base comparison to the side--so "this line should be green" is lined up with a GIF block that is green--you just verify they have the same color (test being that the CSS for color is assigned properly and results in the same color as the GIF).
    Cheers--Patrick

    Posted by: pdoubleya on October 15, 2006 at 03:10 AM

  • I think the solutions are overly complex and won't pay off. Use the community's will to find bugs instead by fixing the crappy bug reporting system.

    Posted by: mikaelgrev on October 15, 2006 at 03:19 AM

  • If you can generate a ton of before/after pictures.. Maybe you can use Amazon's mechanical turk project to find problems:
    http://www.mturk.com/mturk/welcome

    Posted by: dog on October 16, 2006 at 06:57 AM

  • How about using Amazon's Mechanical Turk for image comparison/regression testing purposes? Since humans can obviously do this task much better than computers, you need to solve the organizational problem of distributing the test cases and collecting the results. Multiple levels of redundancy will make it more robust.

    Posted by: karakfa on October 16, 2006 at 10:30 AM

  • The mechanical turk is an interesting idea. It wouldn't work for all test cases as in some cases the bug to be caught would be a few pixels out of place. Where it would work, and in general where humans work for this, is where the difference is pretty obvious.

    The bug in question would lend itself pretty well to software based analysis. We now have in the 2D SQE team someone with software image analysis experience, and he has some interesting ideas. For example a simple test would be to count up the number of pixels in a given image that are close to the desired color. If that count is significantly different than the expected count then you know there's a problem. No need for OCR but instead it's relying on statistics.
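
The pixel-counting test could be sketched as follows. This is a hedged illustration and not the 2D SQE team's code; the class name, the per-channel tolerance, and the "allowed fraction" threshold are assumptions. For the JComboBox case, an item rendered as only "..." would contain far fewer text-colored pixels than the full label.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

/** Hypothetical statistical check: count the pixels that are close to a desired color. */
final class ColorCountCheck {

    /** Counts pixels whose red, green, and blue values are each within the tolerance of the target. */
    static int countPixelsNear(BufferedImage image, Color target, int channelTolerance) {
        int count = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (Math.abs(r - target.getRed()) <= channelTolerance
                        && Math.abs(g - target.getGreen()) <= channelTolerance
                        && Math.abs(b - target.getBlue()) <= channelTolerance) {
                    count++;
                }
            }
        }
        return count;
    }

    /** True if the actual count deviates from the expected count by at most the given fraction. */
    static boolean countLooksRight(int actualCount, int expectedCount, double allowedFraction) {
        return Math.abs(actualCount - expectedCount) <= expectedCount * allowedFraction;
    }
}
```

The expected count and the allowed fraction would have to be calibrated per test case and platform, since font rendering changes the number of text-colored pixels.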

    Posted by: robogeek on October 16, 2006 at 10:42 AM

  • mikaelgrev, you suggest fixing the crappy bug reporting system. I agree completely. In fact I made a posting a month or so ago seeking input for the bug reporting system. What would you like to see in the bug reporting system? You're implying that a better bug reporting system would improve collaboration between Sun and the community. What features in the bug reporting system do you think would increase collaboration?

    Posted by: robogeek on October 16, 2006 at 10:44 AM

  • There's some info on this subject here:

    http://www.javalobby.org/java/forums/m91825013.html#91825013
    http://www.javalobby.org/forums/thread.jspa?threadID=15419&tstart=0
    http://www.javalobby.org/java/forums/m91811528.html#91811528
    http://today.java.net/pub/pq/24

    Cheers,
    Mikael Grev

    Posted by: mikaelgrev on October 16, 2006 at 11:33 AM

  • I have to disagree having worked quite a bit with Sun teams in testing phone applications (far worse, imagine manual device testing requiring 20 people to run tests on devices for 2 weeks of at least 12 hour days!).

    We had several cases such as this specifically with SVG related functionality, our lead SVG engineer built a tool that does just that. The thing is that all SVG tests are visual mostly with a single end result, those tests marked as such had a screenshot which was compared to the result. If the comparison was OK the test passed without interaction if it failed the image was displayed with the original and the tester had to approve the result...

    There is a huge difference between automated test execution and manual comparison, with every release candidate (and often once a week regardless) you have to run all the tests... If this just consists of running the test (no interaction) this is 0 work and bugs can be caught early. If we updated the underlying drawing primitives all the tests would fail, but that is a small price to pay for the freed up tester cycles (we always need more testers).

    Most mature tools contain inner code to perform such comparisons and updates, even Sun's lame tools (JavaTest ugh) can be updated to include decent support for that with the "update" feature. You don't need to use a "database" in the relational/SQL sense for something as simple as this... You can even use a file system because only the last revision known as "good" should be saved.

    Now what if a tester clicks "good" without seeing something important?
    First this should be discouraged, clicking "good" should be considered harmful without a second eye (or a sequence of confirmations). Second, now that you saved tester count it is much easier to use all the testers to run the tests individually ;-)

    The end result: better quality bugs from the same testers...

    Posted by: vprise on October 16, 2006 at 02:34 PM

  • Hi David,

    I've just found (Yes I'm late ;)) your blog entry about visual comparisons when I was searching for visual tests related information via Google. I'm interested in this topic as I created something maybe similar to what you have created: a visual comparison framework in order to avoid visual regressions between two versions of our air traffic control related swing application. We have used it since spring 2006 and constantly improve it.

    Recently, we have also chosen to release it as Open Source Software: jDiffChaser ( http://jdiffchaser.sourceforge.net and http://sourceforge.net/projects/jdiffchaser). The main principle of jDiffChaser is, having a (mostly human) validated version of your application, to check whether the rendering differs in the latest development version, so as to detect rendering regressions early (bug or undocumented features ;) ). It's a simple tool as it does simple things. The current version (0.5.1) can automate visual comparisons on remote computers as well as (of course) on the localhost, choose parts to ignore during comparisons, record scenarios through a remote control frame, and produce an HTML report displaying diff images. Of course, it needs to be improved again and again as we encounter new needs in the visual tests area every week, so we're working on it. Any feedback and help is welcome.

    I'm very happy to find other people who also think that we need to find ways to help testing the rendering of an application, not only its functional part. Keep thinking!

    Posted by: jeylay on June 18, 2007 at 04:25 AM


