The Source for Java Technology Collaboration
User: Password:



Stephen Friedrich

Stephen Friedrich's Blog

Natural String Order

Posted by skelvin on January 13, 2006 at 06:53 AM | Comments (7)

Treats contained numbers sensible when comparing strings

Natural Order
(e.g. implemented in Windows XP's file explorer)


rfc 1.txt
rfc 822.txt
rfc 2086.txt
            

Java's default Order
Somewhat stupid...


rfc 1.txt
rfc 2086.txt
rfc 822.txt
            

(Kudos to Martin Pool for the example.)

When I needed a natural sort feature in my Java app I started to google and to look in a few other places (like Apache Commons). There should be plenty of code out there that does this, right? Nope. I found a page with a description of the problem and links to implementations in a couple of programming languages: Natural Order String Comparison

There's a single Java implementation, but that one cannot handle texts that contain non-ascii characters. So it would fail in my German application, where texts contain funny characters such as these:
Im Übrigen macht's mehr Spaß, es selbst zu schreiben.

The code

Consists of a couple of static methods in class Strings: Strings.java
Here is the Javadoc.
Some junit tests: TestStrings.java

Basically there are some variants of methods compareNatural...(String, String) that directly compare strings and convenience function, if you need a string Comparator: getNaturalComparator...(). The different variants either use either the current default locale's comparison rules, allow to specify a Collator to use or work on 7-bit ascii characters only.

Java 1.5: There is a little bit of generics used (e.g. Comparator<String>). It's easy to change if you require the code to work on older Java versions.

Basics and Intricacies

Basically natural comparison works by breaking each string into parts where each part is either a number or text-only. Number parts are compared according to their corresponding numeric value while text parts are compared lexicographically. However simple this seems, there are quite some subtleties to handle.
A nice, detailed explanation is also given in this spec from the OpenOffice project: Natural Sort Option in Sort Dialog

Tokenizing

Exactly how to determine which parts of a text are numbers can be surprisingly difficult. The two complicating issues are non-integer numbers and internationalization.

My implementation handles only integer numbers, so "a 1.3" is considered less than "a 1.11" because each is broken into four parts ("a ", "1", "." and "3" resp. "11") and 3 is less than 11. The need to handle fractional numbers should arise very rarely and trying to handle these opens a can of worms (different decimal point representations in different locales, scientific formats, ...).

Digits are defined differently in different locales. Currently my implementation starts a new token whenever the value of Character.isDigit(c) changes.

Internationalization

My minimum requirement was to compare the text parts of the strings using locale specific rules (which means you can specify a java.text.Collator). The code should work fine for any language that uses the number characters '0'...'9'.

Other number systems are not tested (they will probably not work, because I do not really convert the number to a numeric value, but compare each digit separately).

Supplementary characters

Once upon a time it seemed to me that Java's native unicode support was the end of all character set problems. No more, no more: The complete Unicode definition now contains character that cannot be represented by a single (16 bit) char variable, see this article from sun http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

These supplementary characters are simply not handled correctly in my natural comparison implementation.

Numbers with leading zeros

How should the strings "p0002", "p2", "p02" be sorted?
Strictly comparing the numeric values of the number parts would leave the sort order undefined, as each of these strings will be considered equal.

I have chosen to sort by increasing numbers of leading zeros: "p2", "p02", "p0002". That way Strings.getNaturalComparator().compare(String, String) and String.equals(String) are still compatible: compare() will return 0 exactly if equals() returns true.

Whitespace

How do "p4" and "p 3" compare?

I have choosen not to ignore whitespace, so "p4" is less than "p 3". I'd like to be alerted to difference in the naming scheme, so I think this is reasonable. Maybe later I'll add an option to control the handling of whitespace.

Performance

First of all I don't think performance really matters in the typical application of natural comparison. You probably don't have hundreds of thousands of files in the same directory whose names differ only in a number, do you?

Still, the very simple approach of breaking the string into parts (effectively creating two list of strings) and then comparing each string separately might be too costly. So I took care to avoid creating new instances where possible and do all comparisons in place. For many scenarios the natural comparison is about as fast as Java's String.compareTo() method. If there are many strings that have a common prefix and contain a number that differs in trailing digits only the natural comparison is about four times slower.


Bookmark blog post: del.icio.us del.icio.us Digg Digg DZone DZone Furl Furl Reddit Reddit
Comments
Comments are listed in date ascending order (oldest first) | Post Comment

  • I actually wrote a fairly concise version of this and posted it to my blog a few weeks back. Because I'm lazy, I used regular expressions to splice out the chunks of the String and then just used Number.compareTo and String.compareToIgnoreCase. The whole implementation is much smaller, though I've no idea how it would compare runtime-efficiency wise.

    You can see the blog entry (and code) here.

    Posted by: tfenne on January 13, 2006 at 09:57 AM

  • Thanks! Just integrated your code into my application. Really nice.

    Posted by: pago on January 13, 2006 at 11:23 AM

  • what was the problem with Pierre's code? Theres nothing obviously ascii-dependent there (he is french, after all) - the only problem I could see would be with supplementary chars which you don't handle either? (ie his code always deals with chars as chars, not bytes or ints)

    Your own code has a method called compareNaturalIgnoreCaseAscii which doesn't seem to use ascii at all? It does comparisons using 'toLowerCase/toUpperCase' which use the unicode tables; and it doesn't try to use the byte value of characters, but consistently uses their UCS-2 (ie not-ascii) value?

    Posted by: bazzargh on January 13, 2006 at 04:55 PM

  • The problem with Pierre's code is that even though it indeed can compare any unicode character, it does so incorrectly by using its unicode code point.
    For example in german the the correct order for "ä" (a-umlaut) is "a", "ä", "b". Using only the numeric values gives "a", "b", "ä".
    So it cannot be used for non-Ascii characters because for those characters the desired sort order is _not_ determined by the numeric unicode code point.

    This is an actual problem: I frequently had complaints that users cannot find entries at all (When it turned out they just never looked at the end of list of last names for a name starting with an umlaut.)

    The only chance to do correct locale specific comparisons in Java is using the Collator class.

    Posted by: skelvin on January 15, 2006 at 04:23 AM

  • tfenne, I haven't found your code because I didn't searched the web for "humane sorting" ;-)
    Still, I would have had to enhance your code to do locale specific comparisons.

    Posted by: skelvin on January 15, 2006 at 04:27 AM

  • This looks great, but why is there no method "compareNaturalIgnoreCase" which sorts [a, b, ä] correctly AND ignores case?

    Perhaps I'm missing something, but I'd say this is the type of ordering most often preferred by users?


    Posted by: marc_ on January 17, 2006 at 04:19 AM

  • marc_, you can get case-insensitive sorting by setting the strength on the Collator to anything other than the default IDENTICAL (but I agree that this should be built in, if not the default).

    Collator myCollator = Collator.getInstance();
    myCollator.setStrength(Collator.TERTIARY);
    Collections.sort(myList, Strings.getNaturalComparator(myCollator));

    The trouble with using a collator is that they sort whitespace and punctuation very differently than a simple ASCII-based comparator. Spaces and hyphens are treated as word breaks but otherwise ignored unless they're needed to break ties, and the other punctuation characters have a different sort order as well. Furthermore, if you set the strength to PRIMARY, spaces and hyphens are completely ignored, which radically changes the sort order again. I think you'd get more predictable results if you were to use the collator only to compare letter sequences, and sort everything else yourself.

    Posted by: uncle_alice on January 22, 2006 at 01:33 PM



Only logged in users may post comments. Login Here.


Powered by
Movable Type 3.01D
 Feed java.net RSS Feeds