Skip to main content

Natural String Order

Posted by skelvin on January 13, 2006 at 6:53 AM PST

Treats contained numbers sensible when comparing strings

Natural Order
(e.g. implemented in Windows XP's file explorer)

rfc 1.txt
rfc 822.txt
rfc 2086.txt
           

Java's default Order
Somewhat stupid...

rfc 1.txt
rfc 2086.txt
rfc 822.txt
           

(Kudos to Martin Pool for the example.)

When I needed a natural sort feature in my Java app I started to google and to look in a few other places
(like Apache Commons).
There should be plenty of code out there that does this, right? Nope.
I found a page with a description of the problem and links to implementations in a couple of programming languages:
Natural Order String Comparison

There's a single Java implementation, but that one cannot handle texts that contain non-ascii characters.
So it would fail in my German application, where texts contain funny characters such as these:

Im Übrigen macht's mehr Spaß, es selbst zu schreiben.

The code

Consists of a couple of static methods in class Strings: Strings.java

Here is the Javadoc.

Some junit tests: TestStrings.java

Basically there are some variants of methods compareNatural...(String, String) that directly compare
strings and convenience function, if you need a string Comparator: getNaturalComparator...().
The different variants either use either the current default locale's comparison rules, allow to specify a Collator
to use or work on 7-bit ascii characters only.

Java 1.5: There is a little bit of generics used (e.g. Comparator<String>). It's easy to change
if you require the code to work on older Java versions.

Basics and Intricacies

Basically natural comparison works by breaking each string into parts where each part is either a number or text-only.
Number parts are compared according to their corresponding numeric value while text parts are compared lexicographically.
However simple this seems, there are quite some subtleties to handle.

A nice, detailed explanation is also given in this spec from the OpenOffice project:
Natural Sort Option in Sort Dialog

Tokenizing

Exactly how to determine which parts of a text are numbers can be surprisingly difficult. The two complicating issues
are non-integer numbers and internationalization.

My implementation handles only integer numbers, so "a 1.3" is considered less than "a 1.11" because each is broken
into four parts ("a ", "1", "." and "3" resp. "11") and 3 is less than 11.
The need to handle fractional numbers should arise very rarely and trying to handle these opens a can of worms
(different decimal point representations in different locales, scientific formats, ...).

Digits are defined differently in different locales. Currently my implementation starts a new token whenever the
value of Character.isDigit(c) changes.

Internationalization

My minimum requirement was to compare the text parts of the strings using locale specific rules (which means you can
specify a java.text.Collator).
The code should work fine for any language that uses the number characters '0'...'9'.

Other number systems are not tested (they will probably not work, because I do not really convert the
number to a numeric value, but compare each digit separately).

Supplementary characters

Once upon a time it seemed to me that Java's native unicode support was the end of all character set problems.
No more, no more: The complete Unicode definition now contains character that cannot be represented by a single (16 bit)
char variable, see this article from sun
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

These supplementary characters are simply not handled correctly in my natural comparison implementation.

Numbers with leading zeros

How should the strings "p0002", "p2", "p02" be sorted?

Strictly comparing the numeric values of the number parts would leave the sort order undefined, as each of these strings
will be considered equal.

I have chosen to sort by increasing numbers of leading zeros: "p2", "p02", "p0002". That way Strings.getNaturalComparator().compare(String, String)
and String.equals(String) are still compatible: compare() will return 0 exactly if equals()
returns true.

Whitespace

How do "p4" and "p 3" compare?

I have choosen not to ignore whitespace, so "p4" is less than "p 3". I'd like to be alerted to difference in
the naming scheme, so I think this is reasonable. Maybe later I'll add an option to control the handling of whitespace.

Performance

First of all I don't think performance really matters in the typical application of natural comparison. You probably
don't have hundreds of thousands of files in the same directory whose names differ only in a number, do you?

Still, the very simple approach of breaking the string into parts (effectively creating two list of strings) and then
comparing each string separately might be too costly.
So I took care to avoid creating new instances where possible and do all comparisons in place.
For many scenarios the natural comparison is about as fast as Java's String.compareTo() method. If there are many
strings that have a common prefix and contain a number that differs in trailing digits only the natural comparison
is about four times slower.

Related Topics >>