Natural String Order
Treats contained numbers sensible when comparing strings
|
Natural Order
rfc 1.txt
rfc 822.txt
rfc 2086.txt
|
Java's default Order
rfc 1.txt
rfc 2086.txt
rfc 822.txt
|
(Kudos to Martin Pool for the example.) | |
When I needed a natural sort feature in my Java app I started to google and to look in a few other places (like Apache Commons). There should be plenty of code out there that does this, right? Nope. I found a page with a description of the problem and links to implementations in a couple of programming languages: Natural Order String Comparison
There's a single Java implementation, but that one cannot handle texts that contain non-ascii characters.
So it would fail in my German application, where texts contain funny characters such as these:
Im Übrigen macht's mehr Spaß, es selbst zu schreiben.
The code
Consists of a couple of static methods in class Strings: Strings.java
Here is the Javadoc.
Some junit tests: TestStrings.java
Basically there are some variants of methods compareNatural...(String, String) that directly compare
strings and convenience function, if you need a string Comparator: getNaturalComparator...().
The different variants either use either the current default locale's comparison rules, allow to specify a Collator
to use or work on 7-bit ascii characters only.
Java 1.5: There is a little bit of generics used (e.g. Comparator<String>). It's easy to change
if you require the code to work on older Java versions.
Basics and Intricacies
Basically natural comparison works by breaking each string into parts where each part is either a number or text-only.
Number parts are compared according to their corresponding numeric value while text parts are compared lexicographically.
However simple this seems, there are quite some subtleties to handle.
A nice, detailed explanation is also given in this spec from the OpenOffice project:
Natural Sort Option in Sort Dialog
Tokenizing
Exactly how to determine which parts of a text are numbers can be surprisingly difficult. The two complicating issues are non-integer numbers and internationalization.
My implementation handles only integer numbers, so "a 1.3" is considered less than "a 1.11" because each is broken into four parts ("a ", "1", "." and "3" resp. "11") and 3 is less than 11. The need to handle fractional numbers should arise very rarely and trying to handle these opens a can of worms (different decimal point representations in different locales, scientific formats, ...).
Digits are defined differently in different locales. Currently my implementation starts a new token whenever the value of Character.isDigit(c) changes.
Internationalization
My minimum requirement was to compare the text parts of the strings using locale specific rules (which means you can
specify a java.text.Collator).
The code should work fine for any language that uses the number characters '0'...'9'.
Other number systems are not tested (they will probably not work, because I do not really convert the number to a numeric value, but compare each digit separately).
Supplementary characters
Once upon a time it seemed to me that Java's native unicode support was the end of all character set problems. No more, no more: The complete Unicode definition now contains character that cannot be represented by a single (16 bit) char variable, see this article from sun http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
These supplementary characters are simply not handled correctly in my natural comparison implementation.
Numbers with leading zeros
How should the strings "p0002", "p2", "p02" be sorted?
Strictly comparing the numeric values of the number parts would leave the sort order undefined, as each of these strings
will be considered equal.
I have chosen to sort by increasing numbers of leading zeros: "p2", "p02", "p0002". That way Strings.getNaturalComparator().compare(String, String)
and String.equals(String) are still compatible: compare() will return 0 exactly if equals()
returns true.
Whitespace
How do "p4" and "p 3" compare?
I have choosen not to ignore whitespace, so "p4" is less than "p 3". I'd like to be alerted to difference in the naming scheme, so I think this is reasonable. Maybe later I'll add an option to control the handling of whitespace.
Performance
First of all I don't think performance really matters in the typical application of natural comparison. You probably don't have hundreds of thousands of files in the same directory whose names differ only in a number, do you?
Still, the very simple approach of breaking the string into parts (effectively creating two list of strings) and then comparing each string separately might be too costly. So I took care to avoid creating new instances where possible and do all comparisons in place. For many scenarios the natural comparison is about as fast as Java's String.compareTo() method. If there are many strings that have a common prefix and contain a number that differs in trailing digits only the natural comparison is about four times slower.
- Login or register to post comments
- Printer-friendly version
- skelvin's blog
- 4952 reads





