String's equals method isn't always enough

Posted by joconner on June 28, 2006 at 1:24 AM PDT

I read Ethan Nicholas' blog about intern'd strings with great interest. I agree with his assessment that using '==' to compare String objects is almost never correct. He suggests that String's equals method is superior. His description of the intern method is excellent, and I wouldn't want to detract from his comments. He is right on...the equals method is the right way to compare many strings. However, I think you need to know more about string comparisons, especially why equals does not provide the correct results all the time.

The Problem with String's equals

The problem shows up when you want to compare text linguistically...like you do when you use a standard word dictionary. The String class just doesn't have the ability to compare text with natural language in mind. String's equals and compareTo methods compare the individual char values in the string. If the two strings have the same length and the char value at index n in string1 == the char value at index n in string2 for every index n, the equals method returns true. So what's the problem?

The problem is that there are often multiple ways to represent the same text in Unicode. For example, the name "Michèle" can also be represented as "Miche`le" in Unicode. The second version of the name uses a "combining sequence" ('e' + '`') to represent 'è'. String's simplistic equals method says that these two Strings have different text. They are not lexicographically equal, but they are definitely equal linguistically. Combining sequences are perfectly valid representations of accented characters.

The following code snippet prints this: The strings are not equal.

String name1 = "Michèle";
String name2 = "Miche\u0300le"; //U+0300 is the COMBINING GRAVE ACCENT
if (name1.equals(name2)) {
  System.out.println("The strings are equal.");
} else {
  System.out.println("The strings are not equal.");
}
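
To see why equals reports a difference, it can help to look at the raw char values. The sketch below is only an illustration: the class name CharDump and the helper toCodeUnits are names I made up for this example. It prints each char of both spellings as a hexadecimal code unit.

```java
public class CharDump {
    // Hypothetical helper: renders each char of s as a 4-digit hex code unit.
    static String toCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name1 = "Mich\u00e8le";   // precomposed letter U+00E8
        String name2 = "Miche\u0300le";  // plain 'e' followed by U+0300
        // prints: 7 chars: 004D 0069 0063 0068 00E8 006C 0065
        System.out.println(name1.length() + " chars: " + toCodeUnits(name1));
        // prints: 8 chars: 004D 0069 0063 0068 0065 0300 006C 0065
        System.out.println(name2.length() + " chars: " + toCodeUnits(name2));
    }
}
```

The two strings don't even have the same length, so a char-by-char comparison has no chance of calling them equal.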

The Problem with String's compareTo

The compareTo method is flawed too...from a linguistic perspective anyway. It compares the char values much as equals does. You can read up on the details of the comparison in the javadoc. The end result is a very simplistic decision about which string precedes the other. Using compareTo, your code would think "Hat" precedes "cat". Why? Because uppercase 'A' through 'Z' come before lowercase 'a' through 'z' in the Unicode (and ASCII) table. Any kid knows which of these comes first in a dictionary, but the compareTo method is blissfully unaware that its simplistic evaluation would confuse most people.

The following snippet prints this: Hat < cat

String w1 = "cat";
String w2 = "Hat";
int comparison = w1.compareTo(w2); // negative if w1 sorts first, positive if w2 does
if (comparison < 0) {
  System.out.printf("%s < %s\n", w1, w2);
} else {
  System.out.printf("%s < %s\n", w2, w1);
}

When are These Results Wrong?

If you're trying to sort a list of names, the results of String's equals and compareTo methods are almost certainly wrong. If you want to search for a name, again the equals method will subtly trip you up if your user enters combining sequences...or if your database normalizes data differently from how the user enters it. The point is that String's simplistic comparisons are wrong whenever you are working with natural language sorting or comparisons. For these types of linguistic comparisons, you need something more powerful. And what might that be?

What is a Collator?

The java.text.Collator class provides linguistic comparisons. It's not as fast as String's compareTo, but it is supposed to be correct for linguistic comparisons. If correctness in that situation is important to you, you have to use this class.

If you used a Collator instance, you'd see that "cat" really does come before "Hat", and that "Michèle" and "Miche`le" can be considered the same in many situations, usually those in which natural language processing is important.
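
Here's a minimal sketch of the "cat" versus "Hat" case applied to sorting. The class name SortDemo, its two helper methods, and the word list are my own choices for illustration. It sorts the same words twice, once by char value and once with a US English Collator.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SortDemo {
    // Sorts a copy of the list by char value (String's natural order).
    static List<String> sortByCharValue(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts a copy of the list with a US English Collator as the Comparator.
    static List<String> sortLinguistically(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, Collator.getInstance(Locale.US));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "Hat", "apple", "Zebra");
        // prints: [Hat, Zebra, apple, cat]
        System.out.println(sortByCharValue(words));
        // prints: [apple, cat, Hat, Zebra]
        System.out.println(sortLinguistically(words));
    }
}
```

A Collator implements Comparator, so it drops straight into Collections.sort without any adapter code.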

You should know that a Collator is locale sensitive. That is, it performs differently depending upon the locale for which it is created. Different geographic regions compare words differently, using different rules for which letters and accents come before (and after) others. Let's look at some comparisons using a Collator object.
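
As a rough illustration of that locale sensitivity, the sketch below compares the same two words under two locales. The class name LocaleDemo, the helper relation, and the sample words are assumptions of mine. In English, 'ä' sorts with 'a', while Swedish dictionaries place 'ä' after 'z'. What the Swedish Collator actually reports depends on the collation data that ships with your JDK, so I haven't written an expected result for it.

```java
import java.text.Collator;
import java.util.Locale;

public class LocaleDemo {
    // Hypothetical helper: returns "<", "=", or ">" to describe how a
    // compares to b under the default Collator for the given locale.
    static String relation(Locale locale, String a, String b) {
        int c = Collator.getInstance(locale).compare(a, b);
        return c < 0 ? "<" : (c > 0 ? ">" : "=");
    }

    public static void main(String[] args) {
        String a = "\u00e4pple"; // starts with U+00E4, a with diaeresis
        String b = "zebra";

        // US English treats the accented letter as a variant of 'a'.
        // prints: en-US: <
        System.out.println("en-US: " + relation(Locale.US, a, b));

        // Swedish tailoring, where available, orders this letter after 'z'.
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println("sv-SE: " + relation(swedish, a, b));
    }
}
```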

The following comparison prints this: The strings are equal.

...
Collator collator = Collator.getInstance(Locale.US);
String name1 = "Michèle";
String name2 = "Miche\u0300le";
int comparison = collator.compare(name1, name2);

if (comparison == 0) {
  System.out.println("The strings are equal.");
} else {
  System.out.println("The strings are not equal.");
}

If you browse around the Collator javadoc for long, you'll notice that it has various properties that you can set to modify its comparison behavior. That stuff is interesting, and you'll need to learn more about it sometime, but it's not important for the discussion here. The main point I want you to understand is simply this: String's equals method does not work for everything. Although equals or even == might be fine for testing some types of Strings (like Button action commands), these methods are not sufficient for more complex comparisons.

The Conclusion

Take a look at the Collator class to determine when a linguistic comparison might be more appropriate than a simple == or equals check. You might be surprised to find that you've been using the wrong API in many places. I'm not saying you have to go through your code to change every occurrence of one API to the other, but you might start thinking about where a Collator is the correct choice. It's particularly useful when sorting and searching text.

One Last Thing

Can you guess which class (String or Collator) can help you match the word "Michèle" even though a user enters "Michele" (without the accent) into your application?
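
If you want to experiment with that question, here is one possible sketch. It assumes the Collator's strength property is the knob to turn, and the class name StrengthDemo and the helper sameAt are names I invented for the example. At PRIMARY strength a Collator ignores accent and case differences, so the two spellings compare as equal.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    // Hypothetical helper: true if a and b compare as equal under a
    // US English Collator set to the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String stored = "Mich\u00e8le"; // spelled with U+00E8
        String typed = "Michele";       // what the user entered

        // prints: equals:   false
        System.out.println("equals:   " + stored.equals(typed));
        // prints: TERTIARY: false
        System.out.println("TERTIARY: " + sameAt(Collator.TERTIARY, stored, typed));
        // prints: PRIMARY:  true
        System.out.println("PRIMARY:  " + sameAt(Collator.PRIMARY, stored, typed));
    }
}
```

Whether you want accents ignored depends on your application, so treat PRIMARY as a deliberate choice rather than a default.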
