All modern search engines attempt to detect and correct spelling
errors in users' search queries. Google, for example, was one of
the first to offer such a facility, and today we barely notice when
we are asked "Did you
mean x?" after a
slip on the keyboard. This article shows you one way of adding a
"did you mean" suggestion facility to your own search applications
using the Lucene Spell Checker, an extension written by Nicolas
Maisonneuve and David Spencer.
Techniques of Spell Checking
Automatic spell checking has a long
history. One important early paper was F. Damerau's A Technique
for Computer Detection and Correction of Spelling Errors,
published in 1964, which introduced the idea of minimum edit
distance. Briefly, the concept of edit distance quantifies the
idea of one string being "close" to another, by counting the number
of character edit operations (such as insertions, deletions and
substitutions) that are needed to transform one string into the
other. Using this metric, the best suggestions for a misspelling are
those with the minimum edit distance.
Another approach is the similarity key technique, in which
words are transformed into some sort of key so that similarly
spelled and, hopefully, misspelled words have the same key. To
correct a misspelling simply involves creating the key for the
misspelling and looking up dictionary words with the same key for a
list of suggestions. Soundex is the best-known similarity key, and
is often used for phonetic applications.
A combination of minimum edit distance and similarity keys (metaphone) is at the heart
of the successful strategy used by Aspell, the leading
open source spell checker. However, it is a third approach that
underlies the implementation of the "did you mean" technique described
in this article: letter n-grams.
A letter n-gram is a sequence of n letters of a
word. For instance, the word "lucene" can be divided into four
3-grams, also known as trigrams: "luc", "uce", "cen", and "ene.". Why
is it useful to break words up like this? The intuition is that
misspellings typically only affect a few of the
constituent n-grams, so we can recognize the intended word
just by looking through correctly spelled words for those that share
a high proportion of n-grams with the misspelled word. There
are various ways of computing this similarity measure, but one
powerful way is to treat it as a classic search engine problem with
an inverted index of n-grams into words. This is precisely
the approach taken by Lucene Spell Checker. Let's see how to use it.
A Simple Search Application
We'll first build a very simple search interface that does not
include the "did you mean" facility. It defines a single method that
takes a search query string and returns a search result.
package org.tiling.didyoumean;
import java.io.IOException;
import org.apache.lucene.queryParser.ParseException;
public interface SearchEngine {
public SearchResult search(String queryString) throws IOException, ParseException;
}
The search result is a SearchResult object, which is a JavaBean that exposes a
list of hits (actually just the top hits, for simplicity) and a few
other properties. I have omitted the constructor and getters in the
listing here as they are boilerplate code. (The full source code is
available in the accompanying download--see the "References" section at the end of
the article.)
package org.tiling.didyoumean;
import java.util.List;
public class SearchResult {
private List topHits;
private int totalHitCount;
private long searchDuration;
private String originalQuery;
private String suggestedQuery;
}
Here's a very simple implementation of SearchEngine built with Lucene. It uses Lucene's
QueryParser to parse the search query string into
a Query that is then used to perform the search.
The Lucene Hits object is then mapped to an instance of our SearchResult class.
package org.tiling.didyoumean;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
public class SimpleSearchEngine implements SearchEngine {
private String defaultField;
private String nameField;
private Directory originalIndexDirectory;
private int maxHits;
public SimpleSearchEngine(String defaultField, String nameField,
Directory originalIndexDirectory, int maxHits) {
this.defaultField = defaultField;
this.nameField = nameField;
this.originalIndexDirectory = originalIndexDirectory;
this.maxHits = maxHits;
}
public SearchResult search(String queryString) throws IOException, ParseException {
long startTime = System.currentTimeMillis();
IndexSearcher is = null;
try {
is = new IndexSearcher(originalIndexDirectory);
QueryParser queryParser = new QueryParser(defaultField, new StandardAnalyzer());
queryParser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
Query query = queryParser.parse(queryString);
Hits hits = is.search(query);
long endTime = System.currentTimeMillis();
return new SearchResult(extractHits(hits), hits.length(), endTime - startTime, queryString);
} finally {
if (is != null) {
is.close();
}
}
}
private List extractHits(Hits hits) throws IOException {
List hitList = new ArrayList();
for (int i = 0, count = 0; i < hits.length() && count++ < maxHits; i++) {
hitList.add(hits.doc(i).getField(nameField).stringValue());
}
return hitList;
}
}
Note that an IOException may be thrown by Lucene if there is a problem reading the index (typically from disk). The finally clause closes the IndexSearcher, but propagates the exception to indicate the problem to the client, which is the MVC layer, in this case.
With these ingredients it is straightforward to write a user
interface that accepts user queries and presents the search results
back to the user. I chose Spring's MVC
framework for this. Since this is an article about search and not
about Spring, I won't present any of the code for the user interface
here--instead, please refer to the accompanying download.