Search |
||
A Faster Java Regex PackagePosted by tomwhite on March 27, 2006 at 12:02 PM PST
Anders Møller's dk.brics.automaton is a Java regex
package whose main claim to fame is that it is significantly faster
then all other Java regex libraries, including the
Code Comparison
Let's start by looking at how to use the new library. First, here's a JUnit test that does a simple match using the
It's possible to do the same in one line:
Some of the naming may seem odd: the DeterminismLet's look now at the issue of determinism for this is at the heart of what makes dk.brics.automaton different.
The regular expression
Circles represent states. Arrows represent transitions with the label showing the character that matches. The start state is pointed to by the incoming arrow on the left. End states are shown as double circles. For this FSA you can see the two paths for the matches a and ab. This FSA is in fact non-deterministic, known as a Non-deterministic Finite Automaton (NFA). If you are in the start state and you read an a character then there is more than one option to take - you can either take the upper branch or the lower branch. There are various ways of coping with the non-determinism (e.g. look-ahead, or backtracking), but suffice to say, it makes the NFA algorithm more complex. If the FSA is deterministic (a Deterministic Finite Automaton, or DFA), everything is so much easier - and, crucially, so much faster. You have a simple operation to carry out after seeing each character of input: is there a transition that matches? If so, carry on till the end, otherwise stop as the match failed.
This is the trick that dk.brics.automaton uses. It converts NFAs into DFAs. Here's what it does to
Now there is only one choice in the start state, so it is deterministic. What Are Regular Expressions Exactly?So what's the catch? Why didn't Sun use the algorithm for converting NFAs to DFAs? FSAs are a great way to implement regular expressions, but unfortunately, the regular expressions we use are quite a bit more advanced than "regular expressions" in the strict theoretical sense. As Jeffrey Friedl says in Mastering Regular Expressions ... you should probably stop calling it an NFA, and start using the phrase "nonregular expressions," since that describes (mathematically speaking) the new situation. No one has actually done this, so the name "NFA" has lingered, even though the implementation is no longer (mathematically speaking) an NFA. The best example of a feature that has been added to regular expressions is the support for back references. DFAs cannot support back references, so dk.brics.automaton can't support them (see the javadoc for the regular expression language that it does support). The JDK regex implementation can, since it is NFA-based. ConclusionThe bottom line is that the JDK regex implementation supports a rich set of regular expressions, whereas dk.brics.automaton supports only the basic features. If you are not concerned about speed - for example, you are just doing a few matches on strings of modest length - then use the JDK implementation. However, if regular expressions are a bottleneck in your program and you don't need advanced regex features, then dk.brics.automaton might be a great fit. Resources
»
Related Topics >>
Programming Comments
Comments are listed in date ascending order (oldest first)
|
||
|
|