The Source for Java Technology Collaboration
User: Password:



Tom White

Tom White's Blog

A Faster Java Regex Package

Posted by tomwhite on March 27, 2006 at 12:02 PM | Comments (4)

Anders Møller's dk.brics.automaton is a Java regex package whose main claim to fame is that it is significantly faster then all other Java regex libraries, including the java.util.regex classes in the JDK. Like many things in computer science, the speed gains come at a price. In this case, the regular expression language supported is not as rich as the Perl 5 syntax that is prevalent in today's tools, including the JDK implementation (which has only minor variations from Perl 5). That said, for some applications this trade-off may be worth it. (For example, it is going to be used in Nutch - see discussion on this thread.) In the rest of this post I shall try to explore this trade-off a bit more.

Code Comparison

Let's start by looking at how to use the new library. First, here's a JUnit test that does a simple match using the java.util.regex classes:

public void testJdkRegex() {
  Pattern p = Pattern.compile("a|ab");
  Matcher m = p.matcher("ab");
  assertTrue(m.matches());
}

It's possible to do the same in one line: assertTrue("ab".matches("a|ab")), but I want to compare it with the equivalent using dk.brics.automaton:

public void testAutomatonRegex() {
  RegExp re = new RegExp("a|ab");
  RunAutomaton ra = new RunAutomaton(re.toAutomaton());
  assertTrue(ra.run("ab"));
}

Some of the naming may seem odd: the RunAutomaton class with its run method in place of the Matcher class with its matcher method. This is because dk.brics.automaton is actually a library for Finite-State Automata (FSA) - also known as Finite State Machines - and any regular expression can be implemented as an FSA. Indeed, the RegExp class is simply a vehicle for compiling string representations of regular expressions into Automaton instances. So, the naming can be understood as simply exposing the implementation to the user.

Determinism

Let's look now at the issue of determinism for this is at the heart of what makes dk.brics.automaton different.

The regular expression a|ab is compiled into the FSA shown here:

The non-deterministic finite-state automaton for the regular expression a|ab

Circles represent states. Arrows represent transitions with the label showing the character that matches. The start state is pointed to by the incoming arrow on the left. End states are shown as double circles. For this FSA you can see the two paths for the matches a and ab.

This FSA is in fact non-deterministic, known as a Non-deterministic Finite Automaton (NFA). If you are in the start state and you read an a character then there is more than one option to take - you can either take the upper branch or the lower branch. There are various ways of coping with the non-determinism (e.g. look-ahead, or backtracking), but suffice to say, it makes the NFA algorithm more complex. If the FSA is deterministic (a Deterministic Finite Automaton, or DFA), everything is so much easier - and, crucially, so much faster. You have a simple operation to carry out after seeing each character of input: is there a transition that matches? If so, carry on till the end, otherwise stop as the match failed.

This is the trick that dk.brics.automaton uses. It converts NFAs into DFAs. Here's what it does to a|ab:

The deterministic finite-state automaton for the regular expression a|ab

Now there is only one choice in the start state, so it is deterministic.

What Are Regular Expressions Exactly?

So what's the catch? Why didn't Sun use the algorithm for converting NFAs to DFAs? FSAs are a great way to implement regular expressions, but unfortunately, the regular expressions we use are quite a bit more advanced than "regular expressions" in the strict theoretical sense. As Jeffrey Friedl says in Mastering Regular Expressions

... you should probably stop calling it an NFA, and start using the phrase "nonregular expressions," since that describes (mathematically speaking) the new situation. No one has actually done this, so the name "NFA" has lingered, even though the implementation is no longer (mathematically speaking) an NFA.

The best example of a feature that has been added to regular expressions is the support for back references. DFAs cannot support back references, so dk.brics.automaton can't support them (see the javadoc for the regular expression language that it does support). The JDK regex implementation can, since it is NFA-based.

Conclusion

The bottom line is that the JDK regex implementation supports a rich set of regular expressions, whereas dk.brics.automaton supports only the basic features. If you are not concerned about speed - for example, you are just doing a few matches on strings of modest length - then use the JDK implementation. However, if regular expressions are a bottleneck in your program and you don't need advanced regex features, then dk.brics.automaton might be a great fit.

Resources

  • The best introduction I've seen to FSAs, particularly the difference between deterministic and non-deterministic ones, is the second chapter of Speech and Language Processing by Daniel Jurafsky and James H. Martin.
  • If you want to improve your practical regex skills - then Jeffrey Friedl's Mastering Regular Expressions is a must read. It's really good on how to get the best out of a NFA engine.
  • I generated the diagrams using the dot program in Graphviz, which was very easy, since the Automaton class provides a handy toDot() method.
  • The same issues of determinism and non-determinism are present in XML schemas. (E.g. see this nice summary on MSDN.) W3C XML Schemas forbid non-determinism, whereas RELAX NG schemas don't.
  • Perl 6 radically re-works regular expressions. Lots of interesting ideas here!

Bookmark blog post: del.icio.us del.icio.us Digg Digg DZone DZone Furl Furl Reddit Reddit
Comments
Comments are listed in date ascending order (oldest first) | Post Comment

  • I wonde why JDK regexp can't optimize what algorithm to use when compiling regexp. Then we would get the best parts out of both...

    Posted by: euxx on March 27, 2006 at 03:59 PM

  • Tom--nice article, very clear, thanks. I take issue only with the fact that you don't list what "advanced" regex features are not available with the automaton package--the only example you give is "The best example of a feature that has been added to regular expressions is the support for back references." Ok, what else? Is there a point-by-point list of what can't be done with DFAs? I get the (subtle) sense you are attacking a straw man; for my part, I can't recall the last time I used back references. But thanks for sharing. Patrick

    Posted by: pdoubleya on March 28, 2006 at 03:11 AM


  • euxx,
    Good question. I think it would be possible to determine which algorithm to use depending on the regex - indeed, according to Friedl's book this is a strategy some versions of grep use. However, in the context of the Java API for regular expressions I don't think it would work since the API offers a java.util.MatchResult, which dk.brics.automaton wouldn't be able to fulfil. In other words, dk.brics.automaton can only say whether something matched or not. If it did match, it can't say how.

    It might be worth trying to optimize the String.matches method, but even this is not guaranteed to work, since it depends on the cost of compiling the regex, so any speed gains a DFA has might be outweighed by a longer compile time.

    Patrick,
    Thanks for your comments. I'm afraid i don't have a point-by-point list for you. However, a look at the javadoc for dk.brics.automaton.RegExp gives you and idea of what is supported. That is, not much: alternation, repeats, character classes, unicode characters and not much more. If you compare this to the javadoc for java.util.regex.Pattern you will see the JDK API is much richer, offering all kinds of special constructs. In addition, as mentioned above, you can't find out details of a match, which in practice is often important. You can only find out if something matched or not.
    I wasn't trying to construct a straw man, I was just trying to tease out the circumstances for choosing one library over the other. I think I understand the trade-offs a bit better now. Obviously, when trying to make the decision in a particular case, profiling will have the final say.

    Cheers, Tom

    Posted by: tomwhite on March 28, 2006 at 12:24 PM

  • version 1.10-1 seems to have a serious bug?


    C:\>java -cp .;automaton.jar retest x{2,5} x xx xxx xxxx xxxxx
    0 : x
    1 : xx
    0 : xxx
    1 : xxxx
    0 : xxxxx


    1.7-2 on the site gives the expected answer: 0,1,1,1,1

    retest.java looks like:


    import dk.brics.automaton.*;
    public class retest
    {
    public static void main(String[] args)
    {
    if (args.length

    cheers
    -b

    Posted by: brandon733 on June 11, 2007 at 01:31 PM



Only logged in users may post comments. Login Here.


Powered by
Movable Type 3.01D
 Feed java.net RSS Feeds