
Fun with Lambdas

Posted by kcpeppe on November 10, 2013 at 6:34 AM PST

I just finished delivering a talk at Oredev 2013 on better concurrency in Java 8. With lambdas being the biggest new feature, I naturally needed to address what they had to offer. I wanted a meatier example than the ones where you write a query to find all the people who make more than you do, so I decided to build the example around summing up all the application stopped time records in a garbage collection log file. These records, injected into the log at every safepoint, tell us how long our application threads have been halted while waiting for the JVM to do some maintenance task (garbage collection being but one of them). The line below is an exemplar of the records of interest. What follows is a rambling account of how I got from the imperative style of code that we're used to seeing to the functional style that we'll be able to use come the GA of Java 8.

Total time for which application threads were stopped: 0.0002130 seconds

The first step in summing up application stopped time is to find the record. To do this I’ve used the regular expression "stopped: (\\d+\\.\\d+)". This expression also puts the time into a matcher group. Thus the imperative code we’re used to writing looks like this.

Pattern pattern = Pattern.compile("stopped: (\\d+\\.\\d+)");
BufferedReader reader = new BufferedReader(new FileReader(gcLogFileName));
double value = 0.0d;
String logRecord;
while ((logRecord = reader.readLine()) != null) {
    Matcher matcher = pattern.matcher(logRecord);
    if (matcher.find())
        value += Double.parseDouble(matcher.group(1));
}

That doesn't look too bad, but if we are to believe all the lambda hype, I should be able to do better. On closer inspection we can quickly note that if I wanted to parallelize this code, I'm pretty much obliged to rewrite the whole thing. If one of the great hopes for lambdas is to provide us with better abstractions for parallelism, then, yes, I should be able to do better using lambdas. So let's start by sorting out the steps necessary to convert to a more functional style of coding.

The first thing I wanted to do was stream the data from the GC log file. I typed in new BufferedReader(new FileReader(fileName)) and as I hit the ".", a code completion window popped up suggesting I use the method BufferedReader::lines, which returns a Stream. Having not seen that method before, I decided to take a look at the javadoc. It turns out that there is a whole host of new methods added not only to existing classes but also to interfaces! Being able to put a default method into an interface is a new feature that relieves the implementors of that interface from being forced to implement every method it defines. More to the point, I guess I'm going to have to cruise through the javadoc to not only find new packages and classes but also survey all the old classes for new methods whose sole purpose is to support lambdas. But I digress.
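To make the default method idea concrete, here is a minimal sketch of an interface carrying a default method; the LogSource interface and its method names are hypothetical, invented for illustration rather than taken from the JDK.

import java.util.stream.Stream;

interface LogSource {
    // Implementors only have to supply the raw lines...
    Stream<String> lines();

    // ...and they inherit this method for free, courtesy of the default keyword.
    default Stream<String> stoppedTimeRecords() {
        return lines().filter(line -> line.contains("Total time for which application threads were stopped"));
    }
}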

Getting back to the task at hand, once I was able to get hold of a stream I figured I was home free. After all, now all I had to do was filter the file for all of the lines containing my pattern, extract the value using a matcher group and then sum them up. How hard could that possibly be? But then, we all know the devil is in the details, and in this case the details, in the form of how Matcher works, did turn out to be a wee bit devilish. As you can see in the listing below, I not only needed to get a matcher from the pattern, I had to call find() on it in order to have it calculate the groups. The Pattern and Matcher APIs work well and fine when using an imperative style of coding, but using them in a functional style is a bit of a challenge.

private static Pattern applicationStoppedTimePattern = Pattern.compile("stopped: (\\d+\\.\\d+)");

public static void main(String[] args) throws Exception {
    final String gcLogFileName = "gc.log";

    Predicate<String> applicationStoppedTime =
        line -> applicationStoppedTimePattern.matcher(line).find();

    String stats = new BufferedReader(new FileReader(gcLogFileName))
        .lines()
        .filter(applicationStoppedTime)
        .mapToDouble(line -> extractStoppedTime(line))
        .summaryStatistics().toString();
}

public static double extractStoppedTime(String line) {
    Matcher matcher = applicationStoppedTimePattern.matcher(line);
    if (matcher.find()) {
        return Double.parseDouble(matcher.group(1));
    } else {
        return 0.0d;
    }
}

The code creates a BufferedReader as before, but then to create a stream the method lines() is called. The stream is filtered using the Predicate that matches each line against the pattern. This leaves us with all the strings that match the pattern, but to get the double value from a matcher I need to first call find(), and that returns a boolean. I need the matcher itself so that I can call group(), and here is where the fun begins. My first instinct was to write a helper method that would both execute find() on the matcher and then return the proper group, but this just didn't feel right. After scratching about in the javadoc and some of the source code and using Google, I resigned myself to doing something different if this was all going to work. I was going to have to map to the matcher and then use it to extract the String from the group() call. The result can be seen in the code below.

new BufferedReader(new FileReader(gcLogFileName))
        .lines()
        .filter(applicationStoppedTime)
        .map(line -> applicationStoppedTimePattern.matcher(line))
        .mapToDouble(matcher -> (matcher.find()) ?
                                      Double.parseDouble(matcher.group(1)) : 0.0d)
        .summaryStatistics().toString();

This looked better and it seemed to be going in the right direction, but it still contained an ugly if-then-else disguised as a ternary (?:) operator. To come to a pure solution I wanted to get rid of the if-then-else, but it wasn't clear to me how that was going to happen. At this point I stumbled into a curious Brian Goetz, undoubtedly *the* best tech support person I could run into for this problem. After complaining that I hadn't used a try-with-resources to close the file, he took a look at what I did manage to get done. After a bit of discussion where we threw around a couple of ideas he came to an aha moment, and out popped the code below.

new BufferedReader(new FileReader(gcLogFileName))
        .lines()
        .map(line -> applicationStoppedTimePattern.matcher(line))
        .filter(matcher -> matcher.find())
        .mapToDouble(matcher -> Double.parseDouble(matcher.group(1)))
        .summaryStatistics().toString();

The trick used here is to first map all of the GC log lines to a matcher and then filter on Matcher::find(). After that we can safely parse group(1) and use mapToDouble() to feed summaryStatistics(). Finally, a pure solution that eliminates the if-then-else. Better yet, instead of a filter->map->filter->map, the code now does a map->filter->map, which runs about 2x faster.

After thinking about things for another minute I realized I'd just experienced a red shirt/blue shirt moment, and I was wearing the red shirt. I had beamed down to planet Lambda with pretty much every example out there saying the first thing you do to a stream is filter it. This makes sense in that I didn't want every record in the GC log, only the ones reporting on application stopped time. Unfortunately I fell victim to a slavish adherence to my requirements, and all of the examples reinforced the boxed-in thinking that led me to the original, less than optimal solution. By not understanding that Pattern and Matcher were designed in the tradition of a boilerplate-laden imperative style of coding, I fell into the trap that led me away from a better solution.

Next step: how much cruft can I squeeze out of the code? For example, line -> applicationStoppedTimePattern.matcher(line) can be expressed as applicationStoppedTimePattern::matcher, and matcher -> matcher.find() can be shortened to Matcher::find. Furthermore, there is a new method Files::lines that will stream a file. After applying all these changes I ended up with this:

Files.lines(new File(gcLogFileName).toPath())
        .map(applicationStoppedTimePattern::matcher)
        .filter(Matcher::find)
        .mapToDouble(matcher -> Double.parseDouble(matcher.group(1)))
        .summaryStatistics().toString();

Finally we are able to see one of the advantages of this version of the code over the original: we can parallelize it with a call to parallel() just before we start the mapping process. The change looks like this: Files.lines(new File(gcLogFileName).toPath()).parallel().map(...
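For reference, here is roughly what the final version might look like as a complete program, wrapped in the try-with-resources that Brian asked for and with parallel() added. The class name and the println are my own scaffolding; everything else follows the pipeline above.

import java.io.File;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class StoppedTimeSummary {

    private static final Pattern applicationStoppedTimePattern =
            Pattern.compile("stopped: (\\d+\\.\\d+)");

    public static void main(String[] args) throws Exception {
        final String gcLogFileName = "gc.log";

        // try-with-resources closes the underlying file when the stream is closed
        try (Stream<String> lines = Files.lines(new File(gcLogFileName).toPath())) {
            System.out.println(
                lines.parallel()
                     .map(applicationStoppedTimePattern::matcher)
                     .filter(Matcher::find)
                     .mapToDouble(matcher -> Double.parseDouble(matcher.group(1)))
                     .summaryStatistics());
        }
    }
}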

What would I like to do next?

I would like to add another predicate so that I might capture application running time from the log, but I don't see a great way of doing this at the moment, and doing so would require a custom Consumer. I'd also like to push the stream through more than one lambda in a single pass. To do that I'd need to fork() the stream, and that functionality isn't included in this implementation.
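For now, the obvious workaround is simply to stream the file twice, once per record type. A quick sketch, assuming the running time records look like "Application time: 0.1234567 seconds" (an assumption about the log format on my part, not something shown above):

Pattern stoppedPattern = Pattern.compile("stopped: (\\d+\\.\\d+)");
// assumed format of the application running time records
Pattern runningPattern = Pattern.compile("Application time: (\\d+\\.\\d+)");

try (Stream<String> lines = Files.lines(new File(gcLogFileName).toPath())) {
    System.out.println("stopped: " + lines
            .map(stoppedPattern::matcher)
            .filter(Matcher::find)
            .mapToDouble(matcher -> Double.parseDouble(matcher.group(1)))
            .summaryStatistics());
}

// a second, separate pass over the same file for the second record type
try (Stream<String> lines = Files.lines(new File(gcLogFileName).toPath())) {
    System.out.println("running: " + lines
            .map(runningPattern::matcher)
            .filter(Matcher::find)
            .mapToDouble(matcher -> Double.parseDouble(matcher.group(1)))
            .summaryStatistics());
}

Two passes over the same file is exactly the duplication a fork() or spray() would avoid.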

On the surface it appears that the implementation of lambdas has been driven by the requirement to support map->reduce-like functionality. Certainly the chunking behaviour defined by the Spliterator class helps simplify support for map->reduce in that it's easier to parallelize. However, there is another model that I believe lambdas could address, and that is the case where you have multiple consumers interested in the same stream. In this case, I would like to fork() or spray() the stream to a number of consumers, each filtering the stream polymorphically. For example, there is a ton of useful data in a GC log file, and now that I've got this ability to stream it at me, I'd really like to be able to have multiple consumers pick out different bits related to different events. I can imagine that new XML parsers might also like to take advantage of this capability.

This is not a criticism of lambdas, but as Brian responded when I asked him for this functionality, everyone has their pet feature that they'd like to see and he needed to focus on things that are generally useful. Point taken: unless more people get involved and demonstrate a need to not only split but fork or spray a stream, I'm (rightfully) going to have to write my own fork(). Another point Brian made is that streaming to multiple consumers requires one to buffer the stream, which will consume more memory, and that in turn has its own dangers. For example, a slow consumer might result in an OOME, so this has to be done with care. One last point simply has to do with a complexity budget: introducing a wide variety of functionality in one shot requires developers to take on board a fair amount of complexity.

I've watched Brian working incredibly hard to move the platform forward, and I'm sure there are a number of other features that they recognized would be useful but simply had to let fall on the cutting room floor due to the pressure of having to meet a deadline. So, we might not have all that we want, but what we do have looks pretty promising, and remember, we're only at the beginning.


Comments


Very nice - I like the step-by-step exposition. The thing I take with me is your warning "Unfortunately I fell victim to a slavish adherence to my requirements and all of the examples that re-enforced my boxed in thinking ...". Indeed, that happens to me too often.

With regard to your example, when one revisits the original imperative code, one sees that it does exactly what your final version does: iterate over the lines, get a matcher for each line, disregard any non-matching lines, get the matched group as a double. It's a 1:1 correspondence, except the imperative version is inherently serial and less concise.

Had you not been misled by your "filter-first" expectation, you'd have been able to do that code transformation with ease, instead of taking such a circuitous route. (And we'd not have this most instructive blog post.)

-- Sebastian