
pvmanager: a framework to deal with real-time data (part 1)

Posted by carcassi on August 3, 2012 at 11:56 AM PDT

In a previous post I outlined some of the problems one has dealing with real-time, asynchronous data. Since I have been working on a library to handle those issues, I'll start to go through some of the design.

The library is open source and available at http://pvmanager.sourceforge.net. It is being developed and used in the EPICS accelerator control community, which comprises various laboratories and universities across the globe working on control systems for particle accelerators and light sources. Specifically, it is also used in Control System Studio, an Eclipse RCP based application that aims to provide an integrated environment for operator/engineering/physics tools.

Though the pvmanager library comes bundled with EPICS support, the core does not depend on it at all (I may physically separate the two if there is interest). I promise I won't bore you with the bits that are control system specific.

So, back to the requirements. We need a framework that:

  • Can get data from asynchronous sources
  • Allows the client to work with multiple sources, without having to code for each
  • Allows decoupling the rate of the source from the rate of the destination. Which means:
    • Rate limiting: the destination must be able to specify a maximum rate above which notifications become inefficient (e.g. a graph refreshed at 50 Hz or less, database writes batched at 1 Hz or less, etc...)
    • Rate throttling: if the destination can't keep up, automatically decrease the rate of notifications.
    • Pause/resume: keep receiving the live data, but temporarily skip the notifications.
    • For all of these, the pipeline needs to be told what to do with the extra events, and to define how it will fail
  • Allows pieces of each client pipeline to be reused (like caches, queues, mathematical operations, aggregation operators, ...)
  • Whenever you add a piece, you should not have to worry about the minutiae of how all of the previous pieces work, or how the multithreading is handled
  • Allows defining where the notification should go (e.g. the SWT UI thread, the Swing EDT, ...)
  • Provides a fluent API to build the correct pipeline for each case

In one sentence: it needs to transform an unpredictable, possibly jittery source of events into an even, predictable one, including its modes of failure (actually, that's the most important part).

The building blocks

Assuming that, at some point, we will need to notify somebody that there is new data, we first need a piece of code that has aggregated/computed that piece of data. That seems the perfect spot for an interface (preferably a SAM, i.e. a single abstract method):

public interface Function<T> {
    T getValue();
}

This naturally allows functions to be nested, like so:

public class ToString implements Function<String> {
    private Function<?> arg;
    public ToString(Function<?> arg) {
        this.arg = arg;
    }
    public String getValue() {
        return String.valueOf(arg.getValue());
    }
}
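To make the nesting concrete, here is a minimal sketch that wires a ToString around a constant leaf. The Constant class and the demo names are mine, invented for illustration; they are not part of pvmanager.

```java
// Function and ToString as sketched above; Constant is a hypothetical
// leaf function used only for this example.
interface Function<T> {
    T getValue();
}

class Constant<T> implements Function<T> {
    private final T value;
    Constant(T value) { this.value = value; }
    public T getValue() { return value; }
}

class ToString implements Function<String> {
    private final Function<?> arg;
    ToString(Function<?> arg) { this.arg = arg; }
    public String getValue() {
        // Delegates to the nested function and converts its result
        return String.valueOf(arg.getValue());
    }
}

public class NestedDemo {
    public static void main(String[] args) {
        Function<String> f = new ToString(new Constant<Integer>(42));
        System.out.println(f.getValue()); // prints "42"
    }
}
```

Because ToString only depends on the Function interface, the same class works whatever leaf (or nested subtree) it wraps.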

On the other side, we need a place for the data source to put the data. We are going to assume that the payload is already "prepared", which means it's not going to vary between notifications: it can be cached and queued. So we define a placeholder for the data:

public class ValueCache<T> implements Function<T> {
    public T getValue() { ... }
    public void setValue(Object newValue) { ... }
    ...
}
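The elided bodies might look like the following sketch. The runtime type check through a Class<T> token is my assumption for this example, not necessarily how pvmanager implements it.

```java
// Assuming the Function<T> interface defined earlier in the post.
interface Function<T> {
    T getValue();
}

// A minimal sketch of ValueCache: a leaf function holding the latest
// payload written by the data source.
class ValueCache<T> implements Function<T> {
    private final Class<T> type;
    private T value;

    ValueCache(Class<T> type) {
        this.type = type;
    }

    public T getValue() {
        return value;
    }

    public void setValue(Object newValue) {
        // Fails fast if the source delivers a payload of the wrong type
        this.value = type.cast(newValue);
    }
}
```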

We will end up with a tree made of Function objects, with the leaves being ValueCaches. But some of these functions need to be special: part of the tree must be calculated at the source rate (i.e. every time a new asynchronous event happens) while another part must be calculated at the desired rate (i.e. up to the rate limit, and only if the destination is ready to get another notification). In between, we have the objects that will need to decouple the rate. We define them as:

public abstract class Collector<T> implements Function<List<T>> {
    public abstract void collect();
    public abstract List<T> getValue();
}

Here's where the decoupling happens. When the data source has a new notification, it will tell the collector that there is new data to be taken. The collector will store it, and keep it ready for when its parent function asks for the value. At that point, it may have accumulated zero, one or many payloads. The collector could be implemented as a queue:

public class QueueCollector<T> extends Collector<T> {
    public void collect() {
        T newValue = function.getValue();
        ...
        buffer.add(newValue);
        ...
    }

    public List<T> getValue() {
        ...
        List<T> data = new ArrayList<T>(buffer);
        buffer.clear();
        return data;
        ...
    }
}

Or it could be a cache, a time cache (i.e. the last n seconds of data), and so on. Each client will have its own graph of functions, which will look like the attached figure (FunctionGraph.png).
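The time cache variant could be sketched as follows. The class and method names here are illustrative, not pvmanager's: unlike the queue, a read does not drain the buffer; values simply expire once they fall out of the time window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a time-based collector: keeps every value
// collected in the last windowMs milliseconds.
class TimeCacheCollector<T> {
    private final long windowMs;
    private final List<Long> timestamps = new ArrayList<Long>();
    private final List<T> values = new ArrayList<T>();

    TimeCacheCollector(long windowMs) {
        this.windowMs = windowMs;
    }

    public synchronized void collect(T newValue) {
        timestamps.add(System.currentTimeMillis());
        values.add(newValue);
        prune();
    }

    public synchronized List<T> getValue() {
        prune();
        // Return a copy; the cache itself is not drained by reads
        return new ArrayList<T>(values);
    }

    // Drops every value older than the time window
    private void prune() {
        long cutoff = System.currentTimeMillis() - windowMs;
        while (!timestamps.isEmpty() && timestamps.get(0) < cutoff) {
            timestamps.remove(0);
            values.remove(0);
        }
    }
}
```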

Thread policy

We want to be able to add a new Function class without getting bogged down in the synchronization policy. We establish the following rules:

  • Each Function object lives in one and only one pipeline
  • Each Function object can assume it is always called on the same thread (i.e. it will be synchronized by the framework)

This way, the state of the Function object can simply be member variables, without extra locking, ThreadLocal variables, or thread-safe data structures.

On the left side of the graph, the root function must always be called while holding the same lock. For example, the function object itself could be used as the lock (or the object responsible for the final notification):

    ...
    Object newValue;
    synchronized (function) {
        newValue = function.getValue();
    }
    ...

This will make sure that everything that happens at the desired rate will be properly synchronized. On the other hand, the data source will need to make something like this:

    ...
    Object newValue = extractNewValue(dataSourcePayload);
    synchronized (collector) {
        cache.setValue(newValue);
        collector.collect();
    }
    ...

This will make sure that everything that happens at the source rate is properly synchronized. Parts that depend on different collectors can still run in parallel, because they are independent. But notifications that work on the same collector will be serialized.

Naturally, the collector needs to be written carefully, because it is the only piece that transitions the data from one thread to the other. If we go back to our queue, we should add the appropriate synchronization:

public class QueueCollector<T> extends Collector<T> {
    public void collect() {
        T newValue = function.getValue();
        ...
        synchronized(buffer) {
            buffer.add(newValue);
            ...
        }
    }

    public List<T> getValue() {
        synchronized(buffer) {
            List<T> data = new ArrayList<T>(buffer);
            buffer.clear();
            return data;
        }
    }
}

We make sure that the buffer is properly locked.
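Putting the two sides together, here is a small end-to-end sketch under the locking rules above: a "source" thread fills a synchronized queue collector at its own rate, and the client later drains everything in one batched notification. The class names are mine, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the QueueCollector above, with the
// collect/getValue pair synchronized on the internal buffer.
class SimpleQueueCollector<T> {
    private final List<T> buffer = new ArrayList<T>();

    public void collect(T newValue) {
        synchronized (buffer) {
            buffer.add(newValue);
        }
    }

    public List<T> getValue() {
        synchronized (buffer) {
            List<T> data = new ArrayList<T>(buffer);
            buffer.clear();
            return data;
        }
    }
}

public class DecouplingDemo {
    public static void main(String[] args) throws InterruptedException {
        final SimpleQueueCollector<Integer> collector = new SimpleQueueCollector<Integer>();

        // Source rate: 100 events, as fast as the source can produce them
        Thread source = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 100; i++) {
                    collector.collect(i);
                }
            }
        });
        source.start();
        source.join();

        // Desired rate: one batched notification with all queued events
        List<Integer> batch = collector.getValue();
        System.out.println("received " + batch.size() + " events in one batch");
    }
}
```

Note how the client code never sees the source's threading: it just asks the collector for whatever has accumulated since the last read.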

What we have so far

With these simple definitions, we have:

  • Building blocks to create a pipeline to aggregate and transform data
  • A simple locking policy for all the pipelines we are going to create (no more: "ah, in this case I am doing this, but here I am doing this other thing, ...")
  • Building blocks can be recycled across client pipelines

In the future, we'll need to understand how to hook this up to a data source, how to hook up the final notification, how to actually create the whole tree, and so on.

Attachment: FunctionGraph.png (70.48 KB)