Skip to main content

Wish list for java array (or numeric collections)

Posted by carcassi on September 1, 2012 at 6:53 AM PDT

If you have been doing numeric calculations in Java for some time, you'll have learned to both love and hate numeric arrays.

For the love part:
* they are safe. No garbage data. All access is guarded.
* they are fast. A simple loop to calculate the average of an array of 100,000 elements on JDK 1.7, Intel Core i7 Q840 1.87GHz, takes about 100,000 nanoseconds, or one nanosecond per element.

For the hate part:
* there are 6 number arrays (byte, short, int, long, float double). If you need to write a function that works with all, you either have to convert (expensive) or provide 6 implementations (can't use primitive with generics, and generics and arrays don't mix well anyway)
* you can't have immutable references. If you return an array from an object, you have the option of being efficient but unsafe (you return the array directly, thus exposing your internal state) or being safe but inefficient (you return a copy, even if 90% of the uses are reads)
* you can't give "views" of other arrays. To pass a sub-array you need to actually copy it. NIO buffers, for example, have to implement that functionality, like so many other have to in other contexts, including me.
* their are concrete classes. Or better, we don't have a superclass for primitive/number collections.

This props up various attempts, all incompatible, to patch these deficiencies. If you search in ohloh for DoubleList
http://code.ohloh.net/search?s=DoubleList&browser=Default&pp=0&fl=Java&m...
you'll find some of them.

Naturally, I am in that group too. I have my "solution", which can't really be a solution because it's not standard. So, if I have to use some java math library to do something, I still have to convert in and out. Yet, I'll go through some of the details, because some things do work, and it's the kind of things I'd like to see.

What we need are numeric collections. The current collections don't mesh well, for the usual reasons: they are collection of objects, not primitives; generics does not work with primitives; generics is invariant.

The basic thing one need is an iterator for external iteration (yes, internal iteration is nice, but it becomes difficult when you need to iterate over more than one collection at a time, possibly in different patterns). Following both the java.lang.Number and java.util.Iterator conventions, I have:

public interface IteratorNumber {
   
    boolean hasNext();
    float nextFloat();
    double nextDouble();
    byte nextByte();
    short nextShort();
    int nextInt();
    long nextLong();
}

In the same way that Iterator allows iteration of a collection, and Number allows to use a numeric value "without caring" with a single binding, this interface allows to iterate a numeric collection with a single binding. One can also image implementation of these that do a safe cast (i.e. throw an exception if the convertion overflows), or wrappers that do that. Again, similarly to java.util.Number and its sub-classes, I have:

public abstract class IteratorByte implements IteratorNumber {

    @Override
    public float nextFloat() {
        return (float) nextByte();
    }

    // Implement all except nextByte()
   
}

This accomplishes two purposes: when implementing I only have to implement one method, and allows the client to detect what type of numer it is (a byte instead of a double). So that, in case one _does_ want to have different bindings for the different types, one can still do it!

Some kind of mechanism like this is the basic minimum. One could think of using Iterator and Iterator, but they have the following problems:
* next().doubleValue() means there was an object created in between (the Number or Double). If the code can be inlined and escape analysis kicks in, that's not an issue. But, given that this is a basic interface, and you may have several implementations, this would not (currently) happen.
* given that generics are invariant, you'd have to always use Iterator instead of Iterator. And working with wildcards all over the places is not pretty.
On the other side, IteratorDouble as implemented before, could easily implement Iterator too. The Double next() would only have one implementation (in the IteratorDouble superclass), which would refer to the nextDouble() call, so it could be inlined more easily, escape analysis may kick in and so on.

Regardless of the details: I think there is a need for a numeric iterator! :-)

Apart from the iterator, we need some basic collections. With lack of imagination (which is often enemy of good design), I have a CollectionNumber, which only supports sequential access, a ListNumber, which support random/indexed access. These have abstract implementations in CollectionDouble and ListDouble, again so that most of the implementation is done and so that a consumer can tell which type is in the collection.

Lastly, there is a set of ArrayDouble, ArrayFloat, ... which are just wrappers around arrays. And here's the punchline: if you code to these _directly_, through the magic of inlining, you suffer _no_ performance loss. Zero. Which means: using these set of interfaces, I am losing _nothing_. I can still provide 6 different bindings to ArrayXxx, which would all perform in the same way like I'd do with plain arrays. But, if performance is not that criticial, I _can_ code to ListNumber or CollectionNumber, and cover all cases in one. In fact, if only a couple of concrete classes are instanced, you'll also suffer no performance loss.

Again, regardless of the details: we need something like this.

I want to be able to make the decision to code to the general case first. Once I see that performance is an issues, I can take, say, the ArrayDouble case and give a second implementation for that, simply by switching on the type of the class, and then rewriting the same method with a different parameter type (the rest stays the same). In fact, it'd be nice if the VM did that for me, but that's another story.

The other nice thing about the wrapper array class is that I can disable the write. It's "read-only", not as good as immutable, but in many cases it works the same.

You can also create a "safe copy wrapper", which if accessed in read only, does not copy, but at the first write, would make the copy of the array.

These few things cover at least the problems I mentioned before. The solution I showed I think fits already the patterns established by java.lang.Number and the Collection classes. So at least does not feel "foreign".

Related Topics >>

Comments

Your article about arrays of primitive types struck a very ...

Your article about arrays of primitive types struck a very strong chord with me. I am an (amateur) astrophotographer and I do all my post-processing with my own Java code (being a retired software developer). Images from my camera have 21 megapixels at 16 bits per pixel per colour channel. So to pack 21 megapixels in smallest memory each value needs to be of type short but the image is then still over 120 Mbytes in memory. I need to stack multiple exposures to reduce noise at high ISO sensitivity, so I use accumulator images of 32 bits per channel - int. For post-processing, convolutions are done fastest in frequency space so I need (complex) pairs of floating-point images. For that I use 64-bit (double) images rather than waste time programming all the same thing again for float. I use 8 bits per channel for final results so I can save as JPEG for the web. So I ended up programming 4 different image classes: Image8, Image16 and Image32 (all integer), and Image64 (double). Because of the number of different image processing operations I want to do, each of those classes is nearly 5,000 lines of code. All because at the heart of each is a 3D data array which has to be of primitive type, with the fastest possible access for whole-image operations (by which I mean I need direct access to the data array within each class in order to loop through it efficiently).

What I really wanted to write (just once!) was

public class Image <T> implements Image
{
private T data [][];

//...
}

where T is a primitive data type. Let the compiler then generate multiple classes if necessary.

Another issue was the 6-byte overhead per component array in Java's multidimensional arrays. If you naively represent the image data as [x][y][b], where b is band (aka channel), a 3-channel (RGB) image occupies twice as much memory as [b][i], where i = x + y * width. That is really significant with complex pairs of 21-megapixel 64 bits-per-channel images! And for this reason too I had to implement my own image file reader/writer. Otherwise the data gets loaded in an inefficient structure that would have to be rearranged into my preferred one. Some compromise was needed so I do use JAI ImageIO for 8- and 16-bit images (TIFF and JPEG). And that means I do use java.awt.BufferedImage for holding those smaller images, to avoid time-consuming rearrangement).

Yes, a way of handling primitive types generically would be great but it must not compromise efficieny. I was amazed to discover recently that an in-place butterfly FFT of a 2048x2048 3-channel 64-bit double image takes less than 3 seconds on my Win 7 laptop. If you think about it that is doing over 12,000 FFTs of linear 2048-element double arrays in that time. It will be hard to keep that kind of efficiency in a more generic structure unless the compiler can convert my Image to Image8, Image16, etc automatically.

(I have not yet put a version of my software onto my web site with the FFT capability because I am bogged down on certain issues over convolution kernels - oh yes, primitive arrays again!)

Hi grelf, Those are exactly the same problems we are all ...

Hi grelf,

Those are exactly the same problems we are all stuck with... Multidimensional array is another one that I didn't touch here. What people typically do is store everything in a flat array, and keep the sizes of each dimension:

double[] data;
int[] sizes;

You can find a number of examples in various projects in scientific computing. And again: it would be at least nice to have a common interface for it!