SIMD for LinAlg API: Part I
For those who are not familiar with SIMD, which is short for Single Instruction, Multiple Data, I will give a short introduction. If you have never worked with low-level assembly or C code, SIMD is a bit hard to understand, but I will try to explain it. Using SIMD requires an understanding of how a CPU works and how it moves data between memory and its various registers. But this is just the beginning of the story. Since SIMD is implemented by different CPU vendors such as Intel and AMD, there are, unfortunately, different SIMD instruction sets for different CPU architectures.
SIMD was introduced to speed up data processing on PCs, in particular for applications that need to perform many complex operations on a lot of data quickly. Examples are games, animation and other multimedia applications.
If you run those applications on a CPU without SIMD, each arithmetic instruction usually processes only one data item at a time. The applications mentioned above often need to apply the same operation to more than one data item at once. Thus, the idea of SIMD was born. Using SIMD in your application basically means that a single CPU / FPU instruction processes multiple data items in parallel.
Sounds nice, right? So why don't all applications benefit from this great technology? Well, to be honest, it is not a problem of the technology but rather a problem of implementation. If you want to support SIMD, you will have to provide an assembly language implementation of the functions that require fast CPU processing.
Today, there are various libraries available, e.g. from Intel, that provide common math functions with SIMD support. Yet if your API only needs SIMD support for a few functions, pulling in another library would just bloat your own code.
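To make the one-value-at-a-time processing described above a bit more concrete, here is a minimal sketch in plain Java. The class and method names are made up for illustration. The loop performs four separate additions, whereas an SSE instruction such as ADDPS adds four packed 32-bit floats in a single instruction. Whether the JVM's JIT compiler turns a loop like this into SIMD code by itself is up to the JVM and not something you control from Java.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of any existing API.
final class ScalarVersusPacked {

    // Scalar (SISD) style: one addition per loop iteration.
    static float[] addScalar(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        float[] out = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        // Four scalar additions; a packed SIMD add would need one instruction.
        System.out.println(Arrays.toString(addScalar(a, b))); // [11.0, 22.0, 33.0, 44.0]
    }
}
```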
MMX, SSE and 3DNOW
As I said above, there are different SIMD implementations available today. The two most popular ones, in particular in 3D games and real-time applications, are Intel's Streaming SIMD Extensions, called SSE, and AMD's 3DNOW. MMX, which is usually expanded as Multi Media Extension, is not used that much today.
Intel and AMD are the two global players on the CPU market today, so if you write a cross-platform API you will have to support both of these CPU architectures.
Both SSE and 3DNOW make use of special FPU (Floating Point Unit) registers that can perform complex math operations, such as vector computations, on several values at once. For example, if you want to compute the length of a vector, that is the square root of the sum of its squared components, you can move the component values into the corresponding register and let the CPU compute the square root there. The fast approximate variants of these instructions are typically based on a lookup table inside the CPU. This can be faster than computing it with, e.g., the Java Math.sqrt() method.
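For reference, this is roughly what the scalar Java version of the vector length looks like. It is only a baseline sketch with made-up names, the kind of code a SIMD implementation would replace and would have to agree with (up to rounding, if an approximate instruction is used).

```java
// Hypothetical illustration class: scalar baseline for the vector length.
final class VectorLength {

    // length = sqrt(x*x + y*y + z*z), computed with Math.sqrt().
    static float length(float x, float y, float z) {
        return (float) Math.sqrt(x * x + y * y + z * z);
    }

    public static void main(String[] args) {
        System.out.println(length(3f, 4f, 0f)); // 5.0
    }
}
```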
64 Bit FPU data types
The main reason why SIMD is so much faster than SISD, which means Single Instruction, Single Data, is that it makes use of wide registers that hold several values at once. MMX and 3DNOW use 64-bit registers that can store and process two 32-bit values, also known as DWORDs or double words, and these registers are mapped onto the FPU registers. SSE adds its own 128-bit registers, each of which holds four 32-bit floating-point values. This also has the advantage that the CPU can process normal application data and delegate complex math operations to these units. So, basically, a fast FPU and plenty of RAM will speed up a 3D application.
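The idea of one 64-bit register holding two 32-bit values can be pictured in Java with a long that carries the raw bits of two floats. This is only an illustration of the data layout, with hypothetical names. It does not make Java use any SIMD register.

```java
// Hypothetical illustration class: two 32-bit floats packed into one 64-bit long.
final class PackedPair {

    // Put 'lo' into bits 0..31 and 'hi' into bits 32..63.
    static long pack(float lo, float hi) {
        long loBits = Float.floatToRawIntBits(lo) & 0xFFFFFFFFL; // mask off sign extension
        long hiBits = ((long) Float.floatToRawIntBits(hi)) << 32;
        return hiBits | loBits;
    }

    // Read the lower 32 bits back as a float.
    static float low(long packed) {
        return Float.intBitsToFloat((int) packed);
    }

    // Read the upper 32 bits back as a float.
    static float high(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    public static void main(String[] args) {
        long packed = pack(1.5f, -2.25f);
        System.out.println(low(packed) + " " + high(packed)); // 1.5 -2.25
    }
}
```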
32 and 64 Bit Architectures
With the emergence of 64-bit architectures, things become more complicated. If you want to support any imaginable PC architecture, you have to provide implementations for at least Intel 32-bit and 64-bit and AMD 32-bit and 64-bit architectures. So you will end up with at least four small projects that provide almost the same functionality.
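On the Java side, one possible way to choose between the 32-bit and 64-bit builds is to look at the os.arch system property. The sketch below assumes two hypothetical library names, linalg_simd_x86 and linalg_simd_x64. Note that os.arch only tells you the bitness, not whether the CPU is from Intel or AMD, so the vendor would still have to be detected at runtime in native code.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: maps the JVM architecture name to a native library name.
final class NativeLibraryName {

    // The library names returned here are assumptions made for this example.
    static Optional<String> forArch(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            return Optional.of("linalg_simd_x64");
        }
        if (arch.equals("x86") || arch.matches("i[3-6]86")) {
            return Optional.of("linalg_simd_x86");
        }
        return Optional.empty(); // not an x86 platform: no SIMD build available
    }

    public static void main(String[] args) {
        String osArch = System.getProperty("os.arch");
        System.out.println(osArch + " -> " + forArch(osArch).orElse("(pure Java fallback)"));
    }
}
```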
Java, C, Assembly
Now that you know your various "stakeholders" on the system side, the "stakeholders" on the software side show up. If you want to support any of the SIMD technologies above, you will have to write a JNI wrapper in C that wraps the assembly code. Java currently lacks the ability to load native assembly files directly, so you have to use C.
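The Java half of such a JNI bridge could look roughly like the sketch below. The library name linalg_simd and the native method nativeDot are assumptions made for this example, and the C wrapper behind them is not shown. If the library cannot be loaded, the class quietly falls back to a plain Java loop.

```java
// Hypothetical Java side of a JNI wrapper with a pure Java fallback.
final class SimdMath {

    // "linalg_simd" is an assumed library name for this sketch.
    private static final boolean NATIVE_AVAILABLE = tryLoad("linalg_simd");

    private static boolean tryLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // library missing or not permitted: use the fallback
        }
    }

    // Would be implemented by the C wrapper around the assembly code (not shown).
    private static native float nativeDot(float[] a, float[] b);

    static boolean isNativeAvailable() {
        return NATIVE_AVAILABLE;
    }

    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        if (NATIVE_AVAILABLE) {
            return nativeDot(a, b);
        }
        float sum = 0f; // pure Java fallback
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("native available: " + isNativeAvailable());
        System.out.println(dot(new float[]{1f, 2f, 3f}, new float[]{4f, 5f, 6f}));
    }
}
```

Keeping the fallback in the same class means callers never need to know whether the native library was found.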
What do we learn from that? Well, there are many aspects to consider and many pitfalls to fall into, but, as you will see in another blog post, things are not that complicated, at least if you have already had a look at SIMD before. Basically, you will need a few C structs and some functions that provide the assembly code, and a Java class that loads the library, including error handling etc. The most important thing when working with SIMD is detecting the processor type and the processor features using the CPUID instruction. The CPUID instruction is available on all common x86 processors from Intel and AMD.
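Java cannot execute CPUID itself, so the register values have to come from native code. Once they are available, decoding the feature flags is simple bit testing. The sketch below assumes a hypothetical native helper has already handed over the EDX values. The bit positions used are the documented ones: in CPUID leaf 1, EDX bit 23 is MMX, bit 25 is SSE and bit 26 is SSE2, and in the AMD extended leaf 0x80000001, EDX bit 31 is 3DNOW.

```java
// Hypothetical decoder for CPUID feature flags obtained from native code.
final class CpuFeatures {

    // Bit positions in EDX of CPUID leaf 1.
    static final int MMX_BIT = 23;
    static final int SSE_BIT = 25;
    static final int SSE2_BIT = 26;

    // Bit position in EDX of CPUID extended leaf 0x80000001 (AMD).
    static final int AMD_3DNOW_BIT = 31;

    private static boolean hasBit(int register, int bit) {
        return ((register >>> bit) & 1) != 0;
    }

    static boolean hasMmx(int leaf1Edx) {
        return hasBit(leaf1Edx, MMX_BIT);
    }

    static boolean hasSse(int leaf1Edx) {
        return hasBit(leaf1Edx, SSE_BIT);
    }

    static boolean hasSse2(int leaf1Edx) {
        return hasBit(leaf1Edx, SSE2_BIT);
    }

    static boolean has3dNow(int extendedLeafEdx) {
        return hasBit(extendedLeafEdx, AMD_3DNOW_BIT);
    }

    public static void main(String[] args) {
        // Made-up register value with only the MMX and SSE bits set.
        int edx = (1 << MMX_BIT) | (1 << SSE_BIT);
        System.out.println("MMX=" + hasMmx(edx) + " SSE=" + hasSse(edx) + " SSE2=" + hasSse2(edx));
    }
}
```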
The Java JNI Tutorial provides a good introduction to the Java Invocation API, and on the Intel developer site and the AMD homepage you will find valuable information on how to deal with the various processors and their features.