Monday, January 12, 2015

Indy and CompileStatic as tag team to defeat array access times

Micro Benchmarks are Evil

They are evil, aren't they? You test a very specialized case that may have no relation to your everyday application at all. But they can show some weaknesses here and there. If they are relevant or not is a different question, that is most often answered with a "they are not". Still there are sometimes cases on which a language can improve upon. And one such case in Groovy is array access.

Array access in Groovy

For those not being aware of it, but Groovy does array access not like Java. Groovy allows the usage of a negative array index, in which case we go from the first element to the last. So a -1 denotes always the last element. Using -array.length on array will again result in a ArrayIndexOutOfBoundsException.

Benchmarking a little

To measure the extend of the problem I am using this little benchmark named fannkuch. It is based on the alioth shootouts Groovy version for fannkuch. Since I know Groovy will not perform very well on this, even with primitive optimization I don't expects too much checking this with none-static code. For those not knowing what primitive optimizations is... it is letting the compiler generate an alternative bytecode execution path based on primitives and the assumption that there are no meta class changes affecting primitives. To ensure this assumption is legal I am using guards.

fannkuch microbenchmark times in ms (JDK8_u25):
primopts Groovy12889.2358718+2718.9594152/-787.6735898
static Groovy3325.5838752+270.7189528/-266.2819292
Java561+85/-65

Which means even Groovy with primitive optimizations is slower by factor 23. Switching to @CompileStatic makes things look better, but there is still a factor 5.

Analyzing the results

Analyzing the generated bytecode will show us, that the @CompileStatic version is not doing anything strange compared to Java, only the array access parts are done different by using BytecodeInterface8 methods to access the arrays. primopts on the other hand show that besides the BytecodeInterface8 method usage, there is also dynamic access to arrays. This of course then means bad times, since beating primitives on the JVM is difficult with code, that cannot handle primitives all that well... like for example reflection.

So my next was to try if invokedynamic can improve the situation. It may at first look strange to use invokedynamic in static compiled code for something as static as this. We know all the types at compile time so a method call should be faster than any fancy thing invokedynamic could do, right? Wrong. Or I should say it depends. What we can do here is to give a very short path for the optimistic case of the array index being positive. In the original code this is done with a try-catch. But in terms of MethodHandles used by invokedynamic we can use a guard that checks the index for a positive value instead. MethodHandles do also provide an exception catching guard of some kind, but this has issues in terms of performance and how far the code can be optimized. In total the guard version has the big advantage of doing something the JVM would do anyway and thus potentially just remove the second check, making the first check very very cheap. The fallback of course is still as complex as before and there is no real speed improvement to be expected. Another part that should deserve consideration is that in invokedynamic a static call site is no where to be compared to a mutable callsite. Thanks to Java8 lambdas a lot of performance optimization effort has been going in making static callsites fast. And we have one here.

New results

This then resulted in PR #587 and updated times in our table:
primopts Groovy12889.2358718+2718.9594152/-787.6735898
static Groovy3325.5838752+270.7189528/-266.2819292
static Groovy with indy878.0258219+328.9714071/-134.1179639
Java561+85/-65

This indicates a mere slow done of 57% now. I think this is a great improvement... And while it would have been nice to be actually on par with Java here, I assume this can only be done by using Java's array access logic in the end. A slowdown like this is something I found already occurring if you check for a boolean in an if for example. So I doubt there is much more room of improvement.

As for primitive optimizations, after GROOVY-7249 and GROOVY-7251 we can also look forward to improvements in indy and normal primitive optimizations.

I will make a new blogpost of the results, once those are implemented