I mentioned in my previous post that I needed to do some serious performance optimizations on my Lua code, and indicated in an edit that after a day’s worth of effort, I managed to speed things up by roughly a factor of 13. I have some more work to do, but in the hopes that the steps I’ve taken so far are helpful to anyone else going through a similar process, allow me to describe some of them before they slip my mind.

Operator Overloads and Types

I love that Lua allows us to overload operators with metatable entries such as __add and __eq. It makes code look a lot cleaner without a mess of named function call syntax all over the place, especially in math-heavy code. But if one wants to support mixed types, then these operator functions will inevitably need to do some type-checking dispatch. I definitely wish to support mixed types, because I intend to use my fixed point number format all over the place, but would like to easily intersperse integer literals into the equations. For example, fixedPointVar + 1 should just work, producing a fixed point value one greater than the variable given, according to the proper fixed point math.
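To make the mixed-type idea concrete, here is a minimal plain-Lua sketch of a Q16.16 fixed point type with a coercing __add. The names (FP, fromNumber, coerce) are illustrative, not my actual code, and my real type is an ffi ctype rather than a table:

```lua
-- Minimal Q16.16 fixed point sketch with a mixed-type __add.
local FP = {}
FP.__index = FP

local ONE = 65536  -- 2^16, the Q16.16 scaling factor

local function fromNumber(n)
  return setmetatable({ raw = math.floor(n * ONE + 0.5) }, FP)
end

-- Coerce plain Lua numbers so that fixedPointVar + 1 just works.
local function coerce(v)
  if getmetatable(v) == FP then return v end
  return fromNumber(v)
end

FP.__add = function(a, b)
  a, b = coerce(a), coerce(b)
  return setmetatable({ raw = a.raw + b.raw }, FP)
end

function FP:ToNumber()
  return self.raw / ONE
end

local x = fromNumber(2.5)
local y = x + 1          -- mixed types: fixed point plus an integer literal
print(y:ToNumber())      --> 3.5
```

Lua invokes __add from either operand's metatable, so 1 + fixedPointVar works through the same path. The catch, as described next, is that the dispatch inside such operators has to be trace-friendly.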

My type checking code was pretty nasty, though, and produced a lot of trace aborts with the message “NYI: FastFunc ffi.meta.__tostring”. (“NYI” refers to Mike Pall’s list of operations that are Not Yet Implemented in the compiler, which fall back to the interpreter instead.) This was because I was looking up the appropriate functions using tostring(ffi.typeof(arg)) as a key.
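Reduced to its core, the abort-prone pattern looked something like this (LuaJIT-only, and the names here are illustrative rather than my exact code):

```lua
-- LuaJIT only: dispatch keyed on the *string* form of a ctype, which calls
-- the NYI ffi.meta.__tostring fast function on every single operation.
local ffi = require("ffi")

local fixed_t = ffi.typeof("struct { int32_t raw; }")

local addImpl = {}
addImpl[tostring(fixed_t)] = function(a, b)
  return fixed_t(a.raw + b.raw)
end

local function lookupAdd(arg)
  -- Inside a hot loop, this tostring() aborts the trace.
  return addImpl[tostring(ffi.typeof(arg))]
end
```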

The quick but ugly fix was to simply rely exclusively on functions that assumed both arguments were of the appropriate type. So instead of fixedPointVar + 1, I would declare a local variable outside of the loop, set it to the fixed point representation of 1, and then call fixedPointVar:Clone():Add(one) within the loop.

In many cases, I knew that the prior value before the mathematical operation would no longer be needed, so to save the JIT compiler from having to figure out allocation sinking, I could just remove the :Clone() call from the above, and the addition would merely be performed in place, storing the result in the original variable.
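The two call styles above can be sketched in plain Lua as follows (hypothetical names; the real type is an ffi ctype, but the pattern is the same):

```lua
-- Sketch of the "ugly but compilable" method-call style: no operator
-- dispatch, and in-place mutation when the prior value is dead.
local FP = {}
FP.__index = FP

local function fp(raw) return setmetatable({ raw = raw }, FP) end

function FP:Clone()
  return fp(self.raw)
end

function FP:Add(other)       -- in place: stores the result in self
  self.raw = self.raw + other.raw
  return self
end

local one = fp(65536)              -- fixed point 1, hoisted out of the loop
local acc = fp(0)
for i = 1, 3 do
  acc:Add(one)                     -- prior value no longer needed: mutate
end
local next = acc:Clone():Add(one)  -- prior value still needed: clone first
-- acc.raw is now 3 * 65536; next.raw is 4 * 65536, and acc is untouched
```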

Making the above changes throughout my code increased performance drastically, most notably because the JIT compiler was finally able to handle all of those inner loops which performed the kind of math which computers excel at handling, rather than always falling back to the interpreter and needing to do additional string-based table lookups.

I have since learned that I can use tonumber(ffi.typeof(arg)) for getting a key that uniquely represents a ctype, instead of getting a string representation of the ctype. I haven’t yet tested this, but I hope that it might allow me to return to using standard operators most of the time. I know LuaJIT is supposed to be excellent at optimizing heavily biased branches, so as long as the type checking doesn’t generate a trace abort, the compiled code might end up being plenty performant. Additionally, the allocation sinking optimization might be more than capable of avoiding the cost of temporaries produced by binary operators. My early tests that indicated otherwise were probably also running afoul of unrelated trace aborts, and as a consequence I may have become needlessly paranoid. I hope to return to this area of the code in the future and see if I can clean things up and retain the performance. Because honestly, the following code is just downright ugly:

return Vector3_Q1616_Cross(axis, position):Normalize():MulScalar(rate:Clone():Mul(positionAxisDistance))

While this equivalent expression is much nicer:

return Cross(axis, position):Normalize() * (rate * positionAxisDistance)
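If the tonumber(ffi.typeof(arg)) trick pans out, the dispatch inside the operator functions might look something like the following. This is an unverified assumption on my part (as noted above, I haven't tested it), and every name here is illustrative:

```lua
-- LuaJIT only, untested sketch: dispatch keyed on tonumber(ctype), a plain
-- number, so the lookup never touches the NYI __tostring fast function.
local ffi = require("ffi")

local fixed_t = ffi.typeof("struct { int32_t raw; }")

local fixedId = tonumber(fixed_t)   -- assumed: a unique numeric ctype id

local addImpl = {
  [fixedId] = function(a, b) return fixed_t(a.raw + b.raw) end,
}

local function Add(a, b)
  if type(b) == "number" then            -- fixedPointVar + 1 and friends
    return fixed_t(a.raw + b * 65536)
  end
  return addImpl[tonumber(ffi.typeof(b))](a, b)
end
```

With LuaJIT's handling of heavily biased branches, the type(b) check should be nearly free in loops where the types never vary.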

Tail Call Interference

One tiny little change that helped quite a bit was placing parentheses around the return value of Vector3:Clone(). To be completely honest, I’m only partially clear on why this affects LuaJIT. According to Programming in Lua, return (g(x)) prevents tail call optimizations, because the runtime has more work to do after calling g(x) but before returning from the outer function (the parentheses force possibly multiple return values to be reduced down to only one). But in this case, a tail call to a built-in function was interfering with LuaJIT’s trace. By forcing the code to not contain a tail call, LuaJIT was able to continue the trace and properly compile the code. I found the solution thanks to yet another comment by Mike Pall after searching for the trace abort message “leaving loop in root trace”.
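Reduced to a self-contained sketch (with Vector3 shown here as a plain Lua table rather than my actual ctype):

```lua
local Vector3 = {}
Vector3.__index = Vector3

function Vector3.New(x, y, z)
  return setmetatable({ x = x, y = y, z = z }, Vector3)
end

function Vector3:Clone()
  return Vector3.New(self.x, self.y, self.z)
end

-- Tail call: this is where the trace aborted with
-- "leaving loop in root trace".
function Vector3:CopyTail()
  return self:Clone()
end

-- Parenthesized: results are truncated to one value after the call, so it
-- is no longer a tail call, and LuaJIT can keep tracing through it.
function Vector3:Copy()
  return (self:Clone())
end
```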

Pointer Arithmetic on Data with Non-Power of 2 Size

This one was nasty, and the trace abort message looked nasty too: “bad argument type”. It turns out that LuaJIT really strongly dislikes taking the difference between two pointers that point to some type whose size is not a power of 2. In my case, the Vector3 data type was 12 bytes long, containing three fixed point components which were 4 bytes each. Once I figured out what the problem was, the fix was easy. In any case where I needed to take the difference of two pointers and get the distance in terms of element count, I needed to cast the pointers to byte pointers first, and then manually do the division by the element size. Conveniently, my type system was set up so that I could do this once when defining a particular array type, and therefore choose whether I needed to do the cast and division or not. The result was the following ElementDistance function that could be accessed as an upvalue by any metatable function that needed it (with a related byte distance function to go alongside it):
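In sketch form, the idea looks like this (a reconstruction with illustrative names, not my exact code, which is generated per array type):

```lua
-- LuaJIT only: element distance via byte pointers, avoiding the trace
-- abort on pointer subtraction over non-power-of-2 sized elements.
local ffi = require("ffi")

local bytePtr = ffi.typeof("uint8_t *")

-- Factory: build the distance function once, when defining an array type.
local function MakeElementDistance(elementType)
  local size = ffi.sizeof(elementType)
  return function(last, first)
    -- Subtracting the typed pointers directly aborts the trace for a
    -- 12-byte element, so subtract byte pointers and divide ourselves.
    return (ffi.cast(bytePtr, last) - ffi.cast(bytePtr, first)) / size
  end
end

-- A 12-byte stand-in for the fixed point Vector3:
local vec3_t = ffi.typeof("struct { int32_t x, y, z; }")
local ElementDistance = MakeElementDistance(vec3_t)
```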

Array Index Wrapping and Modulo

I had a particular function that needed array indexes to wrap around from the end back to the beginning. For situations where the array length is not guaranteed to be a power of 2, using an integer modulo operation is the standard choice. Lua, being designed entirely around floating point numbers until just recently, doesn’t have such an operation built in. I had been using the floating point equivalent, math.fmod(), but that was yet another small thing causing a trace abort, “NYI: FastFunc math.fmod”.

I suppose I could have tried to use index - math.floor(index / arraySize) * arraySize; that option didn’t cross my mind at the time, and may not have performed well anyway. Instead, I just implemented a quick lookup table, since I knew I was working with a very limited set of sizes (array sizes of 3, 5, and 6). While possibly still not optimal, it was enough to allow LuaJIT to compile the code in question, and thus netted me another significant improvement. I might at some point experiment with a fixed cdata array instead of a Lua table with integer keys. Or possibly using an if/elseif chain or gotos to emulate a switch statement, as LuaJIT might have some good tricks available for optimizing such constructs (like by creating a jump table).
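The lookup table approach, sketched for a single size (0-based indexes and size 5 here; in my code there is one such table per size):

```lua
-- Precompute wrapped indexes so the hot loop never calls math.fmod
-- (which is an NYI fast function that aborts the trace).
local wrapSize = 5
local wrap = {}
for i = 0, wrapSize * 2 - 1 do
  wrap[i] = i % wrapSize   -- the modulo runs once, outside the hot loop
end

-- In the inner loop, index + step is always below 2 * wrapSize, so a
-- single table read replaces the modulo:
local index = 3
local nextIndex = wrap[index + 4]   -- 7 wraps around to 2
```

This only works because the step added to the index is bounded, which keeps the table small; a general modulo would need the floor-based expression above.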

Improved Sort

In C++, I was able to depend upon std::sort() to be efficient, and to inline whatever comparison function I provided, if appropriate. When I ported the code to Lua, I could no longer rely on std::sort(), because the comparison function was no longer available to the C++ compiler, being that it was Lua code. I couldn’t rely on Lua’s built-in table.sort() either, because I wasn’t sorting a table; I was sorting cdata elements in an externally allocated array of memory. So I wrote my own sort function.

But unsurprisingly, my first sort function was pretty poor because I had simply been picking an easy-to-implement sort algorithm; my initial goal was just to test if the rest of my generation code was working properly at all. In fact, I had already switched algorithms once, from an incredibly lazy implementation of a bubble sort to a less lazy insertion sort.

After profiling revealed that my sorting (of roughly twenty-six thousand items) was taking up a huge proportion of execution time after the above optimizations had been applied, I decided that my basic insertion sort was the next item on the agenda to be handled. So I went ahead and wrote a hybrid merge and insertion sort algorithm (using S. Fisher’s code as a quick reference). The merge sort operates at the high level, and once the subsets generated by the merge sort are small enough, insertion sort is used instead, as it is quite efficient on small data sets. Nothing even remotely groundbreaking, but tried and tested.
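The hybrid can be sketched on a plain Lua array like this (my real version operates on cdata arrays, and INSERTION_THRESHOLD is a plausible cutoff rather than a tuned value):

```lua
-- Hybrid sort sketch: merge sort at the top, insertion sort for small runs.
local INSERTION_THRESHOLD = 16

local function insertionSort(a, lo, hi, less)
  for i = lo + 1, hi do
    local v = a[i]
    local j = i - 1
    while j >= lo and less(v, a[j]) do
      a[j + 1] = a[j]
      j = j - 1
    end
    a[j + 1] = v
  end
end

local function merge(a, tmp, lo, mid, hi, less)
  for i = lo, hi do tmp[i] = a[i] end
  local i, j = lo, mid + 1
  for k = lo, hi do
    if i > mid then
      a[k] = tmp[j]; j = j + 1
    elseif j > hi or not less(tmp[j], tmp[i]) then
      a[k] = tmp[i]; i = i + 1      -- take left on ties: stable
    else
      a[k] = tmp[j]; j = j + 1
    end
  end
end

local function mergeSort(a, tmp, lo, hi, less)
  if hi - lo < INSERTION_THRESHOLD then
    insertionSort(a, lo, hi, less)  -- small run: insertion sort wins
    return
  end
  local mid = math.floor((lo + hi) / 2)
  mergeSort(a, tmp, lo, mid, less)
  mergeSort(a, tmp, mid + 1, hi, less)
  merge(a, tmp, lo, mid, hi, less)
end

local function hybridSort(a, n, less)
  mergeSort(a, {}, 1, n, less)
end

-- Example: sort 40 numbers given in reverse order.
local data = {}
for i = 1, 40 do data[i] = 41 - i end
hybridSort(data, 40, function(x, y) return x < y end)
```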


What I have learned from this: Trace aborts in tight inner loops are evil, and O(n²) sort algorithms on large data sets aren’t very pleasant either. No surprise really, but it was good practice to actually use the LuaJIT extension module for displaying verbose information on the compiler state (-jv when running LuaJIT from the command line). Combined with some basic timing code, I was able to zero in on precisely what was giving LuaJIT a hard time, rather than just blindly speculating about what might be going on within that intimidating piece of awesome but complex computational machinery.

I’ll eventually be returning to this subject to delve deeper, to achieve even further performance improvements which I’m sure LuaJIT can offer, and hopefully to also clean up the increasingly ugly code style that I talked about in the first section. I have some low-hanging fruit too, such as the square root and trigonometric functions which are currently converting fixed point to floating point, applying the function, and then converting back to fixed point. Even the floor function currently does this because I was too lazy to figure out what a simple bitmask would do to a negative number in 2’s complement form (round toward zero or negative infinity; a quick search indicates the latter, which pleases me). But for the trickier optimizations, I’m curious about rummaging around in the jit.util.* extensions provided with LuaJIT. I bet there’s a lot of valuable information hiding in there.
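The bitmask floor, for the record, is a one-liner (this sketch assumes LuaJIT's bit library and a Q16.16 raw value; fixedFloor is my name for it, not existing code):

```lua
-- Masking off the fractional bits of a two's complement Q16.16 value
-- rounds toward negative infinity, exactly what I wanted from floor.
local bit = require("bit")   -- LuaJIT's bundled bit operations library

local FRAC_MASK = 0xFFFF     -- the low 16 bits hold the fraction

local function fixedFloor(raw)
  return bit.band(raw, bit.bnot(FRAC_MASK))
end

-- -1.5 in Q16.16 is -98304; the mask yields -131072, which is -2.0.
```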
