It took me three weeks of design, redesign, more redesign, lots of implementing scattered within, and three intense days of debugging a giant mass of previously untested code, but I finally have a basic modular system in place for running all procedural generation from Lua. This will enable me (and eventually anyone else) to quickly experiment with various algorithms for every stage of planet creation and presentation.

Unfortunately, I have a lot of optimizing investigations to do, because it seems to be running about 100 times slower than the prior C++ code. But at least it generates the exact same planet (give or take a few subtle discrepancies due to slightly different math here and there). Based on some of my earlier experiments at the beginning of the month, I’m pretty sure I can bring that up to at least 10% of the speed of my C++ implementation, and quite possibly to 50% of its speed. Just need to profile and figure out the bottlenecks. (Edit: A day’s worth of investigation has gotten me up to around 13%, or 7.5 times slower than the C++ implementation. That should be acceptable for the moment.)

A cool thing about the architecture I ended up with is that not only will it naturally support a rich modularity of hooking up algorithms at different stages to each other, but that the way this modularity is exposed will also automatically enable a significant degree of concurrent execution on multiple cores with little to no effort on the part of the script writer. Right now I have only implemented a single threaded execution model, but I should be able to change these details under the hood when I get to that stage in the project, and the Lua scripts won’t know the difference. If you’re curious, allow me to provide an overview of how I’ve designed this modularity and concurrency.

At the heart of this design is what I am boringly calling a module. Each module describes the type of resources that it works with, and the nature of how it works with them. On a basic level, these are a module’s inputs and outputs. But to give the execution engine more details to work with, I offer more granular usages. So far I have read, create, write (to an existing resource), and modify (read & write). Additionally, each of these can be specified with per-element usage, indicating that the resource is a collection, and each element in the collection is needed independently of the rest.

The type of each resource is given by a combination of its storage type and semantics that describe what the resource’s data represents. So far the semantics are just a string, though I’m aware of some important features that will probably require some greater sophistication.

Here’s an example of a module definition that computes the centroid of each triangle in a mesh:
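(The original listing appears to be missing here. As a rough sketch of what such a definition might look like, based only on the description in this post: the `DefineModule` name, the usage and storage keywords, and the semantics strings are all my own illustrative guesses, not the actual API.)

```lua
-- Hypothetical sketch of a module definition. Each resource entry pairs a
-- storage type with a semantics string, plus a usage that tells the
-- execution engine how the resource will be accessed.
DefineModule
{
	name = "CalculateMeshFaceCentroids",

	-- Faces are read one element at a time; each face is a three-element
	-- array of indices into the vertex position array.
	{ usage = "read-element", storage = "array", semantics = "mesh.face.vertex-indices" },

	-- The entire vertex position array must remain available and
	-- unmodified throughout the operation.
	{ usage = "read", storage = "array", semantics = "mesh.vertex.position.q1616" },

	-- Centroids are created one element at a time, one per face.
	{ usage = "create-element", storage = "array", semantics = "mesh.face.centroid.q1616" },

	-- The function that actually performs the operation, defined elsewhere.
	execute = CalculateMeshFaceCentroids,
}
```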

The faces are each just a three-element array of indices referencing into the vertex position array. (The number type I’m using here, Q1616, is a fixed point type with 16 bits for the integral portion and 16 bits for the fractional portion.) This module states that it only needs to read one face at a time and creates one centroid at a time, though it needs the entire vertex position array available and unmodified throughout the entire operation, since each face could refer to potentially any vertex in the array. The actual execution of the operation is implemented elsewhere, in the function CalculateMeshFaceCentroids(). When executed, instances of the three resources described will be acquired from earlier modules or created, and will be passed to the function in the order in which they were defined above. The code ends up looking something like this:
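(The original listing is also missing here. A minimal sketch of the straightforward single-threaded version, under the assumption that the Q1616 position type overloads `+` and `/` through its metatable, as discussed later in the post:)

```lua
-- Hypothetical implementation: receives the three resource instances in
-- the order they were defined in the module, and loops over every face.
function CalculateMeshFaceCentroids(faceVertexIndices, vertexPositions, faceCentroids)
	for i, face in ipairs(faceVertexIndices) do
		local p0 = vertexPositions[face[1]]
		local p1 = vertexPositions[face[2]]
		local p2 = vertexPositions[face[3]]
		-- Assumes vector addition and scalar division are provided by the
		-- Q1616 vector type's metatable.
		faceCentroids[i] = (p0 + p1 + p2) / 3
	end
end
```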

This isn’t so awful, but it’s leaving a lot of parallelism unutilized. This is essentially just a map operation, mapping each input element to exactly one output element, and is disgustingly parallelizable. So I’ve provided an object that can wrap the details of a particular map operation and return it to the execution engine to execute as it sees fit. As I stated earlier, I’m not taking advantage of it yet, but the script can already be written such that it will take advantage of it when it’s available. The centroid function could be rewritten as follows:
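(Again, the original listing is missing; here is a hedged sketch of what the map-based version might look like. The `MapOperation` wrapper and its field names are illustrative assumptions based on the description that follows.)

```lua
-- Hypothetical map-based rewrite: instead of looping itself, the function
-- returns a map operation that the execution engine can run however it
-- sees fit, including concurrently once that is implemented.
function CalculateMeshFaceCentroids(faceVertexIndices, vertexPositions, faceCentroids)
	return MapOperation
	{
		input = faceVertexIndices,
		output = faceCentroids,

		-- The action receives a single face; vertexPositions is captured
		-- as an upvalue of the closure, and the returned centroid is
		-- inserted into the output array by the execution engine.
		action = function (face)
			local p0 = vertexPositions[face[1]]
			local p1 = vertexPositions[face[2]]
			local p2 = vertexPositions[face[3]]
			return (p0 + p1 + p2) / 3
		end,
	}
end
```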

The map operation is told what its per-element inputs and outputs are, and what action to take with each input to generate each output. The inner action only needs to receive one input, a single face, since the closure captures the vertex position array as an upvalue. The return value of this inner action is then automatically inserted into the face centroids array behind the scenes by the execution engine that is executing the map operation.

You’ll note that this design also works incredibly well with the data-oriented design paradigm that I’m trying to utilize more and more in my software architecture. In particular, as above, I plan on using the struct-of-arrays pattern more often than the common array-of-structs that is prevalent in a lot of object-oriented design. This makes it easier for each module to access precisely and only the data that it needs, increasing opportunities for executing different modules concurrently. It will also improve utilization of the cache since loops won’t be iterating over large arrays that contain a lot of cold data, but will only be iterating over tight arrays of hot data.
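To make the contrast concrete, here is a small illustrative sketch (field and variable names are my own) of the two layouts in Lua:

```lua
-- Array-of-structs: every element carries all its fields, so a loop that
-- only needs positions still drags normals and uvs through the cache.
local verticesAoS = {
	{ position = posA, normal = normA, uv = uvA },
	{ position = posB, normal = normB, uv = uvB },
}

-- Struct-of-arrays: each field lives in its own tight array, so a module
-- can declare and touch only the data it actually needs.
local verticesSoA = {
	positions = { posA, posB },
	normals   = { normA, normB },
	uvs       = { uvA, uvB },
}
```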

We’ll see how it pans out over the coming months, but I’m feeling pretty pleased so far with the overall design.

I’m also very pleased with how my data type utilities turned out, with the aid of syntactic sugar from Lua’s metatable system. I want to go into some detail about how I implemented these types, but I should save that for another post, especially since some of the details are likely to change over the next week as I poke aggressively at the performance issues. I know LuaJIT is capable of some incredible efficiency for a dynamic language, but I’m probably abusing it in a variety of ways that will hopefully make themselves known and be resolvable without ruining my design, which is currently relatively clean.


Comment by Nasarius — 2015/01/22 @ 19:43

Are you doing a lot of calls into C code? LuaJIT is generally really good with pure Lua, but it can’t do much with foreign function calls. Of course, LuaJIT may have other weaknesses I’m unaware of, but 10x slower is pretty far outside the norm.

Comment by Andy Gainey — 2015/01/23 @ 10:03

@Nasarius, Thanks for the tip! For the most part I’m successfully remaining fully within the Lua environment, avoiding calling out to C. What I saw a lot yesterday were a wide variety of odds and ends that were causing trace aborts, and the interpreted code was running drastically slower than when it could be successfully compiled. If I were just using doubles, I suspect I would have avoided many of the problems. But since I’m using a fixed point representation, I have to be a bit more careful.

At the beginning of the day, each iteration of building a planet with my current settings took around 5 seconds. Each optimization step I took focused on one or two stages of generation at a time, and often shaved off as much as half a second from the run time. I’m now down to around 0.3 seconds per iteration, and I wouldn’t be surprised if I can find one or two remaining culprits: small portions of the logic that are taking a disproportionately large chunk of time to execute. If I can take out another 0.2 to 0.25 seconds, that’ll get me to the 50% speed of compiled code that I’m aiming for.

A prime suspect at this point is my temporary use of math functions like sqrt() and sin(). For these I’m currently converting my fixed point numbers to double, calling the math function, and then converting back to fixed point. I suspect I’ll see some notable improvements once I provide native fixed point implementations, but that’ll take more time than the optimizations I’ve already done.
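The round trip described above might look something like this (a hypothetical sketch assuming a raw Q16.16 integer representation; the helper name and scaling details are my own, not the actual code):

```lua
-- Temporary approach: convert Q16.16 to double, use the standard math
-- library, then convert back to Q16.16 with rounding.
local function fixedSqrt(q)
	local d = q / 65536                            -- Q16.16 raw value -> double
	return math.floor(math.sqrt(d) * 65536 + 0.5)  -- double -> Q16.16 raw value
end
```

A native fixed-point square root would avoid both conversions and the intermediate double entirely.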

In the end I’ll probably need to start looking at the generated bytecode. It’s not entirely clear to me at the moment when LuaJIT is willing to use integers and when it is using doubles, so I don’t know if my usage patterns with array indexes are causing a lot of conversions between the two. That could introduce a fair bit of useless overhead. There might be other surprises that I’m not even thinking of either, but are possibly obvious and easy to fix once realized.
