
I recently ran across a couple of blog articles describing a data structure that provides a set of properties I often find myself seeking:

  1. Contiguous data storage, good for linear iteration and maximizing cache utilization
  2. Constant-time insertion, deletion, and relocation of elements
  3. Stable external references to elements regardless of internal relocation
  4. Constant-time look-up from external references without hashing

(The two articles were from the makers of the Molecule Engine and the BitSquid engine, respectively.)

Fascinated by data structures and algorithms in general, and lured by the appeal of these properties, I set out to study the concept and implement my own version.  It’s quite simple to implement in a basic form to achieve the above four properties, and can easily be extended to acquire some additional behaviors at the cost of additional memory or extra operations.  In this first article, I’ll walk through the basic concept, as well as a simple implementation in JavaScript.  Later articles will explore some of the possible extensions.

The central principle behind this data structure is double indirection.  A single level of indirection, such as a raw pointer or an index into an array, is fast and simple, but any attempt to move the data elsewhere will suddenly and silently invalidate all direct references that might be stored and used elsewhere in code.  But if all references merely refer to a single direct reference which serves as an intermediary, then moving the data merely requires also updating this intermediary.  Look-ups from external references under this model then need only do two simple look-ups instead of one, first to get the direct reference from the indirect reference, and then to get the data itself from the direct reference.
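As a quick illustration of the principle (the variable names here are mine, chosen purely for this example):

```javascript
// Double indirection in miniature: external code holds only a handle into
// the intermediary array; moving the data requires updating a single entry.
var data = ["a", "b", "c"];
var indexes = [0, 1, 2]; // intermediaries: handle -> position in data
var handle = 2;          // a stable external reference to "c"

console.log(data[indexes[handle]]); // two look-ups: prints "c"

// Move "c" into slot 0 and shrink the array; only the intermediary changes.
data[0] = data[2];
data.pop();
indexes[handle] = 0;

console.log(data[indexes[handle]]); // still prints "c"
```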

We’ll start with a simple dynamic array and implement constant-time insertion and deletion functions, which will give us the first two properties listed above.  Then we can take advantage of double indirection to gain the third and fourth properties without losing the first two.  It will require some additional memory and some simple maintenance computations, but nothing severe, and certainly nothing that extends beyond a constant size cost per element or time cost per operation.

Let’s first look at a simple dynamic array, written in JavaScript.  The language of course already has its own dynamic array, but I’ll start by building a wrapper around this array, as we’ll obviously be extending it shortly.
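A minimal sketch of this first version might look like the following (add() and remove() match the names used below; get() is my own name for the look-up method):

```javascript
// First version: a thin wrapper around a plain array.  The handle returned
// by add() is just the element's index in the internal array.
function ObjectPool() {
    this.objects = [];
}

// Appends an object and returns its index as an opaque id.
ObjectPool.prototype.add = function (obj) {
    this.objects.push(obj);
    return this.objects.length - 1;
};

// Direct look-up: the id is itself the index.
ObjectPool.prototype.get = function (id) {
    return this.objects[id];
};

// Constant-time removal: fill the hole with the last element, then pop.
ObjectPool.prototype.remove = function (id) {
    this.objects[id] = this.objects[this.objects.length - 1];
    this.objects.pop();
};
```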

There are two key features in this first version I want to point out explicitly.  First is that the handle returned to external code is abstracted as an opaque id, but is technically just an integer used to access the internal array directly.  Second is how the remove() function provides a constant-time implementation while keeping everything packed tightly in the array without any holes.  Removing anything but the last item is naturally going to create a hole, so we simply fill that hole with the element that was previously at the end, and then pop the last item off.  An obvious drawback with this implementation is that the order of elements is not stable whenever remove() is called, a behavior that will remain as we extend the data structure below.  This can be disastrous in some contexts, so keep that in mind when considering the use of this data structure.

In fact, this behavior is already disastrous, since the ids returned by add() are no longer reliable after one or more calls to remove()!  Let’s begin to add that double indirection to solve this problem.  We’ll create a second internal array which stores indexes into the object array, and which never rearranges its elements.  It may develop holes, but any indexes into this second array will be guaranteed to remain valid as long as the underlying object itself isn’t removed.
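Extending the sketch, this second version might look something like the following:

```javascript
// Second version: handles are now slots in the indexes array, which in turn
// holds indexes into the tightly packed objects array.
function ObjectPool() {
    this.objects = [];
    this.indexes = [];
}

ObjectPool.prototype.add = function (obj) {
    this.objects.push(obj);
    this.indexes.push(this.objects.length - 1);
    return this.indexes.length - 1; // the handle refers to a slot in indexes
};

// Two look-ups: handle -> dense index -> object.
ObjectPool.prototype.get = function (id) {
    return this.objects[this.indexes[id]];
};

ObjectPool.prototype.remove = function (id) {
    var index = this.indexes[id];
    // WARNING: broken!  The element moved here from the end still has an
    // entry in indexes pointing at its old position.
    this.objects[index] = this.objects[this.objects.length - 1];
    this.objects.pop();
    // WARNING: this slot becomes a permanent hole; indexes never shrinks.
    this.indexes[id] = -1;
};
```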

This is moving us in the right direction, but as the warning comments indicate, we still have some serious problems.  Let’s tackle the issue with remove() first, since it will literally produce invalid behavior as written above.  The issue is that the element at the end of the array is moving to a new location in the array, and thus the value in the indexes array that refers to it needs to be updated to refer to the new index (that is, the index of the item that just got removed).  But how do we find that index in order to update it?  We could scan the entire indexes array for the element that refers to the last index of the objects array, but that would give the remove() function linear-time rather than constant-time complexity.

Instead, we’ll just add a third array, one that maps indexes of the objects array back to indexes of the indexes array.  And since this terminology is getting crazy, I’ll clean it up by noting that we essentially have two sets of indexes:  Those that are used to access the objects array directly are dense indexes, as they are indexing into a tightly packed array with no holes.  The doubly indirect indexes, on the other hand, are used to access the indexes array, which might contain holes, so we’ll call those sparse indexes.  The two arrays can then be renamed to sparseToDense and denseToSparse, since they essentially form a two-way map between sparse and dense indexes.
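In sketch form, the two-way map version might look like this:

```javascript
// Third version: sparseToDense and denseToSparse form a two-way map between
// sparse (stable) indexes and dense (packed) indexes.  Handles are sparse.
function ObjectPool() {
    this.objects = [];
    this.sparseToDense = [];
    this.denseToSparse = [];
}

ObjectPool.prototype.add = function (obj) {
    var denseIndex = this.objects.length;
    this.objects.push(obj);
    this.sparseToDense.push(denseIndex);
    var sparseIndex = this.sparseToDense.length - 1;
    this.denseToSparse.push(sparseIndex);
    return sparseIndex;
};

ObjectPool.prototype.get = function (id) {
    return this.objects[this.sparseToDense[id]];
};

ObjectPool.prototype.remove = function (id) {
    var denseIndex = this.sparseToDense[id];
    var lastDenseIndex = this.objects.length - 1;
    // Move the last element into the hole, then use the reverse map to find
    // and repair the sparse entry that referred to the moved element.
    this.objects[denseIndex] = this.objects[lastDenseIndex];
    var movedSparseIndex = this.denseToSparse[lastDenseIndex];
    this.sparseToDense[movedSparseIndex] = denseIndex;
    this.denseToSparse[denseIndex] = movedSparseIndex;
    this.objects.pop();
    this.denseToSparse.pop();
    this.sparseToDense[id] = -1; // still a permanent hole; fixed next
};
```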

Getting closer.  Now let’s finish off the last problem: the sparseToDense array grows every time a new object is added, but never shrinks, no matter how many objects are removed.  We can utilize what is known as a free list, often used for memory pools.  Since what we are interested in is specifically the collection of slots in the sparseToDense array that aren’t in use, we are in fact free to use those slots themselves to store that data, in the form of a singly linked list.  The slots are already generally filling the role of storing indexes that reference an array.  Why not just put them to use referencing the next free element in the sparseToDense array whenever they’re not in use to reference the objects array?  All we would need then is a single head reference for the linked list, which we can store in the ObjectPool instance.
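A sketch of the final version, with the free list woven into the unused slots (firstFreeSparseIndex is my own name for the head reference, with -1 marking the end of the list):

```javascript
// Final version: unused sparseToDense slots chain together into a singly
// linked free list, so removed handles' slots can be reused by add().
function ObjectPool() {
    this.objects = [];
    this.sparseToDense = [];
    this.denseToSparse = [];
    this.firstFreeSparseIndex = -1; // empty free list
}

ObjectPool.prototype.add = function (obj) {
    var denseIndex = this.objects.length;
    this.objects.push(obj);
    var sparseIndex;
    if (this.firstFreeSparseIndex !== -1) {
        // Reuse a free slot, popping it off the head of the free list.
        sparseIndex = this.firstFreeSparseIndex;
        this.firstFreeSparseIndex = this.sparseToDense[sparseIndex];
        this.sparseToDense[sparseIndex] = denseIndex;
    } else {
        // No free slots, so grow the sparse array.
        this.sparseToDense.push(denseIndex);
        sparseIndex = this.sparseToDense.length - 1;
    }
    this.denseToSparse.push(sparseIndex);
    return sparseIndex;
};

ObjectPool.prototype.get = function (id) {
    return this.objects[this.sparseToDense[id]];
};

ObjectPool.prototype.remove = function (id) {
    var denseIndex = this.sparseToDense[id];
    var lastDenseIndex = this.objects.length - 1;
    this.objects[denseIndex] = this.objects[lastDenseIndex];
    var movedSparseIndex = this.denseToSparse[lastDenseIndex];
    this.sparseToDense[movedSparseIndex] = denseIndex;
    this.denseToSparse[denseIndex] = movedSparseIndex;
    this.objects.pop();
    this.denseToSparse.pop();
    // Push the freed sparse slot onto the head of the free list.
    this.sparseToDense[id] = this.firstFreeSparseIndex;
    this.firstFreeSparseIndex = id;
};
```

A quick sanity check of the whole thing:

```javascript
var pool = new ObjectPool();
var a = pool.add({ name: "a" });
var b = pool.add({ name: "b" });
pool.remove(a);                  // "b" moves into "a"'s old dense slot
var c = pool.add({ name: "c" }); // reuses "a"'s freed sparse slot
console.log(pool.get(b).name);   // "b": the handle survived the shuffle
```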

That should do it!  We now have a data structure that can store items contiguously, supports constant-time insertion (at the end) and deletion (from anywhere), and provides external references that are efficient and remain stable for the life of the element they reference.  Granted, the contiguous storage is likely to be less relevant in a language like JavaScript, where the array will most likely be stored in memory as a contiguous array of pointers to objects.  But the data structure is sound, and should carry over very well to statically typed languages like C++ that could benefit even more significantly from the data locality.

Next time I’ll add some extra behaviors to this data structure that require no changes to the above core, but will allow for pairs of elements to be swapped.  And this operation in turn enables the entire data structure to be sorted according to any desired criteria, still without invalidating any external references.  Later articles might include subjects such as implementing this in C++ or improving the external references so that they remain valid (but essentially null) even after the object has been removed, more or less allowing them to act as safe weak references.


2 Comments

Comment by Brandon Petty — 2014/08/06 @ 23:55

This is really cool. I wonder how parts of it perform compared to other data structures like vector, list, deque, etc. Obviously they don’t hold all of those properties, but you can insert, delete, iterate, etc.

If I am iterating over the structure, can I delete elements of my structure (like the current one) and keep iterating?

Comment by Andy Gainey — 2014/08/09 @ 23:50

I’m working on a C++ version right now. Might do some measurements in comparison to standard containers when I have it working well enough. But it’ll take a bit of effort to figure out what to compare, since this container doesn’t provide exactly the same interface (e.g., no insert, only a push_back) or guarantees (erase won’t preserve order of other elements). It offers a mix of advantages from a bulk allocation container like vector, along with some of the advantages of node allocation containers like a linked list or tree, so that might guide the types of comparisons I try out.

In general, in-order traversal is as simple as a linear traversal of a vector, and push_back is similar to vector push_back, but needs to touch three containers instead of one. As for erase, it has constant time complexity (no matter the position of the element erased), but also needs to touch all three containers, as well as perform a copy/move of one element, which could be costly depending on the type. A lookup from a stable handle to a direct reference is also constant time, but requires indexing into two arrays in sequence. Going from a direct reference to a stable handle is the same, except that it only requires one array index operation.

As for modifying the container while simultaneously iterating over it, the common types of operations (erasing the current element or adding a new element to the end) should work just fine. I talk about that a bit in my latest post. http://experilous.com/1/blog/post/dense-dynamic-arrays-with-stable-handles-part-2
