After fighting with some silly bugs, and fiddling with map generation and the data structures used to represent the map, I finally reached the point last week where I could begin to assess performance. It was immediately obvious that the pathfinding algorithm was the bane of my program’s execution. Something absurd like 98% of the running time was dedicated to pathfinding. The subsequent operation of following the paths found and adding to congestion was trivial in comparison.
To more effectively focus on just pathfinding, I created a new project that was designed to run two different pathfinding algorithms over the exact same list of origin/destination pairs, and collect both timing data and algorithmic data on each. One of the pathfinding algorithms would generally be a baseline that I had already tested, and the second would involve some form of experimental tweaking.
The tweaks came in two flavors, which I’ll call technical and algorithmic. The technical tweaks were not intended to change behavior significantly, but were only for improving the performance of the operations already being done. The algorithmic tweaks on the other hand were intended to make the A* algorithm behave very differently, with the intention of making it intelligently consider fewer nodes along the way to finding an acceptable path. (Under normal operation for the types of maps I’m dealing with, a basic A* algorithm will consider a large number of nodes that it eventually decides aren’t part of the final best path. It would be nice to just ignore those nodes altogether, but to do that perfectly, the algorithm would need to see into the future.)
By the end, I was able to calculate around 12,000 paths per second on a grid-like map of highways, major roads, and minor roads that was roughly 100 square miles (260 square kilometers) in size. Each path on average was about 7 miles long, and took roughly 7.5 minutes to travel. Larger maps shouldn’t really affect the speed nearly as much as longer paths would. As long as most paths stay around 7 miles in length, the same speeds should remain achievable on larger maps. (The bigger difficulty would largely come from more paths needing to be computed, as there would probably be more people moving around.)
Below is a (perhaps excessively detailed) chronology of tweaks that I went through over the past week. As a summary, though, I’ll note the following takeaways:
- Being willing to accept a less-than-perfect pathfinder and thus use pessimistic estimations (i.e., an inadmissible heuristic) can save a large amount of computation time while only mildly affecting the quality of the paths found.
- The method of estimation doesn’t need to be fixed. Changing the way the estimation works depending on the intermediate state of the pathfinding algorithm can provide the developer with a lot of control over how the algorithm chooses which nodes to consider and when.
- As long as it doesn’t take too much time, it often doesn’t hurt to try writing your own implementation of a particular data structure or algorithm instead of just relying on whatever is provided by the standard. Generic code often has to make compromises in order to work reasonably well under a wide variety of possible conditions.
- Having a good framework available for examining what the A* algorithm is trying to do is vitally important. In my case, I was able to see the cost of the final paths, number of priority queue pushes and pops, queue size, hash table lookups, and similar stats averaged over all the paths, and compare numbers between a baseline pathfinder and some experiment. I wish I had gone further so that I could visualize the algorithm over time, but I was able to make reasonable progress without it, and the visualization would have taken more time to implement.
- Make sure you know what your tweaks are actually doing, or have numbers and/or visualizations like those just mentioned so that you can verify each tweak is doing the right thing. (See item #10 below for what can happen if you don’t.)
Now then, on with the extensive details of all my tweaks and experiments:
#1 (technical, no improvement) In my initial representation, every node contained a pointer to an array of outgoing edges, and a count of how many edges were in the array. This meant that each time the A* algorithm visited a node and wanted to consider where to go next, it had to follow the pointer to a new location in memory, possibly causing the CPU to wait for the data to be fetched from memory if it wasn’t already in an appropriate cache. It additionally had to loop over a variable number of edges, which could potentially throw off the CPU’s branch prediction logic, causing frequent pipeline flushes. This was all very theoretical, but it concerned me nonetheless. So I instead made every single node directly contain exactly three outgoing edges, with no need to follow pointers, and no need to check how many edges there were. (If a node normally contained fewer than three edges, the excess edges would all point to a fake target node and would have an infinite cost.)
The result: No benefit that I could tell. In fact, following all the infinite-cost edges was a problem, because it bloated the number of items in the queue of nodes to be checked, which also increased the cost of pulling nodes off of the queue. So I added a check that would quit looking at edges as soon as it encountered the first infinity-cost edge. But I left the consolidated memory representation, because that still made sense to me, even if I couldn’t see the benefit.
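The consolidated layout plus that early-exit check might look roughly like this (a minimal sketch with names of my own choosing, not the actual implementation):

```cpp
#include <array>
#include <cstdint>
#include <limits>

const std::uint32_t kSentinelNode = 0xFFFFFFFFu;  // the fake target node
const float kInfiniteCost = std::numeric_limits<float>::infinity();

struct Edge {
    std::uint32_t target;  // index of the destination node
    float cost;            // travel cost; infinity marks a padding edge
};

struct Node {
    std::array<Edge, 3> edges;  // stored inline: no pointer chase
};

// Real edges are packed at the front, so stop at the first padding edge.
template <typename Visit>
void forEachRealEdge(const Node& n, Visit visit) {
    for (const Edge& e : n.edges) {
        if (e.cost == kInfiniteCost) break;
        visit(e);
    }
}
```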
#2 (algorithmic, 3x improvement) If you want to find the absolute fastest path from one node to another, then your A* algorithm cannot be even slightly pessimistic when it estimates how long it will take to get from an intermediate node to the destination node. This is fine when your fastest edges are also your most common. This is not true in a city, however; in fact it’s the opposite. The fastest edges are the rarest. You’ll have tons of side roads and suburbs and whatnot, a smaller number of major roads and avenues, even fewer highways, and if you’re lucky, a bullet train connecting a few key areas. But the A* algorithm has to assume, from wherever it is in the map halfway through execution, that traveling a few more feet down some edge might land it on a highway or a bullet train that goes directly to the desired destination. The result is that before it commits to going down the avenue that goes in more or less the correct direction, it desperately searches to the sides and even backwards for that hypothetical highway or bullet train. Only once it has satisfied itself that it isn’t going to find such a route does it give up and accept the avenue.
A cool thing about the A* algorithm is that you have some interesting tradeoffs that can be made. And they aren’t binary tradeoffs, but can have all sorts of nuances. When estimating the remaining cost to get to a destination, for example, you can make the estimate pessimistic, but the consequence is that you’re no longer guaranteed to get the fastest path. You’ll get a valid path, and it might even be a pretty good path as long as you’re not too egregiously inaccurate with the estimate, but it just won’t likely be the best one. Well, that’s fine for my purposes, so how would I want to alter the estimate?
I chose to use the speed of the most recently traveled edge and the remaining distance to the destination (as the crow flies). While it is true that you as a traveler might get lucky and find that bullet train 20 feet away, most likely you’re just going to continue to travel on roads that are just as fast as the one you find yourself on right now. This simple change resulted in a 3x speed boost, and as far as I could tell, did not significantly harm the quality of the paths found. It made the pathfinder prefer to jump onto higher speed networks, and stay on them as long as possible before finally getting off to head for the destination. Sounds a bit like what a normal commuter would do.
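As a sketch of the idea (the names and signatures here are mine, not from the original code), the only change is which speed divides the straight-line distance:

```cpp
#include <cmath>

struct Point { float x, y; };

float euclidean(Point a, Point b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy);
}

// Admissible: assume the fastest speed found anywhere on the map.
float optimisticEstimate(Point node, Point goal, float maxMapSpeed) {
    return euclidean(node, goal) / maxMapSpeed;  // speeds assumed positive
}

// Inadmissible: assume the speed of the edge just traveled, which is a
// far better predictor in a city where the fast edges are rare.
float pessimisticEstimate(Point node, Point goal, float lastEdgeSpeed) {
    return euclidean(node, goal) / lastEdgeSpeed;
}
```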
#3 (technical, failure) My priority queue was at this stage just using the standard push_heap() and pop_heap() functions on a vector, designed to keep an array of data just barely sorted enough that it is always easy to find the item in the queue with the highest priority. But profiling revealed that I was still spending a lot of my computation time pushing to and popping from the queue. It seemed to me that most of the items that went into the queue never came back out (the algorithm would find a good path before it had to consider many of the less promising alternatives). So I tried to implement a different priority queue that would only keep a small number of the best items sorted, and everything else it would dump into a junk array in no particular order. As long as I was pulling items from the sorted array, things would be snappy, but trying to find the next best item in the unsorted array would be very slow. Long story short, my assumptions were incorrect, and this was slower, regardless of how large or small I made the high-priority sorted array.
#4 (technical, 3x improvement) Every time I was looking for a new path between a new pair of nodes, I was allocating a bunch of memory for state information and initializing it all. In particular, I had to specify for all nodes that I had not visited them yet, nor were they in the queue waiting to be visited. C++ was doing this for me automatically by simply setting the value of all allocated bytes to 0, including all the state information that didn’t need to be initialized, since it would just get written over before ever being read anyway. Well, between the memory allocation (which is rarely a free operation) and the initialization, I was wasting a lot of time. So I rearranged things to only allocate memory once, and kept the visited/pending values separate from all the other state information so I could efficiently initialize only those values. That gave me another 3x speedup.
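A rough sketch of the allocate-once arrangement, as I understand my own description (field names are illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Per-node search state, sized once and reused across searches. Only the
// one-byte visited/pending flags are cleared between searches; the bulky
// fields are always written before they are read, so they can stay dirty.
struct SearchState {
    std::vector<std::uint8_t> visited;
    std::vector<float> costSoFar;
    std::vector<std::uint32_t> cameFrom;

    explicit SearchState(std::size_t nodeCount)
        : visited(nodeCount, 0), costSoFar(nodeCount), cameFrom(nodeCount) {}

    // Reset touches one byte per node instead of the whole state.
    void reset() { std::memset(visited.data(), 0, visited.size()); }
};
```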
#5 (technical, failure?) Up to this point, I was maintaining pathfinding state information for every single node in the whole map. This was fine on artificially small maps, but for large maps, I knew that this simply would not work. Even with the initialization improvements above, I’d still need to initialize some data for all nodes in the map. It also just used up a large amount of memory. Not only was this a waste, but it also decreased spatial locality (accessed memory was frequently not near other recently accessed memory, causing more painful cache misses).
The trick is to be able to take any arbitrary node and quickly figure out where the pathfinding information for it is stored. Arrays made this super efficient because I could just use the index from one array and look up the data in another array at the same index, but arrays were the cause of the memory woes just described. What I really needed was an associative map with a more compact storage scheme. The two contestants are generally balanced binary trees (usually not too shabby) or hash tables (can be utterly awesome sometimes). I went with the hopeful awesomeness of a hash table.
If I recall, it was a bit slower, but I wasn’t totally bummed, because I knew it would scale with larger maps. No matter how large my map got, I knew that I would need to initialize very little data and use very little memory. And the amount of memory used would be proportional to the length of my paths, not proportional to the size of the map. This was a valuable benefit, so I accepted a bit of a slowdown on my stupidly small map.
#6 (algorithmic, small improvement) One of the most annoying parts of the A* algorithm as formally specified is that when you’re considering following an edge to another node, you have to figure out if that target node is already in your queue pending a visit. If it is, then that essentially means that you’ve just found a second (or third or fourth) way to get from your origin to this intermediate node, and you have to see if this new way is better than the previous way. If so, you need to update the pathfinding state with this new information, and then move the node up in the priority queue since it now has a higher priority than previously.
All of this is typically a pain for most priority queues. I chose to just skip it. Doesn’t matter if a node is already in the queue, just stick it in the queue a second time. If the former path was better, it’ll get examined first. If the new route is better, it’ll get examined first. If both ways suck in the grand scheme of the overall path, then perhaps neither will get examined. But in the worst case they’ll both get examined. Oh well.
Skipping this check also meant that I no longer needed to save easy-to-find information about whether a node was in the queue or not. So I eliminated that data, saving even more memory and more time with initialization. The improvement was only marginal, but it made the code a lot simpler, so I was happy.
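This queue discipline is sometimes called lazy deletion, and it might be sketched like this (my reconstruction, not the original code):

```cpp
#include <functional>
#include <queue>
#include <vector>

struct QueueEntry {
    float priority;  // estimated total cost: lower is better
    int node;
    bool operator>(const QueueEntry& o) const { return priority > o.priority; }
};

using MinQueue = std::priority_queue<QueueEntry, std::vector<QueueEntry>,
                                     std::greater<QueueEntry> >;

// Pop entries until one names a node that hasn't been finalized yet.
// Duplicates are harmless: the cheapest copy comes out first, and any
// later copies are recognized as stale and skipped. Returns -1 when empty.
int popBest(MinQueue& queue, std::vector<bool>& visited) {
    while (!queue.empty()) {
        QueueEntry e = queue.top();
        queue.pop();
        if (!visited[e.node]) {
            visited[e.node] = true;
            return e.node;
        }
    }
    return -1;
}
```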
#7 (technical, definite improvement) Remember that hash table from #5? It was just a default implementation that came with Visual Studio 2010. Generic libraries often lose optimizations because they have to cover all cases, not just a specific case. I decided to quickly roll my own hash table implementation, one that would map a 64-bit pointer to a 64-bit integer. I chose to use open addressing with linear probing (hoping to maintain cache friendliness). For the hash function, I was inspired by Knuth’s choice of using the golden ratio, and thus took the pointer interpreted as a 64-bit unsigned integer, multiplied it by 11400714819323198485 (which is 2 to the 64th power, divided by the golden ratio), and then masked out the most significant bits so that the remaining index would exactly cover the current capacity of my hash table. Nice and simple. My hash table version was now just as fast as the array version, and I still retained the confidence that it would scale very well with map size.
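The hash function itself is tiny; here is a sketch of just that piece (the open-addressing table around it is omitted):

```cpp
#include <cstdint>

// 2^64 divided by the golden ratio (Knuth's multiplicative constant).
const std::uint64_t kGoldenMultiplier = 11400714819323198485ull;

// Map a pointer-sized key to a slot; capacity must be a power of two so
// the mask keeps exactly the low bits that cover the table.
std::uint64_t hashToSlot(std::uint64_t key, std::uint64_t capacity) {
    std::uint64_t h = key * kGoldenMultiplier;  // wraps modulo 2^64
    return h & (capacity - 1);
}
```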
#8 (technical, unknown improvement) In addition to writing my own hash table implementation, I also was a naughty scientist and made a different change at the same time, without measuring the effects of each change independently. Well, I might have measured them separately, but if so, I’ve completely forgotten the results. This change was to reduce the size of the elements that were stored in the priority queue, so that any rearrangements that the priority queue algorithm needed to make would involve fewer total bytes. What I was left with were queue elements that only stored a priority value and an index into a separate array containing all the other relevant details. This separate array would only ever have items appended, so there were no notable memory copying costs associated with all of that data. It seemed sound to me, so I left it in the code with all future tweaks.
#9 (algorithmic, 2x improvement) In tweak #2, I used the speed of the most recently traveled edge to estimate how long it would take to get to the destination. I wanted to make this estimation smarter by taking into account not just the edge itself, but also all of the further edges that I would be examining over the next few steps beyond the current node. So I implemented a preprocessing step to store, per node, a weighted average speed of nearby outgoing edges. Immediate edges were weighted more strongly, and distant edges more weakly. This worked a little bit better, but I was curious about the different methods of weighting I could utilize. In particular, I had arbitrarily picked how far out the weighting goes before it drops to zero, and was curious how changes to this distance would affect the performance. (I’ll also note that my brute-force preprocessing step took a long time given my initially chosen distance. Shorter distances would make preprocessing run faster, so I wanted to use shorter distances if I could get away with it.) So I dropped the distance by 25%; the performance benefit was a bit larger! 50%, better still. 75%, 90%, all better. Eventually, I reduced the distance so far that I was merely using the average speed of all immediately outgoing edges from each node. And that was twice as fast as when I was using the speed of the most recently traveled edge. Even better, the paths that the algorithm generated were closer to the best possible path! Double win! And naturally, the preprocessing step was super fast, since it only had to do a simple averaging of no more than three edges for each node.
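The winning preprocessing step reduces to a plain per-node average; a sketch of how that might look (illustrative names, with an assumed fallback speed for dead-end nodes):

```cpp
#include <cstddef>
#include <vector>

struct OutEdge {
    int target;
    float speed;
};

// Precompute, per node, the plain average speed of its immediate outgoing
// edges. Nodes with no outgoing edges keep a caller-supplied fallback.
std::vector<float> precomputeNodeSpeeds(
        const std::vector<std::vector<OutEdge> >& outgoing,
        float fallbackSpeed) {
    std::vector<float> avg(outgoing.size(), fallbackSpeed);
    for (std::size_t i = 0; i < outgoing.size(); ++i) {
        if (outgoing[i].empty()) continue;  // dead end: keep fallback
        float sum = 0.0f;
        for (std::size_t j = 0; j < outgoing[i].size(); ++j)
            sum += outgoing[i][j].speed;
        avg[i] = sum / static_cast<float>(outgoing[i].size());
    }
    return avg;
}
```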
#10 (algorithmic, no improvement) I hypothesized that reaching certain intermediate points in many paths was particularly difficult for A*, because in some cases it might severely overvalue a bunch of nearby wrong options and take forever to finally accept the right option. I thought that if I identified this most difficult segment along the final path, and then used that waypoint to help guide A* the next time it needed to find a path between the same origin/destination pair, it might do so much more quickly. My first attempt at identifying this critical waypoint was to simply pick the very first node in the final path that was traveled at the fastest speed the path would contain. Basically, instead of letting the pathfinder figure out on its own how to use the highway to get to the destination, just tell it to head straight for the highway at the waypoint, and only then bother to figure out how it’ll get to the destination. This didn’t help at all.
Second, I took a more rigorous approach, and identified the waypoint by measuring the number of iterations between when each node was inserted into the priority queue and when it was finally removed from the priority queue. The idea was that large values meant that the algorithm considered a lot of other nodes before finally deciding to try that node. If it had skipped all those other nodes and gone immediately for this chosen node, that could save a lot of time. But just as I write this now, I realize that I implemented this very poorly, and did not actually guarantee that the waypoint identified was even part of the final path! I suspect that I was often sending the pathfinding algorithm to a completely incorrect waypoint the whole time, leading to no performance improvement, and a reduction in path quality. Oops! Well, that gives me another avenue I could explore in the near future…
Regardless, this only works when you expect to pathfind between the same origin/destination pairs numerous times, can quickly identify that these pairs have been calculated before, and do not want to incur the memory penalty of storing the full path between each use.
#11 (algorithmic, failure) I was curious what would happen if I never let the remaining distance estimation get more pessimistic over time. So instead of using the average speed of outgoing edges from the current node (#9), I would use the fastest of these averages I had encountered on my way to reaching the current node. In short, this did not work at all. Slowed things down a ton.
#12 (technical, no improvement) Historically, using integers over floating points has been a way to squeeze performance out of a numerical algorithm. I know floating point computation has improved drastically since the 386 days when I first got into computers, but still I was curious. I also was curious about reducing the amount of memory I was using for all those numbers, due to cache concerns which are still very much with us today. So I switched out all my 64-bit doubles for 32-bit integers, and also tried out a few different integer square root algorithms. None of these changes had any real effect on performance, but it sure did make the code messier!
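For reference, one classic digit-by-digit integer square root looks like this; I no longer recall exactly which variants I tried, so treat this as an illustrative example rather than the code I tested:

```cpp
#include <cstdint>

// Digit-by-digit (binary) integer square root: returns floor(sqrt(n)).
std::uint32_t isqrt(std::uint32_t n) {
    std::uint32_t result = 0;
    std::uint32_t bit = 1u << 30;  // highest power of four in a uint32
    while (bit > n) bit >>= 2;     // start at the largest useful digit
    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}
```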
#13 (technical, marginal improvement) I decided to back away from integers, but try 32-bit floats instead of 64-bit doubles. This helped a little bit. Less memory to shove around, I presume.
#14 (algorithmic, marginal improvement) The square root operation that was part of the Euclidean distance calculation was still bugging me. So I tried a couple of other distance measurements, intercardinal and Manhattan. They were quite a bit better. I was expecting this to be largely due to faster computation of distance, since there was no square root involved. Looking at the numbers, though, it was almost entirely due to A* being more intelligent about which nodes to consider. Euclidean distance is unsurprisingly very optimistic. Being more realistically pessimistic about the shape of the map made the other two algorithms perform better, without compromising path quality much.
I was suspicious though. My auto-generated map happened to line up well with Cartesian coordinates, so it’s no surprise that Manhattan distance worked well. What if I rotated the map, though? Sure enough, a 45° rotation really ruined Manhattan distance. It lost its performance benefit, and the quality of its paths suffered noticeably. Intercardinal distance managed better, since it can handle 45° angles just as well as 90° angles. 22.5° angles on the other hand caused it to also lose its performance benefit, leaving it roughly tied with Euclidean distance in this worst case, though its path quality was still almost as good. Since most maps wouldn’t aggressively exercise its worst case, I found it acceptable to favor it over Euclidean distance.
My current implementation does still have a branch in the computation, something that Euclidean distance doesn’t need to worry about. I haven’t tried to construct a branchless variant yet, but that might be a way to squeeze out some improvement, if the branch is indeed causing a notable number of pipeline flushes.
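For concreteness, here are the three distance measures side by side ("intercardinal" as I use it is what is often called octile distance); the min/max selection is where the branch hides:

```cpp
#include <algorithm>
#include <cmath>

float euclideanDist(float dx, float dy) {
    return std::sqrt(dx * dx + dy * dy);
}

float manhattanDist(float dx, float dy) {
    return std::fabs(dx) + std::fabs(dy);
}

// Move diagonally (cost sqrt(2) per unit) while both axes have distance
// left, then straight along the longer axis.
float intercardinalDist(float dx, float dy) {
    float ax = std::fabs(dx), ay = std::fabs(dy);
    float lo = std::min(ax, ay);
    float hi = std::max(ax, ay);
    return 1.41421356f * lo + (hi - lo);
}
```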
#15 (technical, unknown improvement) Being a bad scientist again, I improved my queue data array from #8 to not use quite so much space. When I pop an item off the queue and access the array to get the supplemental data, I know that I will never need that data again. So when I push the first neighboring node onto the queue as a new item, I can reuse that slot in the array, instead of always appending to a new slot at the end of the array.
#16 (algorithmic, failure) I didn’t like the way the algorithm was currently having to travel over numerous short edges simply because what was normally a single long edge had been broken up by a lot of nodes leading off to side roads or driveway/parking lot entrances. So I did some preprocessing to create additional edges that would go straight from the start node of the segment to the end node; the simulation cost would be the same as taking all the short edges, but the algorithm would only need to process the one edge. Unfortunately, I couldn’t figure out how to properly make the algorithm figure this out. It still insisted on processing all the short edges and the long edge, making the overall process take longer.
I tried to shift the estimates so that the algorithm would prefer quick progress but met with limited success. Well, that’s not entirely true. I did discover that I could speed up the performance by ignoring the real intermediate cost altogether and focusing just on the estimated remaining cost, but this worked just as well for the version of the map without shortcut edges as for the version with the shortcuts. It also hurt the quality of the paths quite a bit, so I gave up on that for the moment.
#17 (technical, 3.7x improvement) The last tweak I performed wasn’t really making it run faster, just making it use more of my computer. That is, I added a trivial implementation of multithreading. With four cores, I was able to get a nearly 4x improvement, so I was happy (but not surprised) about that. I had spent effort during the whole process to ensure that the algorithm would be easily parallelizable. In particular, I avoided storing temporary pathfinding data on the nodes and edges themselves, insisting instead on using separate arrays and hash tables. That way, the map data is purely read-only, and each thread can happily work on its own local data without stomping on any other thread’s data.
If I did want to store the pathfinding data directly in the node, I could have stored an array per node, with one element per thread that could be doing pathfinding calculations. Each thread would only use its own index into those arrays, and so they still wouldn’t be stomping on each other. Technically. In practice, I’m pretty sure I’d end up with false sharing in this case, since the data would likely share a single cache line, and cores would have a mess of a time telling each other about changes made to cache lines that they’re all working on. So while that scheme would avoid the cost of using a hash-table, I doubt it would be worth it in the end. It also has the unfortunate effect of once again bloating memory usage substantially.
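To illustrate the false-sharing concern (using C++11's alignas, which postdates the VS2010 setup described here): four unpadded 4-byte slots fit in a single cache line, while padding gives each slot its own line at a 16x memory cost. The 64-byte line size is an assumption that holds on typical x86 hardware:

```cpp
// Unpadded: four threads' 4-byte slots for one node share a 64-byte
// cache line, so a write by one core invalidates the others' copies.
struct PerThreadCost {
    float cost;
};

// Padded: each slot occupies a full line, removing the sharing at the
// price of 16x the memory for this field.
struct alignas(64) PaddedPerThreadCost {
    float cost;
};

static_assert(sizeof(PerThreadCost) == 4, "four slots per cache line");
static_assert(sizeof(PaddedPerThreadCost) == 64, "one slot per line");
```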