NavMesh visualization using SIMD

The NavMesh defines the area's AI agents could traverse. In order to build a good NavMesh, we need a way to visualize them in the engine. This version of the Valve Source Engine did not feature any debugging tools, as this was a production build used for retail. I had to come with my own solution for drawing them in the most performant way possible.

Introduction

The engine we are working with is the Source Engine used for the game Apex Legends (Originally written by Valve Corporation, heavily modified by Respawn Entertainment).

This engine features an implementation of the Recast & Detour navigation system,

which is an open-source navigation mesh construction toolset written in C++.

The goal of visualizing the AI Network and NavMesh are:

Allowing coders/scripters to work more easily with AI.
Debugging the standalone NavMesh editor and systems within the engine.
Debugging AI agent behavior for each hull type.
Debugging NavMesh for each hull type (5 types, each having its own mesh).

The engine

This engine takes advantage of C++ polymorphism, which expose their functionality through virtual functions. I came across 3 debug render functions while I was reverse engineering some of these virtual function tables: "v_RenderLine", "v_RenderBox" and "v_RenderWireframeSphere ". In the image below, you can see 2 of these functions in action.

Obtaining the functions

The picture below shows how a function address is resolved.

Since these functions were not part of a virtual function table, but were being called from a virtual function, we have to obtain the address of the function using a signature scan, which scans a certain region of memory for a unique byte pattern. We created a dedicated utility class for Microsoft's PE (Portable Executable) file format, in which 'g_GameDll' is our handle to the actual executable image.

The "CModule::FindPatternSIMD" method resolves the address of a function with a byte pattern and a mask, the mask is the wildcard (were "x" is the explicit byte, and "?" could be any byte). This in combination with the utility class allows us to:

Resolve the address of any version of the image as long as the function or compiler settings hasn't been modified.
Resolve the address, even when the image has ASLR (Address Space Layout Randomization) enabled.
Only scan relevant regions of memory, in this case the .text (executable) segment of the image.
Very fast scanning! The system could resolve 300+ byte patterns in 1.4 seconds (1353717704 clock cycles) over a 10MiB image, thanks to SIMD. Dispatching them to parallel tasks would reduce the time to the slowest task invocation (300ms). Due to feature limitations during DLL init, we unfortunately could not take advantage of this without a major refactor of the SDK code base. For now this will do!

Structures

Once we got the pointers to the functions, we need to figure out how to call them. These particular functions take several structures as parameters (OverlayBox_t::Transforms, Vector3D, etc). This step is quite involving, it requires reverse engineering live memory and the static image. I made an article about reverse engineering classes and structures here.

Once reverse engineered, we get something like this:

Here we have an OverlayBase_t type, which is used for all overlay types (box, sphere, triangle, capsule, etc..).

We are interested in the OverlayBox_t structure. If we take a closer look, we can see another structure

called Transforms. This structure defines the transforms of the box (scale, rotation, position).

Transform contains an array of 3 Vector4D's.

Vector4D is a 16 byte aligned datatype (128 bits):

Now that we know the datatype is 16 bytes aligned, we can take advantage of this later by using a union:

The union declaration instructs the compiler to use the same offset for members declared in the scope of this union.

Transforms now has 2 members declared union (xmm and vec), _m128 is used for Streaming SIMD Extension instruction intrinsics, and Vector4D is used internally for our box transforms (scale, rotation, position). These 2 members are exactly the same size and share the same memory location. This means that, whatever we change xmm[2] to, will be reflected in vec[2].

Singletons

Game engines have a lot of singletons, which are typically used for the custom memalloc system, filesystem, the BSP/mesh loader, etc. Since this game has 5 types of NavMesh's, and an AI Network for very large AI entities, these have to be stored somewhere accessible, to be used by various systems around the AI system.

Using our reverse engineering skills, and Portable Executable utility class from above, we can obtain a pointer to the pointer of the AI Network in a similar way of getting pointers to functions compiled into the game executable.

We also have 2 extra pointers! One to the array containing the pointers to our 5 NavMesh's, and a NavMesh query pointer. Notice the dtNavMesh structure, this has been modified from the original Recast & Detour library, and this is my attempt on reverse engineering the restructured class, and also figuring out some of the new fields which do not exist in the original library.

Rendering and optimizations

Now that we got everything we need to write our render functions, we can start out by writing a function that renders all script nodes stored in the address our AI Network pointer is pointing at.

Notice how we use the __mm_set_ps intrinsic. This SSE intrinsic constructs a __m128 type from 4 floating point scalars in a single instruction. Since we used the union trick, we don't have to deal with any conversion, type casting or creating inline wrappers for __mm_set_ps to take a Vector4D . The internal function v_RenderBox takes a const reference of our computed OverlayBox_t::Transforms type, and specifically uses the Vector4D member to determine the transforms of the render box in the engine.

If we take a look at the disassembled function, we can see our SSE instructions used to set our transform fields. There is one problem however, we are subtracting the constant '50.f' 3 times using the 'subss' instruction after we moved the origin from our structure pointer into the XMM register using the 'movss' instruction, which is the operation 'pScriptNode->m_vOrigin.x - 50.f' in source code. Even though modern compilers do a fantastic job optimizing programs, explicit manual SSE programming is required to yield the best results.

To do so, we can subtract the constants from our origin scalars in a single instruction:

Here we create a subtraction mask in reverse order, we also pack our origin scalars into a single XMM register in reverse order, both using the _mm_setr_ps intrinsic. The reason we pack them in reverse order, is because the SSE instructions packs our scalars in little endian, including the order of the parameters. We use the reverse version to pack them in the order of the structure. The _mm_sub_ps intrinsic subtracts the 'xSubMask' constant from our packed origin scalars stored in the 'xOrigin' register. The ordered result is yielded to the 'xResult' register.

If we take a look at our assembled function again, we can see that we now only have 1 subtraction operation, the highlighted 'subps' instruction, subtracting from xmm3 ('xOrigin') using the constant 'xSubMask'. The overall instruction count has been reduced quite significantly as well, as we no longer have to move the individual result of the previous 'subss' instruction into our registers before each 'movaps' instruction.

The last instruction is what actually sets the scalars into the memory of our structure, using the pointer reference stored in the RBP register. This operation is performed 3 times, as we have an array of 3 128 bits vectors defining the final transforms of the box.

We can optimize the code even more by declaring the constants as static const, construct the box's mins and max's at compile time using SIMD, and by caching the result of 'pNetwork->GetNumScriptNodes()'

Remark: AVX (Advanced Vector Extensions) could had been used to make this function even faster, but since we are essentially a 'mod' for the game, we cannot break the min-spec barrier of the base game, which only requires a x64 CPU with the POPCNT instruction, SSE, SSE 2, SSE 3 and SSSE 3 (Supplemental SSE 3 Instructions). AVX is a much newer technology and will break compatibility with older processors that are supported by the base game.

Results

The 'ai_script_nodes_draw' global, is a console variable. We use this cvar to determine the index of the AI node we want to draw. Setting this cvar to '102' for example, will make the engine render AI script nodes from the index 102 onwards. We can also set a range using the cvar 'ai_script_nodes_draw_range', for example, node 10500 to node 11000, which will only draw nodes within this range. A range of '0' means no user defined limit (the engine's debug overlay limit will be used instead).

We can now start using the very fast render function.

The green boxes visualize the AI script node position.
The red lines visualize the link to the next AI script node.

We can cover larger area's as well, with index shifting and range limiting using the console variables we implemented.

Extending functionality

Here we took our original 'CAI_Utility::DrawAIScriptNetwork' function and added a new feature the user of the engine could toggle in the console. When the global console variable "ai_script_nodes_draw_nearest" is set,

the code will render links from the current node to the nearest node.

This code has one flaw, however.. When we call 'PackNodeLink' with 2 node indices, (2, 528), and we swap them, so we have (528, 2), the node packing results will deviate, resulting in a different hash, causing duplicate links to be drawn, which will greatly reduce overall render performance..

to solve this problem, we simple check if value 'a' is smaller than 'b', and swap if this condition is met, so we always end up having one possible combination of node indices:

This solves the defect of rendering duplicate links between 2 nodes.

We can now draw links from a certain node to its nearest node. Notice how several neighbor nodes have a link to the center node. This particular engine has a script API featuring the Squirrel programming language, which gameplay programmers use to create game mechanics without touching the native engine code (similar to Unity's scripting API and Unreal Engine's Blueprint system). This API exposes a function called "NavMeshNode_GetNearestNodeToPos", a gameplay programmer could use this function to traverse an AI vehicle from its current node to the nearest node. Since we utilize the same algorithm, the visualization will actually represent the behavior of this function, allowing the gameplay programmer to see what its code will do.

This is the boilerplate code for packing node indices. Parameter 'c' and 'd' are unused at the moment.

to solve this problem, we simply check if value 'a' is smaller than 'b', and swap if this condition is met, so we always end up having one possible combination of node indices: