
TLDR

In which I speed up an octree algorithm by ~9000% via data-orientation (cache coherence, buffer pre-allocation, etc.) and multithreading using Unity's Jobs system... only to ultimately realize that my technical approach is irreconcilable with my design goals.

repository


Initial Design Goals

I wanted to experiment with simulating the material properties of working with a malleable, dynamic substance like clay. In modern 3D modelling software, the mesh is king. This means that all the physicality and dynamic properties of a material are constrained to an infinitesimally thin shell -- one which is often completely static anyway until the artist decides to employ their virtual brush.

My desire is to create a simulationist-oriented 3D modelling experience in which artists can stretch, tear and layer material with their hands (using VR controllers). The material will have a physical life of its own: reacting to touch not just at the point of contact, but as a whole interconnected body, morphing over time in reaction to gravity and other physical stresses.


Focus of devlog

At the time of writing, I have only tried one technical approach to this problem. Unfortunately, it has become clear that this approach won't allow for the simulation and rendering fidelity required to achieve the goals I laid out above. Nevertheless, the implementations and optimizations I made in order to reach this conclusion with confidence were personally educational and worth documenting in and of themselves. These optimizations will serve as the primary topic of discussion for the remainder of this devlog.


Technical Challenges

To create a realistic clay simulation that could deform, tear, and come back together, I knew I'd need some sort of particle simulation. The question immediately arises: how does one render a particle sim so that it appears as a single, uniform body? If the edges between particles are relatively stable and can be filled in to form a surface, then the solution isn't so involved: just procedurally create the mesh before rendering, or in the geometry shader, based on the particles' edges/joints (I imagine cloth sims work this way). But when the sim is fully three-dimensional -- when there are no intrinsic properties of the sim that lend it to being represented as surfaces/meshes (as with cloth) -- a real-time solution is not so straightforward.


Solutions

A typical solution is to first voxelize the particle sim into discretized chunks, then mesh the voxelized density field using the Marching Cubes algorithm. I've implemented Marching Cubes in the past and wanted to see if I could find another solution to the problem. Marching Cubes works on a discretized, uniform grid, which makes the results appear blocky and uniform -- not the organic material vibe I was going for.

I wanted to see if I could create a performant enough sim without employing a grid at any stage -- using purely Lagrangian (particle-based) simulation and rendering. The only rendering solution that could take non-uniform particles and make them appear as a single body, while generating all the normal information required for 3D lighting, was a ray marching shader with signed-distance functions. I knew that a brute-force implementation would not be sufficient performance-wise, so this would inevitably become an exercise in optimization to see if these solutions could even work.


Optimizations

I needed a spatial partitioning system that would drastically reduce the O(n) cost of each particle's neighbour queries. There were two options I was grappling with: a multi-layered grid approach, or an octree. I went with the latter mainly because I wanted to gain experience implementing an octree. I also wanted the freedom for the particle sim's density and locality to be completely heterogeneous if need be -- octrees are sparse and dynamically respond to variations in density.

Results With 300 Particles

Approach                                       Milliseconds (ms)
Naive brute force                              94
Naive octree (single-threaded, unoptimized)    10

The first results are super lackluster. Two-thirds of a frame (10 ms) is used to simulate just 300 particles!? This should have been a major red flag, but I decided to go ahead and optimize the octree algorithm as much as I could anyway, to give this approach a fair chance.


Making the Octree data-oriented

Now I started to employ Unity's new data-oriented features. I used their native containers instead of .NET's to ensure buffers were actually contiguous, with no implementation-detail weirdness happening behind the scenes. I employed structs over classes (values > references) to avoid cache misses. I shrank my octree nodes, storing the indices of child nodes implicitly using a fixed ordering, to increase the amount of octree data that could fit in cache and reduce misses. I removed unnecessary copying using ref structs and local functions. I then Jobified (multithreaded) everything except octree insertion, which was still compiled with Burst. For every particle, I sorted octree query results by distance using an insertion sort (which is very fast for small arrays, as it's cache-friendly) and only simulated the effects of the closest particles.

Results With 600 Particles

Approach                                       Milliseconds (ms)
Naive octree                                   88
Jobified, data-oriented octree                 1.5

It became clear from this test that, while the optimizations made huge improvements to the simulation's performance (the algorithmic complexity being effectively identical), I would still not be able to simulate anywhere near the number of particles needed for a high-fidelity clay simulation. Now that I had some idea of how many distance tests and iterations would be required on the GPU side, I ran some signed-distance ray marching tests. I discovered that there, too, this approach had no chance of scaling well enough, even with more drastic optimizations.


Conclusions

In retrospect, because of the type of simulation I was creating -- one in which each particle has a very large, tapering area of effect -- spatial partitioning doesn't help that much, as each particle covers such a large portion of the sim. You can only partition so much when the atomic units of the sim are relatively huge. I think I can comfortably say that a Eulerian, discretized approach is the only viable real-time solution going forward (although I can still experiment with mixing and matching sim types).

I was pretty amazed at how much faster the data-oriented, Jobified algorithm was. It was amazing seeing significant increases in performance result from pretty modest changes, like removing a couple of unnecessary floats from an octree node (so that more useful data was held in cache).

Aside from the ~9000% performance increase, what's cool about the data-oriented octree algorithm I ended up with is that it could be super easily uploaded to the GPU and used in a shader. The entire system consists only of blittable types, all stored in flat, contiguous buffers instead of huge nested hierarchies. I had already begun converting octree nodes into 4x4 matrices to send to the GPU before I discovered that this approach could never be performant enough.


repository

