# Compute shader: Voxel cube mesh generator

## Introduction

Originally my plan was to create a ray tracing water caustics shader for unity. This however takes a lot of ray casts to accomplish, which aren’t the cheapest thing to do. So in order to achieve this effect I needed a technique that could do a lot of calculations really fast. The technique I found to achieve this effect was with the use of compute shaders.

After spending a long time learning about compute shaders and how to implement them, I came to the conclusion that there wasn’t enough time left to create the water caustic shader. Therefore I decided that I would instead focus on using compute shaders to improve the generation speed of the voxel cube mesh generator that I made previously.

In this report we will go over what a compute shader is, the basics of using a compute shader and how I used compute shaders to improve the generation speed of my voxel cube mesh.

## What is a compute shader

In the process of creating a game or simulation you might encounter situations in which you need to do a lot of repetitive calculations in order to get the desired effect. These calculations take up a lot of time which can greatly effect the performance of your program. This is where compute shaders come into play.

Compute shaders are scripts and calculations that are run on the GPU instead of your CPU. On the CPU code is run single threaded and in sequence, the next line of code waits until the previous line of code is executed. Calculations can take some time and this adds up quickly when you need to do a lot of calculations.
The GPU however is made to run parallel threads, doing multiple calculations at the same time. It isn’t as flexible as the CPU but it is a lot faster. Usually the GPU is mainly used for rendering high resolution pictures and textures, but by using a compute shader you can also run other calculations on the GPU like machine learning and computations.

In short a compute shader is a way to offload calculations from the CPU to the GPU in order to process it faster by dividing the task over multiple threads. First you send data from the CPU to the GPU, then you let the GPU do the calculations and afterwards you read the results back into the CPU for further use.

A compute shader contains one or more kernels which are the main “update” functions of the shader, when you dispatch a kernel it will run its code once for every dispatch thread id. You split these treads into work groups that run parallel with each other. How you do that exactly will be explained in the “How to use a compute shader” chapter.

Compute shaders use the High-Level Shading Language (hlsl) coding language by default so to get started you might need to learn a bit of hlsl.

When you create a new compute shader you will notice four main things namely: at the top of the script is a pragma kernel named CSMain, below that is a variable RWTexture2D<float4> named Result, next is line that says [numthreads(8,8,1)] and directly below that is a void CSMain function with a simple calculation inside that sets the color of the “result” variable.

``````// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain

// Create a RenderTexture with enableRandomWrite flag and set it
// with cs.SetTexture
RWTexture2D<float4> Result;

void CSMain (uint3 id : SV_DispatchThreadID)
{
// TODO: insert actual code here!

Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}``````

The pragma kernel name decides which function will be run when a kernel is dispatched. One compute shader can have multiple kernels which execute different functions. You do this by repeating the line below it, just with a different name. One thing to keep is mind is that when dispatching a kernel from the CPU you do that with the kernel ID and not the kernel name (0 for the first, 1 for the second etc.)

The CSMain function is the fuction that the kernel CSMain wil run when dispatched. So long as both have the same name, the name itself doesn’t matter so you can name it whatever you want to keep it organized. It has a default argument “uint3 id” that gets filled with the “SV_ThreadID”. This keeps track of what the id of the current thread that you are running is.
This might be the first time you see a uint3 so to explain: A uint3 is a variable that contains three unsigned integers x, y and z. An unsigned integer is simply an integer that cannot go negative but has double the max value. If there is a need for it you can chance the uint3 into an int3 which has halve the max limit but can go negative.
You use the uint3 id to differentiate between the different threads in your code.

Above that CSMain fuction is the [numthreads(8,8,1)]. This dictates the GPU work groups and how they are divided per axis. I’m not 100% sure about the optimal use of this but from my experience it’s recommended that you keep the total threads at 64. You get the total by multiplication (in this case 8*8*1 = 64).

The RWTexture2D<float4> Result is a 2d array of float4 colors corresponding to pixels. In the example code it is used to form a colorful fractal. This is one of the ways to transfer data between your cpu and gpu. You first set a texture to this variable in the CPU, edit it in the GPU and read it back in the CPU. The RW means that you can edit the variable, without RW it will be read only.

## How to use a compute shader

The first step in using a compute shader is to declare a compute shader variable in your script. The easiest way to do this is by making the variable public and dragging the compute shader into it.

Next step would be writing the code you want to run in your compute shader script. You might need to learn a bit about hlsl to know what variables you can and cannot use.
Something to note is that unlike a c# script you the GPU script only has basic computational functions and doesn’t have most functions you would find in a c# script. In order to use a cube for instance you first need to define what a cube is yourself.
You can use functions inside a compute shader but keep in mind that only the function linked to a kernel can be dispatched from the CPU, you can use the functions you have created inside the function linked to the kernel as usual.
One other thing of note is that unlike c# scripts everything in a compute shader is in order, by which I mean you cannot use a function/variable/struct before you have declared it. If you try to run a function that hasn’t been declared above it in the code it will not work.

You can use multiple different kinds of variables in the compute shader, to set a value to a variable you can either do it inside the compute shader or set the value in the c# script using for example shader.SetInt(“int name in shder”, int value).
Two of the kinds of variables are special, those being the Texture variable and the ComputeBuffer variable. Whilst you can only set the value of the other variables and cannot read them afterwards, you can read back the values of these variables. In a way they are what you get back out of the compute shader.

``````//Setting the data array to the buffer
buffer.SetData(data);
//Getting the data back from the buffer into the data array
buffer.GetData(data);
``````

Another way they are special is that when you set the value of a texture/buffer you need to specify which kernel is going to use it. So if you are planning to use kernel 0 you need to declare it, for example shader.SetBuffer(0, “compute buffer name in shader”, ComputeBuffer).
A few side notes with structured buffers, when you create them they take up space in memory, after you are done using them you should release them to avoid a memory leak. This is simple to do by using the StucturedBuffer.Release() function.
When you declare a new ComputeBuffer, you need to specify the size of the array and the size of every entry in bytes. The size is simply the same as the array of data you want to insert into the buffer.
Do keep in mind that you need to declare the struct you are using in both the c# script as well as the compute shader.

After you have written your compute shader code and set al the variables you are going to use in the kernel you need to do one final thing, calculate the number of operations you want to do per dimension. For instance if you plan to go through a 2d array/ texture with 160x and 80y you divide them by the number of GPU work groups you are planning to use for the x, y and z axis. In the case of (8,8,1) you divide both the x and the y numbers by 8 and ceil the results to int.

``````//Example of calculating the threads

A slight warning though, because you are rounding up you might get more threads than intended. This might cause some bugs as it did in my project. It is recommended that when possible you keep the thread groups divisible by the number of GPU work groups per axis.

Now you can dispatch a kernel. You do this by using shader.Dispatch(kernel number, amount x, amount y, amount z). If you’re not planning to use z or even y keep the value at 1.

``````//Example of dispatching kernel 0 of shader

Afterward you either use the texture/buffer directly, use it in another (compute)shader or read it into another form of data.

## My product

In a previous assignment I created a voxel cube mesh generator that creates a surface of grass on top of a big of stone with caves going through it. Whilst the results I got from the generation were to my wishes the generation of the data surely wasn’t at all. To create a dataset of 100 by 100 by 150 cubes took around 33.5 seconds not including creating the meshes themselves. My goal was to speed up this generation using compute shaders.

The generation follows a few steps namely:

• Creating surface Relief using perlin noise
• Adding grass and dirt to the surface by checking the block above for air, grass or dirt
• Creating an initial state for the cavemap by filling it semi randomly
• Creating a cave map using cellular automata from the initial state
• Combining the surface data and cave data into one dataset
• Turning the dataset into meshes to create a world

I changed every step named above in order into compute shaders except for the mesh creation because I ran out of time.

Since I was using 3D arrays I split the GPU work groups to (4,4,4). This caused a bug however when I used it on my dataset of 100 by 150 by 100 with chunks of 25*25*25 but because 150 isn’t neatly divisible by 4 and with rounding you are left with 150/4 = 37.5 -> 38 and 38 * 4 = 152. This caused a weird bug on the first two layers of my dataset. The simple solution I used was making every chunk 16*16*16. This would make it so that axis totals are always divisible by 4 no matter how many chunks I use.

## Surface relief

I wanted to use perlin noise to create a semi-random surface, the method is quite simple. I take a perlin noise value based on x and z and multiply it by the desired relief amount. If the y is lower than the result make it solid and otherwise make it air.

``````if ((y < surfaceHeight * ySize - (surfaceRelief * float(2)) + noiseValue * surfaceDepth))
{
return 1;
}
else
{
return 0;
}``````

In the CPU version I used the Mathf.PerlinNoise() function for this however compute shaders don’t have access to this function. Instead I needed to get a perlin noise texture, set it in the compute shader as a variable and read the red value of the pixel corresponding to the x and z as the perlin noise value.

Another challenge was the data itself, I could not set and read an array to and from the compute shader, especially a 6D array that I was using for my dataset so instead I used a 1D array of cube structs that only contain one int value block type. I put this array into a StructuredBuffer and send it to the compute shader, after I finish all the steps I read the data back and translate it into the data structure my mesh generation is familiar with.

This step is easy, I only need to read what the block above the current block is and chance the current block based on the one above. The first time will only create a layer of grass, the second time will add a layer of dirt below that and after that it randomly puts dirt underneath or doesn’t to create a bit a relief in the dirt itself.

There is one problem with this process though, a compute shader doesn’t have a random function. But I found a formula that creates a semi random value between 0 and 1 based on a float2 input and using that function I have access to random values.

``````float rand(float2 pos)
{
float result = frac(sin(seed / 100.0f * dot(pos, float2(12.9898f, 78.233f))) * 43758.5453f);
seed += 1.0f;
return result;
}``````

## Initial cave map

Initially I still created the random cave map in the c# code, but when I noticed that this had quite the impact on the resulting speed after every other thing was changed into a compute shader I turned this into a compute shader as well. I thought this would be simple, I fill the map randomly with air and stone based on a variable % using the same random fuction I used in grass and dirt.

``````//Surface has a lower chance to have caves
if ((id.y > ySize * surfaceHeight - surfaceRelief && rand(rnd) < surfaceSolid) || rand(rnd) < initialSolid)
{
return 1;
}
else
{
return 0;
}``````

I did encounter two problems with this however, the first and not so important one was that even the surface got a lot of caves through it and I didn’t want that. This was an easy fix however by changing the chances for the initial caves state on the surface level to be less. The bigger problem was that my random function was 2D and not 3D, so while the x and z axis were random, all y axis were the same all the way to the top. After a lot of trying I found that by creating a random float2 from x, z and y, z then using that for a random value to give a desired random effect in all axis.

``````//Get a random float2 based on id so that all three axis are used
float2 rnd = float2(rand(float2(id.x, id.z)), rand(float2(id.y, id.z)));`````` This step is the preparation for the caves, 80% is filled and even more is filled on the surface.

## Creating caves with Cellular automata

This was the most important thing to change into a compute shader since almost the full 33.5 seconds of generation consisted of this step.

To achieve this I set the initial caves map as both the old read only caves map and the soon to be changed read write caves map, I did this so that the parallel changes of the changing caves map wouldn’t effect the current step of cellular automata.

For every cell I check the 26 cells surrounding it, count how many are alive and dead and then turn the current cell off or on based on the number of alive cells around it and the current state of the cell.

``````//If the cell is dead and the count is bigger than the birthcount the cell now lives
if (cubesOld[id.x + id.y * xSize + id.z * xSize * ySize].blockType == 0 && count > birthCount)
{
return 1;
}
//If the cell is alive and the count is smaller than the deathCount the cell dies
if (cubesOld[id.x + id.y * xSize + id.z * xSize * ySize].blockType == 1 && count < deathCount)
{
return 0;
}
//If neither if statement was succesfull do nothing
return cubesOld[id.x + id.y * xSize + id.z * xSize * ySize].blockType;``````

This had a problem however for the cells on the edges of the dataset. In order to check neighboring cells I added or subtracted 1 from an axis. This could however go into the negatives when at the 0 x, y and/or z edge and since the id of a cell is an uint3 that cannot go negative. Luckily I learned that you can just chance the uint3 into an int3 as the argument for the kernel function without problems so long as that it didn’t surpass the int limit.

Now to run the simulation I dispatched this kernel multiple times to simulate steps in the cellular automata, each time loading the data from the RW cave map and saving it into the read only cave map and the RW cave map before a dispatch. After this is completed I have a cavemap.

## Combining the surface and the caves

If there is one thing the compute shader is great for it would be a simple but repetitive task like this one, check the cell id on both maps, if either is air return air otherwise return the surface value.

``````//Check if the cell in either array is 0
if (surface[index].blockType == 0 || caves[index].blockType == 0)
{
return 0;
}
//If no 0 was found keep the cell as it was
else
{
return surface[index].blockType;
}``````

If you were to do this in c# you would have nested for loops for potentially millions of cells slowing the whole program down. The GPU however can do multiple in parallel at very fast speed.

Besides, both the surface and the caves are still in the StructuredBuffer data form for the GPU and turning this data back into data the CPU can use is one of the most expensive steps that will be left in the end, so doing that twice for both the surface and the caves would be inefficient.

## Transforming the data back

In the end this left only one last thing to do, turn the StucturedBuffer back into the data the rest of the code can use. First I read the back using StructuredBuffer.GetData(array you put the data in) after that it was a “simple” case of turning a 1D array into a 6D array which involved a lot of for loops, which is why this step was so expensive.

``````//Transform the linear buffer array to the 6 dimensional data array
data[cx, cy, cz, x, y, z] = cubes[x + cx*chunkSize
+ (y + cy*chunkSize) * (int)chunksPerAxis.x*chunkSize
+ (z + cz*chunkSize) * (int)chunksPerAxis.x*(int)chunksPerAxis.y*chunkSize*chunkSize
].blockType;``````

## The result

The end result was a generation time of about 0.15 seconds instead of the 33.5 seconds it took before the use of compute shaders for 1.5 million cubes. Of which, preparing the buffers is 0.01 seconds, doing the cellular automata is 0.04 seconds and the reading of the data is the longest (almost 50% of the total time) 0.065 seconds. And as a bonus it seems that the data generation time is as far as I can see BigO(n) so when I create a dataset of 15 million cubes instead of 1.5 million, the data generation step takes about 1.5 seconds.

The step that scales the worst right now is creating the meshes based on the dataset. So if I had more time to work on this project that would be the next thing I would try to turn into a compute shader.

## Conclusion

At first I planned to make a water caustics shader, however after the weeks progressed I noticed that this wasn’t possible with my current knowledge within the given timeframe.

During the first few weeks I learned how to use a compute shader to speed up calculations. When I noticed my lack of time I decided to switch to using compute shaders for speeding up a previous project.

I the future I plan to make further use of compute shaders because the speed difference it makes cannot be understated. The ease at which I could speed up my code by 220 times made the few extra steps well worth it.

In conclusion, I’m happy with what I managed to achieve and learned yet another useful technique for increasing the speed and performance of my projects in the future.

## Future work

If I plan to expand on this project, I would focus on also transforming the mesh creation into a compute shader since this is now by far the longest step still remaining in my project.

Other than that I would focus on the adding of more block types, more interesting terrain and perhaps biomes with different generation variations.

https://github.com/TyroRNG1324/Compute-voxel-mesh

## Sources

Kuri, D. (2018, May 3). GPU Ray Tracing in Unity – Part 1 t/m 3 – Three Eyed Games. Three Eyed Games. http://three-eyed-games.com/2018/05/03/gpu-ray-tracing-in-unity-part-1/