GPU ray tracing with OpenCL

Hello!

Second post coming up! This week's discussion is about my work on the Math and OpenCL libraries. I will start with some math.

Math:
So far I've implemented the following:

Vector, Normal, Point and Vector4 classes
Matrix4x4 class with all necessary transformations including Inverse, Transpose, Rotate, Scale, etc.
Matrix4x4 and Vector4 are fully SSE optimized, and the Vector/Normal/Point classes all come in SSE and non-SSE versions (the non-SSE versions are there for memory usage and speed tests)

The first (and most important) decision I had to make was: "To SIMD or not to SIMD". I decided on the full SIMD approach. SIMD stands for Single Instruction, Multiple Data. The main idea: a single CPU instruction is executed and processes multiple data elements at once. Here's some pseudo code to explain:

1: Normal version
float x1 = 2.0, x2 = 4.0;
float y1 = 3.0, y2 = 6.0;
float z1 = 4.0, z2 = 8.0;
float w1 = 5.0, w2 = 10.0;
float add = 23.0;

// 8 instructions (excluding =)
float resultX = x1 * x2 + add;
float resultY = y1 * y2 + add;
float resultZ = z1 * z2 + add;
float resultW = w1 * w2 + add;

2: SIMD version
float4 xyzwOne(2.0, 3.0, 4.0, 5.0);
float4 xyzwTwo(4.0, 6.0, 8.0, 10.0);
float4 add(23.0);
// 2 instructions: multiply, add
float4 result = xyzwOne * xyzwTwo + add;

// Or just use 1 instruction: mul_add
float4 fusedResult = mul_add(xyzwOne, xyzwTwo, add);
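To make the pseudo code above concrete, here is a minimal sketch using SSE intrinsics. It assumes an x86 CPU, and mulAdd and simdExample are hypothetical names made up for this illustration, not the classes from the library described in this post. Plain SSE has no fused multiply-add, so the sketch uses a separate multiply and add; a true single-instruction version would need FMA support (for example _mm_fmadd_ps) on CPUs that have it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical helper: computes a * b + c on four floats at once.
// Two SSE instructions: one multiply, one add.
inline __m128 mulAdd(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// Runs the example from the pseudo code and writes the four results to out.
inline void simdExample(float out[4]) {
    const __m128 xyzwOne = _mm_setr_ps(2.0f, 3.0f, 4.0f, 5.0f);
    const __m128 xyzwTwo = _mm_setr_ps(4.0f, 6.0f, 8.0f, 10.0f);
    const __m128 add     = _mm_set1_ps(23.0f);  // 23.0 in all four lanes

    // Unaligned store, so out does not need 16-byte alignment.
    _mm_storeu_ps(out, mulAdd(xyzwOne, xyzwTwo, add));
}
```

Calling simdExample with a four-element float array fills it with the same values as resultX, resultY, resultZ and resultW from the normal version.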

For more in-depth detail on this topic:
http://fastcpp.blogspot.de/
http://www.thomasdideriksen.dk/misc/SIMD/sse_in_real_applications.pdf
http://download.intel.com/design/PentiumIII/sml/24504301.pdf
etc...

For the fun of it, I also implemented plain non-SIMD versions of the most common data types, both to compare memory usage and for (speed) testing.

The next step was to test for correctness and speed. The most important and fun part of using SIMD is the potential speed gains we can get! For this I wrote some unit tests to compare various commonly used operations:

Vector Addition
Vector Multiplication
Vector Division
Dot Product between two vectors
Cross Product between two vectors
Matrix4x4 / vector multiplication
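As an illustration of two of the operations listed above, here is one common way to write a 3-component dot and cross product with SSE intrinsics. This is a sketch under assumptions: the vectors are stored as (x, y, z, w) in a single __m128, the w lane is ignored, and dot3 and cross3 are hypothetical names, not necessarily how the library in this post does it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Dot product of the x, y, z lanes; the w lane is ignored.
inline float dot3(__m128 a, __m128 b) {
    const __m128 m = _mm_mul_ps(a, b);  // (ax*bx, ay*by, az*bz, aw*bw)
    const __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));
    const __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2));
    // Horizontal sum of lanes 0, 1 and 2, read back from lane 0.
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, y), z));
}

// Cross product of the x, y, z lanes using three shuffles.
inline __m128 cross3(__m128 a, __m128 b) {
    // Rotate lanes: (x, y, z, w) -> (y, z, x, w)
    const __m128 aYzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 bYzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    // c holds the result rotated by one lane: (cross.z, cross.x, cross.y, 0)
    const __m128 c = _mm_sub_ps(_mm_mul_ps(a, bYzx), _mm_mul_ps(aYzx, b));
    // Rotate once more to get (cross.x, cross.y, cross.z, 0)
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}
```

Note that the dot product needs a horizontal sum across lanes at the end, which is extra shuffle work on top of the multiply, while the cross product stays lane-parallel throughout.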

My tests were done in the following form (pseudo code):

Vector vec1(1, 2, 3), vec2(3, 4, 5);
// 2^power iterations (written as a shift, since ^ is XOR in C++)
for (i = 0; i < (1 << power); ++i)
    vec1 = vec1 + vec2;
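The loop above could be timed roughly like this with std::chrono. This is a minimal sketch, not the actual test code: Vec3 is a hypothetical stand-in for the non-SIMD Vector class, and timeAdditions and TimedResult are names invented for the example.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the non-SIMD Vector class.
struct Vec3 {
    float x, y, z;
};

inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// Final vector plus the elapsed wall-clock time of the loop.
struct TimedResult {
    Vec3 value;
    double milliseconds;
};

// Runs 2^power vector additions and measures how long the loop takes.
inline TimedResult timeAdditions(unsigned power) {
    Vec3 vec1{1.0f, 2.0f, 3.0f};
    const Vec3 vec2{3.0f, 4.0f, 5.0f};
    const std::uint64_t iterations = std::uint64_t{1} << power;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        vec1 = vec1 + vec2;
    const auto stop = std::chrono::steady_clock::now();

    // The accumulated vector is returned so the caller can use it (print it,
    // check it), which gives the compiler a reason to keep the loop.
    return {vec1, std::chrono::duration<double, std::milli>(stop - start).count()};
}
```

A test driver would call timeAdditions for each power, sum the milliseconds fields per operation, and do the same for the SIMD variant.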

In total, 10 tests are run for each operation with increasing iteration counts: the first run does ~1 million iterations and the last ~1 billion. We time each run and add the times up to a total per operation (SIMD and non-SIMD). Finally, we divide the non-SIMD total by the SIMD total to get how many times faster the SIMD version is. I've also made sure that the compiler did not optimize away the loops. Here are the results on my system (the CPU is what matters most here!):

Windows 7, Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 8 GB of RAM.

Operation             SIMD (ms)   Non-SIMD (ms)   SIMD speedup
Addition              5630.32     6291.36         1.11x
Multiplication        6768.39     7177.41         1.06x
Division              6857.39     10171.6         1.48x
Dot product           3508.2      3502.2          0.983x
Cross product         3510.2      10938.6         3.11x
Matrix / vector mul   14908.9     18225           1.22x

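There is clearly still some room for improvement! The compiler is very smart and will use SIMD operations on normal (non-SIMD) code wherever it can. However, in most of these tests it doesn't beat handwritten SIMD code; the dot product is the exception, where both versions take about the same time.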

OpenCL:

I will not go into too much detail about OpenCL. The basic idea is that you write platform-independent code that can run on different kinds of "Compute" devices such as CPUs, GPUs, etc. The devices can be used in parallel to perform tasks like physics simulations, weather prediction, A.I., you name it! What I've done so far is:

Detect OpenCL devices on the machine
Allocate memory on these devices
Compile program

I am basically writing a library so I never have to worry about OpenCL setup related stuff ever again! This library can then be re-used in other projects too. For more information on OpenCL here are some useful links

My plan for the coming weeks is to finish up the Math library and finalize the OpenCL library. After this I will start working on a lightweight threading library that I can use in my renderer.

Stay tuned for more next week (ish)! :)

Sunday, January 25, 2015

An Epic Tale Of Rendering Part 2: All the math's
