Sunday, January 25, 2015

An Epic Tale Of Rendering Part 2: All the math's

 Hello!

Second post coming up! This week's discussion is about my work on the Math and OpenCL libraries. I will start with the math.

Math:
So far I've implemented the following:
  1. Vector, Normal, Point and Vector4 classes
  2. Matrix4x4 class with all necessary transformations including Inverse, Transpose, Rotate, Scale, etc.
  3. Matrix4x4 and Vector4 are fully SSE optimized, and the Vector/Normal/Point classes all come in SSE and non-SSE versions (the non-SSE versions exist to save memory)
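
To make the memory trade-off concrete, here is a minimal sketch of the two layouts, assuming an x86 compiler that ships <xmmintrin.h>. The struct names Vec3Scalar and Vec3Sse are made up for illustration and are not the classes from my library.

```cpp
#include <xmmintrin.h>  // SSE: __m128

// Hypothetical plain layout: three floats, 12 bytes.
struct Vec3Scalar {
    float x, y, z;
};

// Hypothetical SSE layout: one 128-bit register, the fourth lane is padding.
struct Vec3Sse {
    __m128 xyzw;
};

// The SSE version pays 4 extra bytes per vector and needs 16-byte alignment.
static_assert(sizeof(Vec3Scalar) == 12, "three packed floats");
static_assert(sizeof(Vec3Sse) == 16, "one 128-bit register");
static_assert(alignof(Vec3Sse) == 16, "SSE types are 16-byte aligned");
```

For large arrays of points or normals, the extra lane adds up to a third more memory, which is why a non-SSE variant can still be worth having.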

The first (and most important) decision I had to make was: "To SIMD or not to SIMD". I decided on the full SIMD approach. SIMD stands for Single Instruction Multiple Data. The main idea is that a single CPU instruction processes multiple data elements at once. Here's some pseudo code to explain:

1: Normal version
float x1 = 2.0, x2 = 4.0;
float y1 = 3.0, y2 = 6.0;
float z1 = 4.0, z2 = 8.0;
float w1 = 5.0, w2 = 10.0;
float add = 23.0;

// 8 arithmetic instructions: 4 multiplies and 4 adds (excluding =)
float resultX = x1 * x2 + add;
float resultY = y1 * y2 + add;
float resultZ = z1 * z2 + add;
float resultW = w1 * w2 + add;

2: SIMD version
float4 xyzwOne(2.0, 3.0, 4.0, 5.0);
float4 xyzwTwo(4.0, 6.0, 8.0, 10.0);
float4 add(23.0);
// 2 instructions: multiply, add
float4 result = xyzwOne * xyzwTwo + add;

// Or just use 1 instruction where the CPU supports a fused multiply-add: mul_add
float4 fusedResult = mul_add(xyzwOne, xyzwTwo, add);
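
The two-instruction SIMD version can be written with real SSE intrinsics roughly like this. It is a minimal sketch, assuming an x86 compiler that ships <xmmintrin.h>; the names mulAdd and mulAddExample are made up for illustration and are not the API of my library. Note that plain SSE has no single multiply-add instruction, so this sketch uses one multiply and one add; a true fused version needs the FMA extension (e.g. _mm_fmadd_ps).

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_mul_ps, _mm_add_ps, ...

// Computes a * b + c on four float lanes at once.
// Hypothetical helper name, chosen for this sketch only.
inline __m128 mulAdd(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);  // one multiply, one add
}

// Runs the numbers from the pseudo code and stores the four lanes in out.
inline void mulAddExample(float out[4]) {
    // _mm_setr_ps fills the lanes in x, y, z, w order.
    const __m128 xyzwOne = _mm_setr_ps(2.0f, 3.0f, 4.0f, 5.0f);
    const __m128 xyzwTwo = _mm_setr_ps(4.0f, 6.0f, 8.0f, 10.0f);
    const __m128 add = _mm_set1_ps(23.0f);  // 23 in every lane
    _mm_storeu_ps(out, mulAdd(xyzwOne, xyzwTwo, add));
}
```

Calling mulAddExample fills out with the same four values as resultX, resultY, resultZ and resultW in the normal version.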

For more in depth detail about this topic:
http://fastcpp.blogspot.de/
http://www.thomasdideriksen.dk/misc/SIMD/sse_in_real_applications.pdf
http://download.intel.com/design/PentiumIII/sml/24504301.pdf
etc...

For the fun of it I also implemented normal non SIMD versions of the most common data types for memory usage and (speed)test purposes.

The next step was to test for correctness and speed. The most important and fun part of using SIMD is the potential speed gains we can get! For this I wrote some unit tests to compare various commonly used operations:
  1. Vector Addition
  2. Vector Multiplication
  3. Vector Division
  4. Dot Product between two vectors
  5. Cross Product between two vectors
  6. Matrix4x4 / vector multiplication
My tests were done in the following form (pseudo code):

Vector vec1(1, 2, 3), vec2(3, 4, 5);
for (i = 0; i < (1 << power); ++i)  // 2^power iterations
  vec1 = vec1 + vec2;

In total, 10 tests are done for each operation with increasing computational time, i.e. we start with ~1 million iterations and end with ~1 billion. We time our code and add the timings up to a total time for each operation (SIMD and non-SIMD). Finally, we divide the Non-SIMD total by the SIMD total to get the factor by which the SIMD version is faster. I've also made sure that the compiler did not optimize away the loops. Here are the results on my system (the CPU is most important here!):
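
A single timed run of this kind can be sketched as follows. This is an illustration under assumptions, not my actual test code: the Vec3 struct and the timeAdditionMs and speedup functions are hypothetical names, and only the scalar addition case is shown.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical non-SIMD vector, for this sketch only.
struct Vec3 {
    float x, y, z;
};

inline Vec3 operator+(Vec3 a, Vec3 b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// Runs 2^power additions and returns the elapsed time in milliseconds.
// The final vector is written to result so the loop is not dead code.
inline double timeAdditionMs(unsigned power, Vec3& result) {
    Vec3 vec1{1.0f, 2.0f, 3.0f};
    const Vec3 vec2{3.0f, 4.0f, 5.0f};
    const std::uint64_t iterations = std::uint64_t{1} << power;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        vec1 = vec1 + vec2;
    }
    const auto stop = std::chrono::steady_clock::now();

    result = vec1;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Speedup factor: values above 1 mean the SIMD version was faster.
inline double speedup(double nonSimdMs, double simdMs) {
    return nonSimdMs / simdMs;
}
```

Summing timeAdditionMs over the ten power values gives one total per operation, and speedup turns the two totals into a factor.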

Windows 7, Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 8 GB of RAM.
  1. SIMD / Non-SIMD addition:            5630.32 ms /  6291.36 ms -> SIMD 1.11x faster
  2. SIMD / Non-SIMD multiplication:      6768.39 ms /  7177.41 ms -> SIMD 1.06x faster
  3. SIMD / Non-SIMD division:            6857.39 ms / 10171.6 ms  -> SIMD 1.48x faster
  4. SIMD / Non-SIMD dot product:         3508.2 ms  /  3502.2 ms  -> SIMD 0.983x (slightly slower)
  5. SIMD / Non-SIMD cross product:       3510.2 ms  / 10938.6 ms  -> SIMD 3.11x faster
  6. SIMD / Non-SIMD matrix / vector mul: 14908.9 ms / 18225 ms    -> SIMD 1.22x faster
There is clearly still some room for improvement! The compiler is very smart and will use SIMD operations on normal (non-SIMD) code wherever it can. However, in most of these tests it doesn't beat handwritten SIMD code!
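
The cross product shows the biggest gain, so here is a sketch of one common way to write it with SSE shuffles. It assumes an x86 compiler that ships <xmmintrin.h>; the function name cross3 is made up for illustration, and this is not necessarily the exact code in my library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_shuffle_ps, _mm_mul_ps, ...

// Cross product of the x, y, z lanes of a and b (hypothetical helper name).
// The w lane of the result is a.w * b.w - a.w * b.w.
inline __m128 cross3(__m128 a, __m128 b) {
    // Rotate lanes: (x, y, z, w) -> (y, z, x, w)
    const __m128 aYzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 bYzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));

    // c holds the result rotated by one lane: (cross.z, cross.x, cross.y, w)
    const __m128 c = _mm_sub_ps(_mm_mul_ps(a, bYzx), _mm_mul_ps(aYzx, b));

    // Rotate once more to bring the lanes back into (x, y, z, w) order.
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}
```

The whole operation is three shuffles, two multiplies and one subtract, compared with six multiplies and three subtracts in the scalar form.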

OpenCL:

I will not go into too much detail about OpenCL. The basic idea is that you can write platform-independent code that runs on different kinds of "Compute" devices like CPUs, GPUs, etc. The devices can be used in parallel to perform tasks like physics simulations, weather prediction, A.I., you name it! What I've done so far is:
  1. Detect OpenCL devices on the machine
  2. Allocate memory on these devices
  3. Compile program
I am basically writing a library so I never have to worry about OpenCL setup code ever again! This library can then be re-used in other projects too. For more information on OpenCL, here are some useful links:
  1. https://www.khronos.org/message_boards/forumdisplay.php/87-OpenCL
  2. https://www.evl.uic.edu/kreda/gpu/image-convolution/
  3. http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%200&referringTitle=OpenCL%20Tutorials
  4. http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/introductory-tutorial-to-opencl/
  5. http://enja.org/2010/07/13/adventures-in-opencl-part-1-getting-started/
  6. http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/06-intro_to_opencl.pdf
  7. etc...
My plan for the coming weeks is to finish up the Math library and finalize the OpenCL library. After this I will start working on a lightweight threading library that I can use in my renderer.

Stay tuned for more next week (ish)! :)

