SSE for Cycles Perlin Noise #37963

Closed
opened 2013-12-27 12:47:12 +01:00 by Sv. Lockal · 24 comments
Member

This is my first patch targeted at Cycles optimization. This patch task was submitted to discuss a few ideas about optimization in Cycles.

To start with, I profiled a Cycles render with the well-known BMW1M-MikePan scene. Here are the results I got:
![before.png](https://archive.blender.org/developer/F57626/before.png)
As you can see, the slowest function is bvh_intersect_instancing, but it was already partially optimized with SSE intrinsics. However, the CPI (cycles per instruction) rate is still quite poor. There are too many branches, and compilers can't figure out what we want from them. For example, when a compiler sees `float t = f(a) && f(b);` it almost always emits a conditional branch, even if f(x) is a [pure function](https://en.wikipedia.org/wiki/Pure_function). This is how the `&&` operator works in C. The same goes for the ternary operator.
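As a minimal illustration (not code from the patch; the function names below are made up), compare a scalar test that must short-circuit with a branchless SSE equivalent:

```
#include <xmmintrin.h>

/* Scalar version: && must short-circuit, so compilers usually emit a
 * conditional branch even when both operands are cheap and side-effect free. */
static inline float scalar_both_positive(float a, float b)
{
	return (a > 0.0f) && (b > 0.0f);  /* evaluates to 0.0f or 1.0f */
}

/* Branchless SSE version: the comparisons produce bit masks that are
 * combined with a bitwise AND, so no branch is generated. */
static inline __m128 sse_both_positive(__m128 a, __m128 b)
{
	__m128 zero = _mm_setzero_ps();
	return _mm_and_ps(_mm_cmpgt_ps(a, zero), _mm_cmpgt_ps(b, zero));
}
```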

Perlin noise was the second function in this list: no SSE, no branches in the source, but in fact the msvc/gcc/clang compilers generated a lot of branches there. With SSE intrinsics it became 60%-80% faster and the generated code is 100% linear, i.e. branch-free (checked with [GCC Explorer](http://gcc.godbolt.org/)).
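For reference, here is a rough sketch of the kind of branchless, 4-wide building blocks this approach relies on (the fade curve and lerp of classic Perlin noise). This is only an illustration of the technique, not the actual code from perlinsse.patch:

```
#include <xmmintrin.h>

/* 4-wide Perlin fade curve 6t^5 - 15t^4 + 10t^3, evaluated without branches. */
static inline __m128 fade_sse(__m128 t)
{
	__m128 c6  = _mm_set1_ps(6.0f);
	__m128 c15 = _mm_set1_ps(15.0f);
	__m128 c10 = _mm_set1_ps(10.0f);
	__m128 t3  = _mm_mul_ps(_mm_mul_ps(t, t), t);
	/* t^3 * (t * (t*6 - 15) + 10) */
	__m128 inner = _mm_add_ps(_mm_mul_ps(t, _mm_sub_ps(_mm_mul_ps(t, c6), c15)), c10);
	return _mm_mul_ps(t3, inner);
}

/* 4-wide linear interpolation: a + t*(b - a). */
static inline __m128 lerp_sse(__m128 t, __m128 a, __m128 b)
{
	return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}
```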

![after.png](https://archive.blender.org/developer/F57635/after.png)

The overall result confirms the speedup: on my 12-core CPU, BMW1M-MikePan now takes 2:02 instead of 2:13.

Here is the patch: [perlinsse.patch](https://archive.blender.org/developer/F57633/perlinsse.patch). I tested it with the Qt Unit Test system (archive with the Qt project: [perlin_unit_test.7z](https://archive.blender.org/developer/F57671/perlin_unit_test.7z)), and the optimized code replicates the behavior of the non-optimized one. It could be optimized further with FMA and AVX intrinsics, but that wouldn't be very useful, because the BF and Linux distros release binary builds only for common SSE(2) CPUs.

There is no SSE code in Cycles SVM yet, as far as I know. Is it even acceptable for SVM to contain platform-specific code?

Author
Member

Changed status to: 'Open'

Author
Member

Added subscriber: @Lockal

Author
Member

Added subscribers: @brecht, @ThomasDinges

Member

Added subscriber: @MartijnBerger

Member

Very nice work. Will test the patch with MSVC 2013 and report the speedup on the BMW scene.

Member

I am getting an 11% speedup on the BMW on initial measurement when compiling with MSVC 2013.

I think the platform specific code should not be a problem.

About AVX: if the gain is large enough, it might also be useful to have that code in place for when we ship an AVX kernel.

What software are you using for the profiling?


Very nice, I like this. Perlin noise is a commonly used texture, so the optimization here is welcome, and I think we should not worry about SSE code here.

Will review more in depth later; just a quick observation: I don't think you need to include <xmmintrin.h>, that's already included in util_types.h.

Author
Member

I use Intel VTune Amplifier from Intel Parallel Studio. It has a [noncommercial version](http://software.intel.com/en-us/non-commercial-software-development) for Linux.

There is no simple way to provide SSE and AVX kernels simultaneously in a single Blender build because of [AVX-SSE transition penalties](http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf). These penalties may nullify the performance gain. But if we (or http://www.graphicall.org, for example) compile a separate version with AVX flags (and other VEX-prefixed instructions), AVX intrinsics could be useful. There is an obvious float-4 vector in Perlin, but it won't be easy to discover float-4 vectors in other slow functions. I still believe that bvh_intersect_instancing can be twice as fast just with better SSE usage.

I also forgot to mention that pure AVX has a very narrow application, because it does not contain instructions for integers. AVX2, on the other hand, contains such instructions, but AVX2 is supported only by very recent Intel CPUs.
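For what it's worth, if SSE and AVX code ever do end up mixed in one binary, the usual mitigation is to clear the upper YMM halves before leaving AVX code. A hedged sketch under that assumption (sum8_avx is a made-up helper, not Cycles code; compile with -mavx):

```
#include <immintrin.h>

/* Hypothetical helper: sum 8 floats with AVX, then clear the upper YMM
 * halves before returning to code compiled for plain SSE, so no AVX->SSE
 * transition penalty is incurred on Sandy Bridge / Ivy Bridge. */
static inline float sum8_avx(const float *v)
{
	__m256 x  = _mm256_loadu_ps(v);
	__m128 lo = _mm256_castps256_ps128(x);
	__m128 hi = _mm256_extractf128_ps(x, 1);
	__m128 s  = _mm_add_ps(lo, hi);
	s = _mm_hadd_ps(s, s);
	s = _mm_hadd_ps(s, s);
	float result = _mm_cvtss_f32(s);
	_mm256_zeroupper();  /* VZEROUPPER: avoid the transition penalty */
	return result;
}
```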


This is great. We can indeed add an AVX compiled kernel if it gives a good speedup (we have one for SSE 4.1 already btw, not just SSE 2). SSE code in SVM is fine.

For bvh_intersect_instancing we can still do SSE optimizations for triangle intersection, but it requires changing the BVH data structure and probably a different intersection algorithm. Further, there is indeed a bunch of branching (and unpredictable memory access) in there; maybe you can think of some tricks to reduce it, but most of it is probably simply required by the algorithm.


I can imagine that AVX may not end up being a net win immediately, but I'm not quite sure why you would want to compile a separate version to put on e.g. graphicall. We have a mechanism in Cycles that allows us to compile the kernel with different flags and put them all in one binary; it then chooses at runtime which kernel to use.
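To illustrate the general idea of runtime kernel selection (this is only a generic sketch, not Cycles' actual mechanism; the kernel names are placeholders), a dispatcher can look roughly like this:

```
#include <cstdio>

/* Hypothetical kernels; placeholder names, not Cycles' real entry points. */
static void kernel_sse2(void)  { std::printf("running SSE2 kernel\n"); }
static void kernel_sse41(void) { std::printf("running SSE4.1 kernel\n"); }
static void kernel_avx(void)   { std::printf("running AVX kernel\n"); }

/* Pick the best kernel at runtime; __builtin_cpu_supports is a
 * GCC/Clang builtin that queries CPUID. */
static void dispatch(void)
{
#if defined(__GNUC__)
	if (__builtin_cpu_supports("avx"))
		kernel_avx();
	else if (__builtin_cpu_supports("sse4.1"))
		kernel_sse41();
	else
#endif
		kernel_sse2();
}

int main() { dispatch(); }
```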

Member

I can also recommend this scene: http://blenderartists.org/forum/showthread.php?288611-free-download-archviz-blender-scene&highlight=benchmark+scene for benchmarking.

Also, I don't know if OSL has this optimization for Perlin noise. If not, it might be worthwhile to port it and submit it to them as well.

Author
Member

Brecht, nice to know about SSE 4.1. If Blender already contains code for CPU kernel selection, AVX is worth a try. Maybe there are relatively few AVX-SSE transitions so we can neglect these penalties.

The good thing about SSE intrinsics is that they guarantee the same behavior with different compilers and compiler options. For example, bvh_intersect_instancing starts with a bvh_inverse_direction call.

```
ccl_device_inline float3 bvh_inverse_direction(float3 dir)
{
	/* avoid divide by zero (ooeps = exp2f(-80.0f)) */
	float ooeps = 0.00000000000000000000000082718061255302767487140869206996285356581211090087890625f;
	float3 idir;

	idir.x = 1.0f/((fabsf(dir.x) > ooeps)? dir.x: copysignf(ooeps, dir.x));
	idir.y = 1.0f/((fabsf(dir.y) > ooeps)? dir.y: copysignf(ooeps, dir.y));
	idir.z = 1.0f/((fabsf(dir.z) > ooeps)? dir.z: copysignf(ooeps, dir.z));

	return idir;
}
```
This code with gcc results in 5 branches (3 branches with `-march=core-avx2`) and 43 instructions. It takes 7.5 seconds for the BMW1M-MikePan scene, according to the profiler.

Now look at this code:
```
__m128 bvh_inverse_direction(__m128 *dir)
{
	__m128 maxfloat = _mm_castsi128_ps(_mm_set1_epi32(0x67800000)); // +2^80 = 1/ooeps
	__m128 minfloat = _mm_castsi128_ps(_mm_set1_epi32(0xE7800000)); // -2^80 = -1/ooeps
	return _mm_min_ps(_mm_max_ps(_mm_div_ps(_mm_set1_ps(1.0f), *dir), minfloat), maxfloat);
}
```
Exactly 5 instructions with every compiler, and no branches. But such code is hard to read, and magic numbers are everywhere.
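As a sanity check, a minimal standalone comparison of the scalar and SSE clamping for a single component might look like this (an illustrative sketch only, not the Qt test attached above):

```
#include <cmath>
#include <cstdio>
#include <emmintrin.h>

/* Scalar clamp as in the ccl version above, for one component. */
static float inverse_scalar(float d)
{
	const float ooeps = std::exp2(-80.0f);  /* avoid divide by zero */
	return 1.0f / ((std::fabs(d) > ooeps) ? d : std::copysign(ooeps, d));
}

/* Branchless SSE clamp; the scalar is broadcast and the first lane returned. */
static float inverse_sse(float d)
{
	const __m128 maxf = _mm_castsi128_ps(_mm_set1_epi32((int)0x67800000));  /* +2^80 */
	const __m128 minf = _mm_castsi128_ps(_mm_set1_epi32((int)0xE7800000));  /* -2^80 */
	__m128 r = _mm_div_ps(_mm_set1_ps(1.0f), _mm_set1_ps(d));
	return _mm_cvtss_f32(_mm_min_ps(_mm_max_ps(r, minf), maxf));
}

int main()
{
	const float samples[] = {0.0f, -0.0f, 1e-30f, -1e-30f, 0.5f, -3.0f, 1e20f};
	for (float d : samples)
		std::printf("d=% g  scalar=% g  sse=% g\n", d, inverse_scalar(d), inverse_sse(d));
	return 0;
}
```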

Also, do we have any place in Blender to store unit-tests?
Member

@Lockal, please give your real name to Ton, he is asking for it on IRC.

also "Also, do we have any place in Blender to store unit-tests?" I think we should actually really do this. like you say magic is hard to read and tests will help keep the behaviour in check.


Wow, you managed to bring bvh_inverse_direction() down from 43 instructions to 5? This should give a noticeable boost.


Tested the Perlin SSE patch with bmw.blend (from the test SVN suite).

Intel Ivy Bridge, Linux, gcc: 2.07min >> 1.51min (200 samples)

Intel Sandy Bridge, Windows, MSVC 2008: 2.20min >> 1.59min (100 samples).

So this looks good to me, and could be committed I guess?


The Perlin patch looks good to commit to me, if the <xmmintrin.h> include is removed.

At some point we should refactor utility defines like FMA, BROADCAST_I, BROADCAST_F, and the shuffle functions in util_types.h into a util_simd.h file.


Committed the Perlin patch, many thanks @Lockal! https://developer.blender.org/rBa92abf5089e152d8d1b1fd95278b307f56b6a193

@brecht: Refactoring this sounds like a good idea, will look into it.


@brecht: Here is the patch for the refactor: [simd.diff](https://archive.blender.org/developer/F57811/simd.diff)

The SSE_ prefix is not needed for the defines, but as they are valid in the entire CCL namespace, maybe it's better to be careful.

Edit: Lockal suggested rewriting the macros as inline functions too?


Yes, making the macros inline functions is a bit nicer; it's also possible to drop the SSE_ prefix then, because there will be no conflict due to operator overloading. Otherwise it looks good to commit.
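To illustrate the difference (a hypothetical sketch, not the actual util_types.h or simd.diff code), compare a prefixed macro with a namespaced inline function:

```
#include <xmmintrin.h>

/* Macro version: works, but is untyped and needs a prefix to avoid
 * name clashes anywhere it is visible. */
#define SSE_SHUFFLE_F(a, x, y, z, w) \
	_mm_shuffle_ps((a), (a), _MM_SHUFFLE(w, z, y, x))

/* Inline-function version: type-checked, lives in the ccl namespace, and
 * the short name can coexist with others because overload resolution
 * picks the right one by argument type. */
namespace ccl {

template<int x, int y, int z, int w>
inline __m128 shuffle(const __m128 &a)
{
	return _mm_shuffle_ps(a, a, _MM_SHUFFLE(w, z, y, x));
}

}  /* namespace ccl */
```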


Added subscriber: @Geoffroykrantz


Crazy!!
Simple test with a cube and a plane, both with Perlin noise.
Without: 1:41.59
With: 0:39.87


Lockal: I profiled a bit on Windows, and on the BMW scene, bvh now takes 33% of the render time.

If I render hair though (hair.blend from here: https://svn.blender.org/svnroot/bf-blender/trunk/lib/tests/cycles/), I get 90%; I think optimizing this will give a very big boost. AFAIK hair is not SIMD-optimized at all yet?

Author
Member

Changed status from 'Open' to: 'Resolved'

Sv. Lockal self-assigned this 2013-12-30 09:17:00 +01:00
Author
Member

Thank you everyone, this report can be closed now. The patch was committed and my questions were answered. I'll push new patches via Differential, or directly to the master branch for trivial ones.
