SSE for Cycles Perlin Noise #37963

Closed
opened 2013-12-27 12:47:12 +01:00 by Sv. Lockal · 24 comments
Member

This is my first patch targeted at Cycles optimization. This patch task was submitted to discuss a few ideas about optimization in Cycles.

To start with, I profiled a Cycles render with the well-known BMW1M-MikePan scene. Here are the results I got:
![before.png](https://archive.blender.org/developer/F57626/before.png)
As you can see, the slowest function is bvh_intersect_instancing, but it was already partially optimized with SSE intrinsics. However, the CPI (cycles per instruction) rate is still quite poor. There are too many branches, and compilers can't figure out what we want from them. For example, when a compiler sees `float t = f(a) && f(b);` it almost always emits a conditional branch, even if f(x) is a [pure function](https://en.wikipedia.org/wiki/Pure_function). This is how the `&&` operator works in C. The same goes for the ternary operator.
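As a minimal illustration (not code from the patch; the function names below are made up), compare a scalar test that must short-circuit with a branchless SSE equivalent:

```
#include <xmmintrin.h>

/* Scalar version: && must short-circuit, so compilers usually emit a
 * conditional branch even when both operands are cheap and side-effect free. */
static inline float scalar_both_positive(float a, float b)
{
	return (a > 0.0f) && (b > 0.0f);  /* evaluates to 0.0f or 1.0f */
}

/* Branchless SSE version: the comparisons produce bit masks that are
 * combined with a bitwise AND, so no branch is generated. */
static inline __m128 sse_both_positive(__m128 a, __m128 b)
{
	__m128 zero = _mm_setzero_ps();
	return _mm_and_ps(_mm_cmpgt_ps(a, zero), _mm_cmpgt_ps(b, zero));
}
```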

Perlin noise was the second function in this list: no SSE, no branches in the source, but in fact the msvc/gcc/clang compilers generated a lot of branches there. With SSE intrinsics it became 60%-80% faster and the generated code is 100% linear, i.e. branch-free (checked with [GCC Explorer](http://gcc.godbolt.org/)).
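For reference, here is a rough sketch of the kind of branchless, 4-wide building blocks this approach relies on (the fade curve and lerp of classic Perlin noise). This is only an illustration of the technique, not the actual code from perlinsse.patch:

```
#include <xmmintrin.h>

/* 4-wide Perlin fade curve 6t^5 - 15t^4 + 10t^3, evaluated without branches. */
static inline __m128 fade_sse(__m128 t)
{
	__m128 c6  = _mm_set1_ps(6.0f);
	__m128 c15 = _mm_set1_ps(15.0f);
	__m128 c10 = _mm_set1_ps(10.0f);
	__m128 t3  = _mm_mul_ps(_mm_mul_ps(t, t), t);
	/* t^3 * (t * (t*6 - 15) + 10) */
	__m128 inner = _mm_add_ps(_mm_mul_ps(t, _mm_sub_ps(_mm_mul_ps(t, c6), c15)), c10);
	return _mm_mul_ps(t3, inner);
}

/* 4-wide linear interpolation: a + t*(b - a). */
static inline __m128 lerp_sse(__m128 t, __m128 a, __m128 b)
{
	return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}
```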

![after.png](https://archive.blender.org/developer/F57635/after.png)

The overall result confirms the speedup: on my 12-core CPU, BMW1M-MikePan now takes 2:02 instead of 2:13.

Here is the patch: [perlinsse.patch](https://archive.blender.org/developer/F57633/perlinsse.patch). I tested it with the Qt Unit Test system (archive with the Qt project: [perlin_unit_test.7z](https://archive.blender.org/developer/F57671/perlin_unit_test.7z)), and the optimized code replicates the behavior of the non-optimized one. It could be optimized further with FMA and AVX intrinsics, but that wouldn't be very useful, because the BF and Linux distros release binary builds only for common SSE(2) CPUs.

There is no SSE code in Cycles SVM yet, as far as I know. Is it even acceptable for SVM to contain platform-specific code?

Author
Member

Changed status to: 'Open'

Author
Member

Added subscriber: @Lockal

Author
Member

Added subscribers: @brecht, @ThomasDinges

Member

Added subscriber: @MartijnBerger

Member

Very nice work. Will test the patch with MSVC 2013 and report the speedup on the BMW scene.

Member

I am getting an 11% speedup on the BMW on initial measurement when compiling with MSVC 2013.

I think the platform specific code should not be a problem.

About AVX: if the gain is large enough, it might also be useful to have that code in place for when we ship an AVX kernel.

What software are you using for the profiling?


Very nice, I like this. Perlin noise is a commonly used texture, so the optimization here is welcome, and I think we should not worry about SSE code here.

Will review more in depth later; just a quick observation: I don't think you need to include <xmmintrin.h>, that's already included in util_types.h.

Author
Member

I use Intel VTune Amplifier from Intel Parallel Studio. It has a [noncommercial version](http://software.intel.com/en-us/non-commercial-software-development) for Linux.

There is no simple way to provide SSE and AVX kernels simultaneously in a single Blender build because of [AVX-SSE transition penalties](http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf). These penalties may nullify the performance gain. But if we (or http://www.graphicall.org, for example) compile a separate version with AVX flags (and other VEX-prefixed instructions), AVX intrinsics could be useful. There is an obvious float-4 vector in Perlin, but it won't be easy to discover float-4 vectors in other slow functions. I still believe that bvh_intersect_instancing can be twice as fast just with better SSE usage.

I also forgot to mention that pure AVX has a very narrow application, because it does not contain instructions for integers. AVX2, on the other hand, contains such instructions, but AVX2 is supported only by very recent Intel CPUs.
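For what it's worth, if SSE and AVX code ever do end up mixed in one binary, the usual mitigation is to clear the upper YMM halves before leaving AVX code. A hedged sketch under that assumption (sum8_avx is a made-up helper, not Cycles code; compile with -mavx):

```
#include <immintrin.h>

/* Hypothetical helper: sum 8 floats with AVX, then clear the upper YMM
 * halves before returning to code compiled for plain SSE, so no AVX->SSE
 * transition penalty is incurred on Sandy Bridge / Ivy Bridge. */
static inline float sum8_avx(const float *v)
{
	__m256 x  = _mm256_loadu_ps(v);
	__m128 lo = _mm256_castps256_ps128(x);
	__m128 hi = _mm256_extractf128_ps(x, 1);
	__m128 s  = _mm_add_ps(lo, hi);
	s = _mm_hadd_ps(s, s);
	s = _mm_hadd_ps(s, s);
	float result = _mm_cvtss_f32(s);
	_mm256_zeroupper();  /* VZEROUPPER: avoid the transition penalty */
	return result;
}
```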


This is great. We can indeed add an AVX compiled kernel if it gives a good speedup (we have one for SSE 4.1 already btw, not just SSE 2). SSE code in SVM is fine.

For bvh_intersect_instancing we can still do SSE optimizations for triangle intersection, but it requires changing the BVH data structure and probably a different intersection algorithm. Further, there is indeed a bunch of branching (and unpredictable memory access) in there; maybe you can think of some tricks to reduce it, but most of it is probably simply required by the algorithm.


I can imagine that AVX may not end up being a net win immediately, but I'm not quite sure why you would want to compile a separate version to put on e.g. graphicall. We have a mechanism in Cycles that allows us to compile the kernel with different flags and put them all in one binary; it then chooses at runtime which kernel to use.
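To illustrate the general idea of runtime kernel selection (this is only a generic sketch, not Cycles' actual mechanism; the kernel names are placeholders), a dispatcher can look roughly like this:

```
#include <cstdio>

/* Hypothetical kernels; placeholder names, not Cycles' real entry points. */
static void kernel_sse2(void)  { std::printf("running SSE2 kernel\n"); }
static void kernel_sse41(void) { std::printf("running SSE4.1 kernel\n"); }
static void kernel_avx(void)   { std::printf("running AVX kernel\n"); }

/* Pick the best kernel at runtime; __builtin_cpu_supports is a
 * GCC/Clang builtin that queries CPUID. */
static void dispatch(void)
{
#if defined(__GNUC__)
	if (__builtin_cpu_supports("avx"))
		kernel_avx();
	else if (__builtin_cpu_supports("sse4.1"))
		kernel_sse41();
	else
#endif
		kernel_sse2();
}

int main() { dispatch(); }
```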

Member

I can also recommend this scene: http://blenderartists.org/forum/showthread.php?288611-free-download-archviz-blender-scene&highlight=benchmark+scene for benchmarking.

Also, I don't know if OSL has this optimization for Perlin noise. If not, it might be worthwhile to port it and submit it to them as well.

Author
Member

Brecht, nice to know about SSE 4.1. If Blender already contains code for CPU kernel selection, AVX is worth a try. Maybe there are relatively few AVX-SSE transitions so we can neglect these penalties.

The good thing about SSE intrinsics is that they guarantee the same behavior with different compilers and compiler options. For example, bvh_intersect_instancing starts with a bvh_inverse_direction call.

```
ccl_device_inline float3 bvh_inverse_direction(float3 dir)
{
	/* avoid divide by zero (ooeps = exp2f(-80.0f)) */
	float ooeps = 0.00000000000000000000000082718061255302767487140869206996285356581211090087890625f;
	float3 idir;

	idir.x = 1.0f/((fabsf(dir.x) > ooeps)? dir.x: copysignf(ooeps, dir.x));
	idir.y = 1.0f/((fabsf(dir.y) > ooeps)? dir.y: copysignf(ooeps, dir.y));
	idir.z = 1.0f/((fabsf(dir.z) > ooeps)? dir.z: copysignf(ooeps, dir.z));

	return idir;
}
```
This code with gcc results in 5 branches (3 branches with `-march=core-avx2`) and 43 instructions. It takes 7.5 seconds for the BMW1M-MikePan scene, according to the profiler.

Now look at this code:
```
__m128 bvh_inverse_direction(__m128 *dir)
{
	__m128 maxfloat = _mm_castsi128_ps(_mm_set1_epi32(0x67800000)); // +2^80 = 1/ooeps
	__m128 minfloat = _mm_castsi128_ps(_mm_set1_epi32(0xE7800000)); // -2^80 = -1/ooeps
	return _mm_min_ps(_mm_max_ps(_mm_div_ps(_mm_set1_ps(1.0f), *dir), minfloat), maxfloat);
}
```
Exactly 5 instructions with every compiler, and no branches. But such code is hard to read, and magic numbers are everywhere.
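As a sanity check, a minimal standalone comparison of the scalar and SSE clamping for a single component might look like this (an illustrative sketch only, not the Qt test attached above):

```
#include <cmath>
#include <cstdio>
#include <emmintrin.h>

/* Scalar clamp as in the ccl version above, for one component. */
static float inverse_scalar(float d)
{
	const float ooeps = std::exp2(-80.0f);  /* avoid divide by zero */
	return 1.0f / ((std::fabs(d) > ooeps) ? d : std::copysign(ooeps, d));
}

/* Branchless SSE clamp; the scalar is broadcast and the first lane returned. */
static float inverse_sse(float d)
{
	const __m128 maxf = _mm_castsi128_ps(_mm_set1_epi32((int)0x67800000));  /* +2^80 */
	const __m128 minf = _mm_castsi128_ps(_mm_set1_epi32((int)0xE7800000));  /* -2^80 */
	__m128 r = _mm_div_ps(_mm_set1_ps(1.0f), _mm_set1_ps(d));
	return _mm_cvtss_f32(_mm_min_ps(_mm_max_ps(r, minf), maxf));
}

int main()
{
	const float samples[] = {0.0f, -0.0f, 1e-30f, -1e-30f, 0.5f, -3.0f, 1e20f};
	for (float d : samples)
		std::printf("d=% g  scalar=% g  sse=% g\n", d, inverse_scalar(d), inverse_sse(d));
	return 0;
}
```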

Also, do we have any place in Blender to store unit-tests?
Member

@Lockal, please give your real name to Ton, he is asking for it on IRC.

also "Also, do we have any place in Blender to store unit-tests?" I think we should actually really do this. like you say magic is hard to read and tests will help keep the behaviour in check.


Wow, you managed to bring bvh_inverse_direction() down from 43 instructions to 5? This should give a noticeable boost.


Tested the Perlin SSE patch with bmw.blend (from the test SVN suite).

Intel Ivy Bridge, Linux, gcc: 2.07min >> 1.51min (200 samples)

Intel Sandy Bridge, Windows, MSVC 2008: 2.20min >> 1.59min (100 samples).

So this looks good to me, and could be committed I guess?


The Perlin patch looks good to commit to me, if the <xmmintrin.h> include is removed.

At some point we should refactor utility defines like FMA, BROADCAST_I, BROADCAST_F, and the shuffle functions in util_types.h into a util_simd.h file.


Committed the Perlin patch, many thanks @Lockal! https://developer.blender.org/rBa92abf5089e152d8d1b1fd95278b307f56b6a193

@brecht: Refactoring this sounds like a good idea, will look into it.


@brecht: Here is the patch for the refactor: [simd.diff](https://archive.blender.org/developer/F57811/simd.diff)

The SSE_ prefix is not needed for the defines, but as they are valid in the entire CCL namespace, maybe it's better to be careful.

Edit: Lockal suggested rewriting the macros as inline functions too?


Yes, making the macros inline functions is a bit nicer; it's also possible to drop the SSE_ prefix then, because there will be no conflict due to operator overloading. Otherwise it looks good to commit.
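To illustrate the difference (a hypothetical sketch, not the actual util_types.h or simd.diff code), compare a prefixed macro with a namespaced inline function:

```
#include <xmmintrin.h>

/* Macro version: works, but is untyped and needs a prefix to avoid
 * name clashes anywhere it is visible. */
#define SSE_SHUFFLE_F(a, x, y, z, w) \
	_mm_shuffle_ps((a), (a), _MM_SHUFFLE(w, z, y, x))

/* Inline-function version: type-checked, lives in the ccl namespace, and
 * the short name can coexist with others because overload resolution
 * picks the right one by argument type. */
namespace ccl {

template<int x, int y, int z, int w>
inline __m128 shuffle(const __m128 &a)
{
	return _mm_shuffle_ps(a, a, _MM_SHUFFLE(w, z, y, x));
}

}  /* namespace ccl */
```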


Added subscriber: @Geoffroykrantz


Crazy!!
Simple test with a cube and a plane, both with Perlin noise.
Without: 1:41.59
With: 0:39.87


Lockal: I profiled a bit on Windows, and on the BMW scene, bvh now takes 33% of the render time.

If I render hair though (hair.blend from here: https://svn.blender.org/svnroot/bf-blender/trunk/lib/tests/cycles/), I get 90%; I think optimizing this will give a very big boost. AFAIK hair is not SIMD-optimized at all yet?

Author
Member

Changed status from 'Open' to: 'Resolved'

Sv. Lockal self-assigned this 2013-12-30 09:17:00 +01:00
Author
Member

Thank you everyone, this report can be closed now. The patch was committed and my questions were answered. I'll push new patches via Differential, or directly to the master branch for trivial ones.
