
20% to 140% Speedup for the OpenCL kernel
AbandonedPublic

Authored by mathieu menuet (bliblubli) on Sep 24 2016, 7:48 PM.

Details

Summary

This patch reassigns some nodes to other group levels; this alone brings a 60% performance improvement for group level 0 (used for clay renders, for example).
It also adds some more feature toggles, like transparent shadows, which are activated by default in 2.78 with no way to deactivate them anymore; this resolves a 10-15% speed regression.
Some new features were also added which bring a 140% speedup when rendering with micro displacement. This last change of course triggers some more kernel recompilation, but even including the kernel compilation times (which are also greatly reduced), the speed gain is still very high. The only downside is that a few more MB are going to be taken on the HDD, but at a time when a 3TB HDD costs 70€, that's not a real problem.

A one-week test has been done on BA and no reproducible bug has been reported so far: https://blenderartists.org/forum/showthread.php?407044-Improved-OpenCL-build-beta. Only one user reported problems while rendering with an HDRI, but nobody else could reproduce the bug and no file was submitted. You can also see the reported benchmark results there. Platforms tested by me and BA users were Linux and Windows, with different drivers and graphics cards from the GCN 1.0 generation up to the RX 480.

A good side effect is that some users reported it also solved some rendering problems on old-generation cards compared to master and RC2.

Diff Detail

Event Timeline

mathieu menuet (bliblubli) retitled this revision from to 20% to 140% Speedup for the OpenCL kernel.Sep 24 2016, 7:48 PM
mathieu menuet (bliblubli) updated this object.
mathieu menuet (bliblubli) set the repository for this revision to rC Cycles.
mathieu menuet (bliblubli) updated this revision to Diff 7497.
Brecht Van Lommel (brecht) requested changes to this revision.

The transparent shadows part seems fine to me. I need to look at the specifics of the changes with the grouping/features, but here are some initial comments.

intern/cycles/kernel/kernel_types.h
744

Was this intended to be included? Did you find cases where SoA helps?

Ideally we could remove SoA and simplify the code a lot (D2246).

intern/cycles/kernel/svm/svm.h
448

This line should be below # if NODES_FEATURE(NODE_FEATURE_VEC_CURVES).

This revision now requires changes to proceed.Sep 24 2016, 8:42 PM

Explained the strange parts of the code inline. SoA brings improvements after the changes this patch makes. Builds done with this patch allow quickly switching between SoA and AoS with the supported/experimental switch, for easy benchmarking.

intern/cycles/kernel/kernel_types.h
744

Yes, in the Barcelona scene for example, SoA brings about a 10% speedup over AoS. The Classroom scene is also better with SoA. Only together with the new node group levels is SoA better than AoS; this is why it's included in the patch. The change was made so it's easily switchable within one build, to allow easy benchmarking on different platforms and configurations.
According to my tests, in all non-architectural scenes AoS is faster than or as fast as SoA, which is why it's the default for the supported kernel. But gathering more SoA vs AoS results makes sense again before choosing one.
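For clarity, here is a minimal conceptual sketch of the two layouts being compared (illustrative only: the struct and field names are placeholders, not the actual ShaderData definition in the kernel):

#define NUM_RAYS 4096

/* AoS (array of structures): all fields of one ray are contiguous. */
typedef struct RayStateAoS {
  float P[3];   /* hit position */
  float N[3];   /* shading normal */
  float u, v;   /* surface parameters */
} RayStateAoS;
RayStateAoS state_aos[NUM_RAYS];

/* SoA (structure of arrays): the same field of many rays is contiguous,
 * which coalesces memory accesses better on some GPUs and worse on others,
 * hence the per-scene, per-device differences discussed here. */
typedef struct RayStateSoA {
  float P[3][NUM_RAYS];
  float N[3][NUM_RAYS];
  float u[NUM_RAYS];
  float v[NUM_RAYS];
} RayStateSoA;
RayStateSoA state_soa;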

intern/cycles/kernel/svm/svm.h
448

It was originally, but then surfaces with an RGB Curves node in their material would render black if no Vector Curves node was in the scene. So it's actually a bugfix. Although it looks strange to have an empty case, all my test files rendered correctly after this change.

Benchmarks seem fine, but the differences are smaller, not as big as in the tests on Windows; the drivers are probably quite different. Still very decent speedups.

Linux R9 380 render time:
BMW                  -4.1%
Fishy Cat            +1.4%
Pabellon Barcelona   -6.6%
Classroom            -13.3%
Koro                 -1.3%
intern/cycles/kernel/kernel_types.h
744

Ok, for testing the patch that can be useful, but we're not going to commit it that way. Personally I don't think we should keep a setting like this at all if it speeds up some scenes while slowing down others in a way that's unpredictable.

However, since you claim a 140% microdisplacement speedup, wouldn't that be related to this SoA/AoS change since that's also tied to the experimental toggle? Or is there something else in this patch responsible for that, because I don't see anything else obviously related?

intern/cycles/kernel/svm/svm.h
448

That doesn't make sense though: if moving this case statement one line lower doesn't work, then something must be broken. It's not an empty case, it will fall through to the next case.
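To illustrate the concern (a hypothetical sketch of the pattern under discussion, not the actual svm.h source; the handler and neighbouring case names are placeholders): with the case label placed above the feature guard, compiling the feature out does not leave a harmless empty case, because execution falls through into whatever case follows.

switch(node.x) {
  /* ... */
  case NODE_RGB_CURVES:
#if NODES_FEATURE(NODE_FEATURE_VEC_CURVES)
  case NODE_VECTOR_CURVES:
    svm_node_curves(kg, sd, stack, node, &offset);  /* placeholder handler */
    break;
#endif
  /* With NODE_FEATURE_VEC_CURVES compiled out, NODE_RGB_CURVES falls
   * through to here and runs the wrong node's code. */
  case NODE_SOME_OTHER_NODE:  /* placeholder for whatever case follows */
    /* ... */
    break;
  /* ... */
}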

By the way, we have to be careful with comparing render times between the 2.78 RCs and custom builds, since there have been Cycles optimizations in master and they might also affect performance. I don't know if that's the case here, just noting.

For example rB8f28441487d5: Cycles: Adaptive isolation, {b459d9f46c5355841aae3048c27cf778c0291566}, {0de69e56b43f194d3d79ad28c3cd6e49e88aa8d2}, {1558f5b6602ebe8ba31455b1837b2594d5c7f264} are not in RC2.

Thanks for the test on the 380, Brecht. Was it on Linux? With experimental (SoA) or supported (AoS)?

My tests were compared to the commit this diff is built upon, so all the master optimizations were in both builds. The huge speedup of 140% is on a micro displacement scene on Windows, reported here https://blenderartists.org/forum/showthread.php?407044-Improved-OpenCL-build-beta&p=3100839&viewfull=1#post3100839 and is compared against RC2. In my case, I got "only" 90% more performance using this scene on Linux: https://developer.blender.org/F360889 from this report https://developer.blender.org/T49371. I'll recheck using the latest buildbot to ensure no other optimization influenced the results.

Before this patch, Linux was faster than Windows in most scenes. After this patch, the differences are very small, so that may add to the performance improvement. It may also be due to newer drivers (I installed the newly published 16.9.2 to benchmark on Windows).

Most users on BA didn't specify their hardware, but it seems those who benefit the most from this patch are RX 480 owners.

@Brecht Van Lommel (brecht) On Windows (to get results comparable to BA and because most users are on Windows), rendering https://developer.blender.org/F360889 with one tile:
Using buildbot e12f5b6 (VS2013) it renders in 2:19.60 with 25.21s of preparation, so about 1min55s of GPU rendering.
Using the same commit plus this patch (VS2015) it renders in 1:18.15 with 18.44s of preparation, so about 1min of GPU rendering.
So it's 115s for master (e12f5b6) against 60s for master (e12f5b6 + patch), which allows rendering 92% more (nearly 2x as many) frames in the same time.
I'll investigate SoA's role further in this case.

Re-arranging node groups seems specious to me. Surely it's always possible to re-arrange them in a way that only the nodes used by a particular file are compiled in, but this is the wrong optimization route. First of all, the current groups are based on statistics from the Caminandes, Gooseberry and Tears of Steel shader networks. Second, while some trivial scenes will become faster, you'll use many more nodes in production shots.

Surely one might argue that per-node compilation is the way to go, but this would cause too many kernel recompilations and hence is not really acceptable.

Now, for the AoS/SoA. Surely I'm biased towards removing SoA in order to reduce code divergence between the OpenCL split kernel and the regular kernels, but still, the statistics here are not really convincing. It depends on the node set, ray access pattern, scene and everything. So this decision is kinda arbitrary, and I'm leaning towards simpler-to-maintain code.

P.S. Before going into a "but this gives XXX% speedup" discussion, I will state the obvious: we didn't even cover feature-selective CPU/CUDA compilation here. There will be quite some conflicting decisions to keep optimal performance ;) So there is no route which doesn't require compromise here.

First of all, I'm not the one who will complain here, as I'm able to update my branch and get this 2x speedup for microdisplacement whatever the foundation does with this patch.
Regarding the non-personal aspects of this patch:

  1. Code maintainability: I find those ifs in svm.h don't really make the code harder to maintain, because it's not an area that will change a lot and it's still simple to read.
  2. Kernel recompilation: Not really a problem, because this patch brings enough of a performance boost to make the 30s recompilation still beneficial in most cases. The only drawback I see is that it's going to take more space on the HDD, but I don't think that's a real issue these days. So what is unacceptable? Most people on BA install a new buildbot build every day, and this alone triggers a recompilation of all the kernels... so it's better for them if the recompilation is faster and brings more performance by adapting to the scene.

When I was talking about code maintainability I was mainly referring to the idea of keeping both SoA/AoS. This is already quite painful. And more annoyingly, it is totally unpredictable for artists. For development it's also quite a PITA: for example, with SoA it's next to impossible to come up with clear code for things like D2249.

The issue with svm.h is that your current optimization is kinda aimed at a particular file. It is impossible to pre-calculate an order of nodes which will speed up all scenes. It is likely nodes will be added (also, we've got similar things with closures), so maintaining optimal performance in the case where the compiler just converts it to a bunch of if-else statements is not nice or easy.

Kernel recompilation is a problem for modelling/shading/lighting guys. They will never accept delay in the viewport. As you mention, there are people who update Blender quite often; this means they will get latency in viewport updates every single day. Surely it is possible to compile all the features in for the viewport, but then it's also no good for artists, because then they will never know whether their shader network will be fast or slow in the final render.

As for speeding up kernel compilation: the only thing we can do is multi-thread it. Then, assuming you can compile in 11 threads, you'll have the latency of a single compiled kernel. Unfortunately, this is already far too much for the viewport (on certain hardware configurations it takes up to 15s). We can't speed up single-kernel compilation because we don't develop the drivers.

I would rather stop increasing the number of cases where we demand artists wait for kernels to be recompiled; it's never gonna be friendly.


Since I'm not the only developer in the area, there might be others who are really up for such fine-grained feature compilation. But then, before continuing down this route, let's apply it to all the devices, especially CUDA. That will make the competition between CUDA and OpenCL much more interesting (doing feature-adaptive CUDA gives really reasonable improvements to speed and VRAM usage ;)

I don't have a lot of time right now, but I would like to understand better where the speed differences are coming from and hopefully figure out a way to keep them with as little code complexity as possible.

If microdisplacement can be 90% faster, that's definitely interesting to try to keep. But we should understand better why it happens: whether it's the SoA/AoS thing, some specific SVM nodes that when left out make things much faster, the transparent shadow code, or some combination. So far I have not been able to reproduce such a big speedup on Windows, but I have not tested much yet.

It might be that there is still the problem that AMD can't generate jump tables and so must do up to 77 comparisons for all nodes, and moving some nodes to another place in the list or disabling nodes might be giving a speedup? And if that's a big part of it, we could try to use a binary search.

Regarding kernel recompilation: Why not add a "feature-adaptive compilation" option in the Performance tab? When you need fast viewport previews, you can disable it and live with the lost speedup, and when you do the final F12 renders, you can activate it because a few sec of kernel compilation don't matter for hour-long renders.
To be precise, I wouldn't completely disable adaptive compilation - options like "have volume" and "have hair" don't change that often. But especially the SVM node adaptiveness could easily be an option.

Oh, and a note regarding multithreaded OpenCL: Sadly, it's not as easy as it sounds. I have a patch that supports multithreaded compilation, but all OpenCL frameworks (Intel, NVidia, AMD) serialize the compilation internally. I've found a way around this - add a function to the Cycles Python module that creates a context, compiles the kernel and dumps it to a .cubin, and then have each compilation thread launch <own binary> -b --python-expr "import _cycles; _cycles.opencl_compile(<options>)". Of course, that has some limitations, but it could be supported for Cycles Standalone as well.
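As a rough host-side sketch of that per-process trick (the command layout follows the description above, but the function name, binary path handling and option string are assumptions, not existing Cycles code):

#include <stdio.h>
#include <stdlib.h>

/* Compile one kernel variant in a separate Blender process, so the driver's
 * internal serialization of kernel builds does not block the other variants.
 * A real implementation would launch these concurrently, one per variant,
 * and collect the dumped binaries afterwards. */
static int compile_kernel_variant(const char *blender_binary, const char *options)
{
  char cmd[1024];
  snprintf(cmd, sizeof(cmd),
           "%s -b --python-expr \"import _cycles; _cycles.opencl_compile(%s)\"",
           blender_binary, options);
  return system(cmd);
}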

With AoS this microd.blend file is 2% faster to render (so something like 95% faster than master). If SoA is such a hassle, kick it :) It brings a 10% performance boost in some scenes, but as you say, it's unpredictable.

I'll try in the coming days to progressively activate features in this scene to see the impact on performance. For this, Lukas's patch will be very helpful :)

It might be that there is still the problem that AMD can't generate jump tables and so must do up to 77 comparisons for all nodes, and moving some nodes to another place in the list or disabling nodes might be giving a speedup? And if that's a big part of it, we could try to use a binary search.

Not sure I understand what you meant there, but if there is a way to automate what I manually did (add and remove nodes and recompile to see their impact on rendering speed), I'm for it. It would save the devs a good amount of time and may help find the best combination for each configuration. Would you do the binary search at runtime, or would you do some profiling and hardcode the best organization?

I had the occasion to test on a 280X today; it rendered the microd.blend file with a scale factor of 2 for adaptive subdivision (the default scale of 1 gave an out-of-memory error). Render times were the following:
2min07 on master, 1min52 on patched Blender.
What is really surprising is that with the exact same scale setting of 2, the RX 480 renders in 2min13 with the patched version, which means the 280X is about 19% faster than the RX 480 when rendering with micro displacement, even with the patch. Without the patch, it's more like 2x faster than the RX 480. On all other benchmark files, the RX 480 is about 20-40% faster than the 280X.
I think it means that this patch prevents a slowdown by reducing register pressure. It would be great to have a direct contact at AMD on this. It may be a bug in the compiler that can be fixed upstream.
Does the 280X have more registers, thus supporting more inlining?

Another point is that, like Brecht, I see similar gains on older architectures with this patch. More subtle than on the RX 480, but still pretty good, with 6 to 14% in my tests using AoS.

Little update: Hawaii-architecture-based cards also greatly benefit from this patch, with impressive speedups: see https://blenderartists.org/forum/showthread.php?400121-AMD-RX-480-with-8-Gigs-of-Vram-at-229&p=3100843&viewfull=1#post3100843 .
I did some tests to see which nodes trigger a big slowdown on the RX 480. On microd.blend at 64spp with one tile I get these rendering times (without scene preparation):

Settings | render time | speed impact
original file | 14.31s (23.76s total, 9.45s of which is scene preparation) | (baseline)
lightpath | 16.53s | -15%
lightpath + RGBRamp + transparentBSDF + Vector curves | 19.04s | -33%
lightpath + RGBRamp + transparentBSDF + Vector curves + Holdout | 20.08s | -40%
lightpath + RGBRamp + transparentBSDF + Vector curves + Holdout + Wireframe | 20.78s | -45%
lightpath + RGBRamp + transparentBSDF + Vector curves + Holdout + Wireframe + Blackbody + RGB_mix | 21.80s | -52%

Normally, this last combination should make the built kernel the same as in master, but the performance is better than master despite the same features being compiled. Or maybe I missed something. Most nodes added alone have very little impact on performance; only Lightpath, Wireframe and Blackbody had a big impact when added alone. Lightpath is used pretty often, but Wireframe and Blackbody aren't, so it may be a good idea to make them features even if the whole patch is not accepted.

Is there a more direct way to contact AMD than their report form? Is the programmer paid by AMD already known?

Thanks for the more detailed tests. That's really odd, I wouldn't expect enabling those shader nodes to make things incrementally slower like that. I can imagine a single shader node causing some specific issue, or enabling a lot of them adding some slowdown, but not all those different ones with such big differences for each.

I would hope that things like register spilling are determined by the biggest node, and that adding more smaller nodes would not make it worse. There is the extra cost of possibly one extra if/else comparison if the switch is not optimized with a jump table, but that should not be so high. So I wonder what the compiler is doing here.

I don't think it's been announced yet who will do the OpenCL split kernel work funded by AMD. Hopefully once that gets going there will be some time to get this kind of thing figured out together with AMD devs.

Is it register spilling or if-else hell? Did anyone try to do a binary search?

I think the question was raised before; a binary search is something like:

if(node < NODE_MATH) {
  if(node < NODE_MIX) {
    /* handle nodes from NODE_CLOSURE_BSDF to NODE_MIX */
  }
  else {
    /* handle nodes from NODE_MIX to NODE_MATH */
  }
}
else {
  /* similar tricks */
}

When there's a handful of nodes remaining after bisecting, we can fall back to a switch() statement. Would be an interesting experiment anyway.


@Sergey Sharybin (sergey) If you write it, I can test and report. I have no time at the moment for programming and I'm still learning how Cycles works.

@Brecht Van Lommel (brecht) Adding some nodes even reduces render times. That's why some nodes like Light Falloff and Ambient Occlusion are "bundled" with Bump in this patch, for example. It would definitely be great to have contact with the team working on the compiler. They have opened their Linux driver to a large extent; maybe the OpenCL compiler is also available, to help understand what is happening?

I used this script

with my patch to get render times with different nodes. Because combinations also have an impact, it's hard to test every possible combination (with 84 nodes, we would have to test 2^84 combinations). So I just tested adding single nodes to see their impact when rendering the default cube at 1024spp.
The workflow is to start a D2254-patched Blender, choose Cycles as the render engine, set samples to 1, and run the script once to compile all kernels. Then render a second time with 1024spp.
It is best to do it with D2264 applied, to speed up kernel compilation.
The problem at the moment is that the best method to get pure render times is to read them from the console output, which isn't really easy. So if someone has a solution to expose render time to the Python API through an operator, a custom property or whatever, it would really help analyze the times.

I tested with a Radeon RX 480 today, on Windows 10 with Radeon driver version 16.9.2 (same as you used). Using the attached microd.blend file:

  • Your blenderartists build: 2m39s
  • My build without patch: 2m28s
  • My build with patch: 2m19s

I currently don't have good ideas for why you saw such a big speedup, while it's much smaller here with a very similar configuration.


Hi Brecht. Now that's weird. Did you test with buildbots and/or the 2.78 release? I don't have Win10, so I tried with Linux to see if it could come from Win7. With AMDGPU-PRO 16.3 I get:
4min02 with buildbot 3e460b6 and file's default tile size
3min55 with buildbot 3e460b6 and one tile
2min09 with master patched with D2254 and one tile.




Maybe your build has some patches, or Win10 has a more advanced OpenCL compiler? Sometimes one installer contains different versions of the driver/compiler depending on the system it's launched on.

So with the RX 480 on Linux I can confirm a big speedup with this patch, driver amdgpu-pro 16.30.3.

2.78            4:28.92
master          4:27.73
master + patch  2:31.54

Since it renders in similar time without the patch on Windows, I guess there is a difference in the OpenCL compiler. I'll do some testing to better understand this performance behavior.

Two pieces of good news :) many users on BA and I were not dreaming, and AMD fixed the problem in its latest compiler. Now it would be good if they could ship this new compiler for all platforms. Could we have access to some dev drivers through Mai? Or ask during the BConf, as a dev from AMD will be there. It would be good if people working on the OpenCL kernel could have a direct email address, to stop guessing about what is on AMD's side.
Anyway, this patch gives good performance improvements on all cards and platforms. Around 5-15% in my case if we exclude the compiler bug.

Hi,

There is a simple explanation for this performance boost.

The big switch / case in svm.h is very inefficient and should be replaced by an array of functions.

As this part of the code is called millions of times, millions of comparisons can be avoided.

bliblu bli (bliblubli) has mitigated this problem by placing the most used nodes at the top of the list.

One more note:
Some compilers may optimise the switch / case, some others may not.
I think it's up to devs to optimize this code in order to get predictable results.

@deom damien (dams), the big switch might affect performance, but at best it explains a few % of the performance improvements from this patch. From my initial tests it is really the contents of some node functions that seem to be slowing things down.

On the CPU it's quite standard for the compiler to generate a jump table, and CUDA seems to handle it fine as well. I think this is really the job of the AMD compiler and we shouldn't have to handle it manually, but if there's no other way we can add a workaround.

OpenCL does not support function pointers, so we can't use that. As mentioned above, we could try doing a binary search using if/else; that would give us about log2(77) ≈ 6.3 comparisons per node. Someone would probably have to write a script to generate that code, because it's tedious and error-prone to write manually.
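A rough sketch of what such a generator could look like (illustrative only: the node names, the run-length threshold and the emitted layout are assumptions). It recursively splits the node enum range with if/else and falls back to a plain switch() for small runs, as suggested above:

#include <stdio.h>

/* Emit nested if/else dispatch code for nodes[lo..hi-1] (assumed to be listed
 * in ascending enum order), bottoming out in a small switch(). */
static void emit_dispatch(const char **nodes, int lo, int hi, int indent)
{
  if(hi - lo <= 4) {
    printf("%*sswitch(node.x) { /* handle %s .. %s */ }\n",
           indent, "", nodes[lo], nodes[hi - 1]);
    return;
  }
  int mid = (lo + hi) / 2;
  printf("%*sif(node.x < %s) {\n", indent, "", nodes[mid]);
  emit_dispatch(nodes, lo, mid, indent + 2);
  printf("%*s}\n%*selse {\n", indent, "", indent, "");
  emit_dispatch(nodes, mid, hi, indent + 2);
  printf("%*s}\n", indent, "");
}

int main(void)
{
  /* Hypothetical subset of the SVM node enum, in ascending order. */
  const char *nodes[] = {
    "NODE_CLOSURE_BSDF", "NODE_CLOSURE_EMISSION", "NODE_MIX_CLOSURE",
    "NODE_TEX_IMAGE", "NODE_MIX", "NODE_MATH", "NODE_RGB_CURVES",
    "NODE_LIGHT_PATH", "NODE_END",
  };
  emit_dispatch(nodes, 0, (int)(sizeof(nodes) / sizeof(nodes[0])), 0);
  return 0;
}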

Hi,
There is a simple explanation for this performance boost.
The big switch / case in svm.h is very inefficient and should be replaced by an array of functions.
As this part of the code is called millions of times, millions of comparisons can be avoided.
bliblu bli (bliblubli) has mitigated this problem by placing the most used nodes at the top of the list.

If you have a patch that shows what you mean, with a good performance boost, please upload it to the tracker. This patch just optimizes the already chosen code architecture. My goals were:

  • to have a very fast group 0 (minimal kernel): in this case, used for clay renders for example, I get a 60% performance boost over master.
  • to avoid having too many recompiles until multi-threaded compilation is committed. So even though I could have gotten a bit more of a performance boost, I only kept the biggest wins.
  • to have good gains in production shots, which is also the case judging by the new times on the official benchmark set (up to 20% depending on SoA or AoS).

By the way, speaking of production shots, please render them with the supported kernel. Most default to experimental although it's not required anymore in current master; it makes SoA active, which degrades performance in all scenes but Classroom and Barcelona.

mathieu menuet (bliblubli) removed rC Cycles as the repository for this revision.
mathieu menuet (bliblubli) updated this revision to Diff 7683.

Switched SoA to the supported kernel and AoS to experimental. AoS is faster in most cases and the benchmark files are set to experimental by default; the point is to gather better time results.

@Brecht Van Lommel (brecht) added a comment.
@deom damien (dams), the big switch might affect performance, but at best it explains a few % of the performance improvements from this patch. From my initial tests it is really the contents of some node functions that seems to be slowing things down.

A good solution would then be to split this patch in two and test separately:

  • switch / case reorganisation
  • fast group() addition

@deom damien (dams) The performance boost in production shots definitely comes from the new node_features, because all the production shots use group level 3. I personally didn't notice any impact on performance from just moving cases around. The new organization of the groups is meant to allow optimal performance in these groups while trying to follow artists' usage (see the sketch after this list):

  • Group 0: clay render = minimal kernel plus the nodes that don't degrade performance.
  • Group 1: adds textures, Fresnel and Lightpath (because they are often used to mix closures), and hair.
  • Group 2: adds randomness through Object and Particle Info.
  • Group 3: the rest, with separate features for seldom-used nodes like Holdout and for nodes with a big impact on performance like Blackbody and Wireframe.
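As a minimal sketch of how a node handler ends up guarded by such a group level (following the NODES_GROUP/NODES_FEATURE macro pattern already referenced in this review; the specific node, handler and level shown here are just an example, not a claim about which group it actually lands in):

#if NODES_GROUP(NODE_GROUP_LEVEL_1)
  case NODE_TEX_IMAGE:
    svm_node_tex_image(kg, sd, stack, node);
    break;
#endif  /* only compiled in when the scene requests group level 1 or higher */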

You can see the individual benefit of each added feature by using this script: https://dev-files.blender.org/file/data/oec5d2vjxnn4ixtpoz5d/PHID-FILE-jiqhz6sznnfjmujw4wkd/test_single_nodes.py
Or manually add the nodes that are selectively compiled and compare times. They don't need to be connected. It is faster than doing it in the code and recompiling Blender every time.

@bliblu bli (bliblubli)

@deom damien (dams) The performance boost in production shots definitely comes from the new node_features, because all the production shots use group level 3. I personally didn't notice any impact on performance from just moving cases around. The new organization of the groups is meant to allow optimal performance in these groups while trying to follow artists' usage:

You say in one sentence that it does not impact performance, but in the next one that we get the best performance in different situations...
I just think that to identify the cause (or causes) of this performance boost, it might be helpful to split this patch into smaller parts.

Hope this helps,

Blender.org master download 64bit (not VS2015 compile)

bliblubli Opencl BETA build from his thread

SVN Master Built by me VS2015 today No bliblubli Patch

Specs:

Windows 10 Pro 64bit, All latest patches
Intel 2.8Ghz i5, 12 GB DDR3 Dual channel
AMD FirePro W9100 16GB GDDR5
AMD driver version: FirePro Software Version 16.2 (latest non-beta release drivers)

Let me know if any other testing is needed to help track down issues on this.

Sorry, also this is not micro displacement; Experimental is turned off.

Multires modifier and Displace modifier, modifiers not applied. 150 samples.


@3D Luver: Thanks for the times. It would be better if we could have the file, to compare times on different architectures, if that's OK for you. It would also allow me to see which nodes are causing the speed loss in master, although you use standard displacement.

@deom damien (dams) Sorry for my English. I meant that all benchmark files use group 3, so all groups are compiled, which means it's not important in this case which node is in which group (at least my tests show no difference in performance due to the ordering when group 3 is used). Only the node_features have a performance impact on the RX 480.
For non-benchmark files, if you render a scene with nodes from only group 0, the reorganization will have an impact, because all the nodes from groups 1, 2 and 3 will not get compiled.
I'll split the patch when I have time.

@bliblubli

I might be able to, if you agree not to share, distribute or use my setups for commercial use, as all these terrains are to be sold as asset packs with procedural terrain shaders.

I dropped you a PM on Blender Artists. If we can work out some protection for me, I'll give you a link to the file.

Also, as I'm new to Blender code, I'm having a nightmare trying to apply patches on Windows. I've tried command-line patch with -p1 and -p2, and also TortoiseSVN, which doesn't work either. How the hell do I patch the code, lol.


Of course I won't sell things I get for programming purposes. I never check my PMs on BA, sorry; I will have a look.
Regarding the patch, it's written right under the reviewers list at the top of this page: "arc patch D2254" from your source directory. For how to install and configure Arcanist: https://wiki.blender.org/index.php/Dev:Doc/Tools/Code_Review

Hey, hope this patch is still progressing.

I updated my build to master 2-3 hours ago, and even with the changes made to adaptive sampling etc., this patch still kills everything in sight for performance on Win 10 OpenCL GPU.

Master 21 mins, your patch 10 mins. Sometimes people don't need to know the how and when, just the outcome.

In 1928 Alexander Fleming discovered penicillin: he came back from holiday to find some mould on his cultures. Looking closer, he saw that the bacteria weren't growing around where the mould was, and so accidentally found the antibiotic penicillin. Sometimes you just need to accept the good luck in life. I'll take the speedup, thanks; Brecht and the top code guys can try to find out WHY in the long run.

Just commit the patch, guys :)

Split the patch into 3 parts with D2339, D2340 and D2341.