Home

Activity Feed

Today

kyriazis (George Kyriazis) added a comment to D1200: Cycles OpenCL kernel-splitting work.
In D1200#22424, @brecht wrote:

thanks for the comment. Hmm, if you split the kernel into a few (3-4) kernels because of svm_eval_nodes(), don't you end up needing to execute all of them anyway, because you have no guarantee that all the nodes you need will be in the same kernel?

For example, if you split into 3 kernels as follows: kernel 1 has cases ABC, kernel 2 cases DEF and kernel 3 cases GHI, then what happens if your shader requires nodes ADEG? You'll end up needing to go through all 3 kernels. (is that what you mean by splitting?)

I meant one kernel that has ABC, one that has ABCDEF, and one that has ABCDEFGHI, where GHI would be the nodes that requires the most registers, etc.

I was thinking that if the number of possible kernels that ever need to be compiled is some reasonable fixed number, then they could all be compiled and cache once, so compile time is not as a big a concern. But thinking about it more, there's probably other #defines influencing the SVM nodes too, so the number may be too big to cache it all beforehand.

Tue, Mar 31, 12:57 AM · Cycles
kyriazis (George Kyriazis) added a comment to D1200: Cycles OpenCL kernel-splitting work.
In D1200#22421, @brecht wrote:

Here's the issue: svm_eval_nodes() is a huge case statement. This translates into a huge if-then-else statement. the optimizer is trying to optimize the whole thing in one batch, causing compilation to be slow. In addition, the generated code is "less than optimal", causing register spills, etc.. By limiting the number of cases for the switch statement, we are making the compiler's life easier, and you end up not going through a sequence of if-then-if-then-if-then-.. too often.

That's interesting, I thought the compiler would be able to do similar optimizations as on the CPU, something like a jump table. I guess one way to speed it up would be to do a kind of binary search with nested if / else, that at least makes it logarithmic, though that's not ideal yet.

Tue, Mar 31, 12:52 AM · Cycles
campbellbarton (Campbell Barton) committed rB43b20d5b6c08: minor cleanup (authored by campbellbarton (Campbell Barton)).
minor cleanup
Tue, Mar 31, 12:50 AM
campbellbarton (Campbell Barton) committed rB3121619a7479: Double the strength of length-weight smoothing (authored by campbellbarton (Campbell Barton)).
Double the strength of length-weight smoothing
Tue, Mar 31, 12:48 AM
campbellbarton (Campbell Barton) committed rB0420384283e4: Optimize simple smoothing, approx 60% faster (authored by campbellbarton (Campbell Barton)).
Optimize simple smoothing, approx 60% faster
Tue, Mar 31, 12:41 AM
campbellbarton (Campbell Barton) committed rBf95d73fb5221: Comment tweaks (authored by campbellbarton (Campbell Barton)).
Comment tweaks
Tue, Mar 31, 12:41 AM
brecht (Brecht Van Lommel) added a comment to D1200: Cycles OpenCL kernel-splitting work.

thanks for the comment. Hmm, if you split the kernel into a few (3-4) kernels because of svm_eval_nodes(), don't you end up needing to execute all of them anyway, because you have no guarantee that all the nodes you need will be in the same kernel?

For example, if you split into 3 kernels as follows: kernel 1 has cases ABC, kernel 2 cases DEF and kernel 3 cases GHI, then what happens if your shader requires nodes ADEG? You'll end up needing to go through all 3 kernels. (is that what you mean by splitting?)

Tue, Mar 31, 12:37 AM · Cycles
kyriazis (George Kyriazis) added a comment to D1200: Cycles OpenCL kernel-splitting work.
In D1200#22419, @brecht wrote:

Regarding SVM nodes, my experience with CUDA was that for run time, it's only the "biggest" node that matters. Having many nodes isn't really a problem by itself, it's not going to make the code more divergent. But the biggest node determines how many registers you need, and how much data needs to be moved between registers and global memory, and that has an effect on code that you typically wouldn't think is affected.

That happens in general, if you have multiple types of geometric primitives, multiples types of samplers, .. you can add new ones without slowing down the kernel as long as their code is smaller than code for all of the existing ones.

So if the kernels are cached and keeping the amount of nodes small to reduce compile time is not the concern, then perhaps it makes sense make a few groups of nodes according to their size, instead of doing it for individual node types. Then you could have maybe 3 or 4 cached kernels which isn't too big to store on disk. I'm not sure if it's easy to find those groups, or if that would actually help with AMD OpenCL.

To avoid duplicate shader globals member definitions, the same ugly trick as done with kernel_textures.h can be used, make a file that declares all the members and then include it in various places with different macro definitions.

Tue, Mar 31, 12:32 AM · Cycles
sergey (Sergey Sharybin) added a comment to D1200: Cycles OpenCL kernel-splitting work.

@brecht, nice to hear from you!

Tue, Mar 31, 12:27 AM · Cycles
brecht (Brecht Van Lommel) added a comment to D1200: Cycles OpenCL kernel-splitting work.

Here's the issue: svm_eval_nodes() is a huge case statement. This translates into a huge if-then-else statement. the optimizer is trying to optimize the whole thing in one batch, causing compilation to be slow. In addition, the generated code is "less than optimal", causing register spills, etc.. By limiting the number of cases for the switch statement, we are making the compiler's life easier, and you end up not going through a sequence of if-then-if-then-if-then-.. too often.

Tue, Mar 31, 12:26 AM · Cycles
kyriazis (George Kyriazis) added a comment to D1200: Cycles OpenCL kernel-splitting work.
In D1200#22417, @sergey wrote:

Viewport is actually quite the same as final render set to "Progressive Refine" and that option is something artists want to have. If it's supported for final renders then viewport should be a piece of cake to support as well. So think rough plan in this context is something like:

  • Look into Progressive Refine for final renders
  • Once that work, look into what's remained to do for the viewport
  • Look into reducing delays

    Out of curiosity: what's the time difference bewteen kernel compiled with only needed nodes and kernel compiled with all the nodes? (i don't have AMD cards handy, and afraid intel CPU and nvidia's opencl are not really best things for benchmarks at this point).
Tue, Mar 31, 12:21 AM · Cycles
brecht (Brecht Van Lommel) added a comment to D1200: Cycles OpenCL kernel-splitting work.

Regarding SVM nodes, my experience with CUDA was that for run time, it's only the "biggest" node that matters. Having many nodes isn't really a problem by itself, it's not going to make the code more divergent. But the biggest node determines how many registers you need, and how much data needs to be moved between registers and global memory, and that has an effect on code that you typically wouldn't think is affected.

Tue, Mar 31, 12:18 AM · Cycles
lordloki (Jorge Bernal) added a comment to T41825: Effects in bge.

We can collaborate in both.

Tue, Mar 31, 12:13 AM · Game Engine
campbellbarton (Campbell Barton) committed rB71a75433eadc: remove asm, committed by accident (authored by campbellbarton (Campbell Barton)).
remove asm, committed by accident
Tue, Mar 31, 12:11 AM
campbellbarton (Campbell Barton) committed rBb5496c09c2b9: Minor edits, preparing for master (authored by campbellbarton (Campbell Barton)).
Minor edits, preparing for master
Tue, Mar 31, 12:09 AM
campbellbarton (Campbell Barton) committed rBfff8db369220: Show message when bind data is missing (authored by campbellbarton (Campbell Barton)).
Show message when bind data is missing
Tue, Mar 31, 12:09 AM

Yesterday

sergey (Sergey Sharybin) added a comment to D1200: Cycles OpenCL kernel-splitting work.

Viewport is actually quite the same as final render set to "Progressive Refine" and that option is something artists want to have. If it's supported for final renders then viewport should be a piece of cake to support as well. So think rough plan in this context is something like:

Mon, Mar 30, 11:46 PM · Cycles
lordloki (Jorge Bernal) closed T44192: Sensor acting as previous selected Sensor type as "Invalid".

This is not a bug.

Mon, Mar 30, 11:38 PM · Game Logic
kyriazis (George Kyriazis) added a comment to D1200: Cycles OpenCL kernel-splitting work.
In D1200#22413, @sergey wrote:

some comments here, and Lenny will reply on some other issues in more detail

In D1200#22394, @sergey wrote:

Made some preliminary check of the new code. There are some major issues which are to be addressed.

  • You totally violated data flow in rendering: kernels are to be loaded before doing device_update(), otherwise it's just asking for huge problems (i.e. memory allocation issues due to missing modules and so). I'm not sure why OpenCL is happy with this, but it's just crashing CUDA.

Is that the one that you said to ignore in a later comment?

No, the thing i said to ignore was about forced #undefs. The device_update() issue is still there. It might be rather simple solution: make it an utility function in ShaderMaager which will either give you all node types used or directly fill them into given device.

I'm still trying to find a better way to communicate such features to the device, but didn't came with anything better looking.

  • The things you're doing with nodes (selective SVM node compilation) is fine for final renders, but for viewport it'll need to be like a conjunction of all nodes ever used during the rendered viewport. This is because you don't want kernels to be re-{loaded,compiled} when you're tweaking shader tree.

Hmm.. The reason we are doing this is to simplify the kernel (similar stuff may need to be done inside the bsdf function) and allow for faster code and faster compilation. In your mind, if we were to recompile on-the-fly, what is the maximum delay (in seconds) that is acceptable between someone changing the material, and the preview starting?

The goal should be 0.0sec of delay :) It's a bit of research topic perhaps, but i think it worth trying to make it so once node was added to device it never gets removed during the BlenderSession. This way you can safely tweak all the connections in the shader tree without triggering kernel to be re-loaded.

On the other hand, if latency caused by kernel re-load (compiled kernel reload, let's keep aside compilation time, after some ours of work artists will have all sort of kernels compiled anyway :) is not measurable by human current code will work too.

Perhaps we'd better concentrate on the viewport render first and solve latency after that (when latency is visible it's easier to address it anyway).

How about allowing the option to use OpenCL for final rendering, while still going down the CPU path for preview?

Mon, Mar 30, 11:27 PM · Cycles
campbellbarton (Campbell Barton) committed rBff9834a951bf: Add in support for calculating the base-coords from the editmesh (authored by campbellbarton (Campbell Barton)).
Add in support for calculating the base-coords from the editmesh
Mon, Mar 30, 11:22 PM
sergey (Sergey Sharybin) updated the diff for D1208: Cycles: Point density texture experiment.

Fixed issues with RNA API

Mon, Mar 30, 11:20 PM
sergey (Sergey Sharybin) added a comment to D1200: Cycles OpenCL kernel-splitting work.

some comments here, and Lenny will reply on some other issues in more detail

In D1200#22394, @sergey wrote:

Made some preliminary check of the new code. There are some major issues which are to be addressed.

  • You totally violated data flow in rendering: kernels are to be loaded before doing device_update(), otherwise it's just asking for huge problems (i.e. memory allocation issues due to missing modules and so). I'm not sure why OpenCL is happy with this, but it's just crashing CUDA.

Is that the one that you said to ignore in a later comment?

Mon, Mar 30, 11:15 PM · Cycles
GiantCowFIlms updated subscribers of D1197: Add Equals To mode to all math nodes.
Mon, Mar 30, 11:14 PM
campbellbarton (Campbell Barton) updated subscribers of T44197: Cycles OpenCL kernel-splitting work.

@bliblubli, This thread is not for you to make value judgment on our work and tell us what you think our priorities should be. This task was set up so users could report how well the patch works.

Mon, Mar 30, 11:10 PM · Cycles, BF Blender
ben2610 (Benoit Bolsee) committed rB56283d0af8f1: BGE: setUniformEyef need automatic refresh on every frame. (authored by ben2610 (Benoit Bolsee)).
BGE: setUniformEyef need automatic refresh on every frame.
Mon, Mar 30, 11:07 PM