Cycles/SVM Stack optimization.
Open, Normal, Public

Description

SVM currently just picks random available sources in the graph, which results in it taking a rather unpredictable path through the graph, leading to suboptimal stack usage.

This patch uses node groups as a heuristic to generate the topological order for SVM; node groups will not be entered until all inputs to the group have been satisfied.
By doing this we do not have to solve for the optimal path through the graph with respect to stack usage (an NP-hard problem), and we still get decent results on some graphs.
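A rough sketch of the heuristic (illustrative Python only, not the actual Cycles code; node and group names are made up, and it assumes an acyclic graph whose groups don't depend on each other cyclically):

```python
# Illustrative sketch: schedule nodes so that a node group is never
# entered until all of its external inputs have been produced.
from collections import defaultdict

def schedule(nodes, deps, group_of):
    done, order = set(), []
    # External inputs of each group: dependencies of its members that
    # live outside the group itself.
    group_ext = defaultdict(set)
    for n in nodes:
        g = group_of.get(n)
        if g:
            group_ext[g] |= {d for d in deps[n] if group_of.get(d) != g}
    while len(order) < len(nodes):
        for n in nodes:
            if n in done or not deps[n] <= done:
                continue
            g = group_of.get(n)
            if g and not group_ext[g] <= done:
                continue  # entry barrier: the group's inputs aren't all ready
            order.append(n)
            done.add(n)
            break
    return order
```

With a hypothetical group G = {iter, mix} whose only external input is x, the always-ready iter node is held back until x has been emitted, which is the barrier behavior described above.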



Results:
MandelBrot_97.blend goes from just over 256 stack slots at peak pressure to 18.
mandelbrot_16k_nodes.blend: not sure how many slots it would need before, but it goes from un-runnable to fewer than 50 slots.

Pros of this method:
Works wonders on large graphs using nodegroups

Cons:
Does nothing to help 'flat' graphs. I've gone through most mentions of SVM stack issues in the tracker, and most of them have a classic tree layout with splitting branches; we could probably get those running by switching from randomly picking from the available sources to a depth-first search approach.
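A minimal sketch of what such a depth-first ordering could look like (hypothetical Python, not the Cycles implementation): emitting each input branch completely before starting the next lets intermediate results be freed earlier.

```python
def dfs_order(out_node, deps):
    """Depth-first post-order from the output: each input branch is
    emitted completely before the next one is started, so its
    intermediate results can be freed earlier."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for d in deps.get(n, ()):  # recurse into inputs first
            visit(d)
        order.append(n)  # emit the node after all of its inputs
    visit(out_node)
    return order
```

For a splitting-branch tree this yields one branch fully, then the next, instead of interleaving them the way random source picking can.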

Details

Type
Patch
LazyDodo (LazyDodo) updated the task description.
LazyDodo (LazyDodo) raised the priority of this task to Needs Triage.
LazyDodo (LazyDodo) set Type to Patch.

Interesting. Is there any difference in render performance? GPU might benefit from this?

Sergey Sharybin (sergey) triaged this task as Normal priority. Nov 26 2015, 10:43 AM

In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit. Also, unless I'm mistaken, it wouldn't help that much in cases where the groups themselves are connected in a tree-looking structure (as opposed to the more linear structure in the attached file).

Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well.

Interesting. Is there any difference in render performance? GPU might benefit from this?

No change in the graphs I have tested; the workload hasn't changed, just the order it gets executed in, so I wasn't expecting any perf benefits, to be honest.

Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well.

I agree that ordering things before compile is a good thing to do. The problem is that there is no known algorithm to quickly solve for minimal stack usage, so we're down to coming up with a good heuristic for large graphs. For smaller graphs you can brute-force your way through, but then again, smaller graphs are unlikely to run into stack issues to start with.
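For illustration, brute-forcing the minimal peak stack usage on a tiny graph could look like this (hypothetical Python; the O(n!) enumeration is exactly why it doesn't scale past a handful of nodes):

```python
from itertools import permutations

def peak_stack(order, deps, users):
    """Peak count of live values: a node's output stays on the stack
    from when it is emitted until its last user has been emitted."""
    live, peak = set(), 0
    pos = {n: i for i, n in enumerate(order)}
    for i, n in enumerate(order):
        live.add(n)
        # free inputs whose every user has now run
        for d in deps.get(n, ()):
            if all(pos[u] <= i for u in users[d]):
                live.discard(d)
        peak = max(peak, len(live))
    return peak

def best_order(nodes, deps):
    """Brute force over all topological orders: only feasible for
    tiny graphs, since there can be up to n! orderings."""
    users = {n: [m for m in nodes if n in deps.get(m, ())] for n in nodes}
    valid = lambda o: all(o.index(d) < o.index(n)
                          for n in o for d in deps.get(n, ()))
    return min((o for o in permutations(nodes) if valid(o)),
               key=lambda o: peak_stack(o, deps, users))
```

On a two-branch tree this finds the branch-at-a-time order; interleaving the branches needs more simultaneous slots.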

I thought about using Kernighan–Lin to partition the graph into smaller workable clusters, but figured: why am I trying to come up with a new partitioning scheme when I'm being handed a good partition on a silver platter (node groups)? So I tried that first.

A little background on how I got here: I had no idea how/what to contribute, noticed a bunch of the texture generation algorithms (wood etc.) were not available in Cycles, and figured that'd be a nice start. Then I bumped into the ticket by the Voronoi guy who got told 'we can't add nodes forever', so plan B: look into improving the overall infrastructure so we can do things like that with just nodes (so we can still have nice things without having to extend the Cycles kernels). I'm fairly sure that normal shader graphs will get very little perf benefit from this (although the speedup they got from the find_deps changes will be noticeable for them), but I hope it'll work wonders for the procedural guys out there. I started with a mandelbrot because, well, it's super easy to implement.

In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit.

That may be true for artists working on a final shader for rendering, but for creating procedural textures and other complexities to be used by those artists, node groups are essential. For procedurals alone, this patch is a godsend.

@LazyDodo (LazyDodo), I don't really see any difference with and without the patch. Even after applying the patch, MandelBrot_97.blend runs out of SVM stack here.

I built this patch before the constant folding stuff hit master. I don't have time right now to figure out why, but if you take out the folding in ShaderGraph::clean it'll work as intended.

In all honesty, the more I look at this, the more I want add_nodes in blender_shader.cpp to not flatten the graph (i.e. have a NodeGroupNode type for SVM). It's *a lot* easier to solve many tiny problems regarding stack usage; trying to solve the problem for a graph with 1000s of nodes is not really doable in reasonable time.

Also, once SVM knows about node groups, it'll probably be fairly easy to add a looping construct to SVM and get some procedural stuff going without having to resort to OSL on the CPU.

This is the direction I plan to explore, unless you guys go 'oh hell no! we'll never merge that!'?

This could be caused by the re-linking and storing the group ID in the socket. It'll be clearer to store the group in the node, I think. It could also become a more generic execution_group_id, which could be set either by the sync code based on the node group or by some other heuristic applied to the flat shader.

Function calls are not a magic bullet either; they just delay some problems to a new level and add other things to be solved:

  • Order of function calls (it's straightforward in the attached file due to more or less linear nature of the shader tree, in real files there are much more branches)
  • Requires a smart "when to inline" policy. You can't always do a function call when entering a node group; it'll slow things down for existing shaders from open movies, for example.

All in all, I'm not convinced the complexity in the compiler and kernel to support function calls is worth the trouble. This doesn't mean I think adding execution_group_id wouldn't help in optimizing the graph.

Yeah, while it's a nice proof of concept, I don't like the way it works; it complicates things. Switching to a depth-first search (to deal with the branching tree layout like the one in T45628) or any other smarter graph segmenting method would be nearly impossible with this extra metadata attached to nodes.

I'm not suggesting function calls in SVM. What this patch really does is insert a 'virtual' node that has all the inputs of the node group, so generate_svm_nodes doesn't enter the 'virtual' node group until all inputs are satisfied. What I'm suggesting is to stop faking it and just feed generate_svm_nodes an actual node group node, and when generate_svm_nodes encounters it as ready, emit bytecode for the whole block in one go. If someone calls the same node group 20 times in a row (or many more, like the mandelbrot), it should still generate 20x the more-or-less-same bytecode for SVM, not one block and a bunch of calls to it.

Doing it this way prevents generate_svm_nodes from executing a partial node group and starting work on some other nodes/groups, leaving a messy stack in the process. Take a look at the mandelbrot_1 node group in MandelBrot_97: the iteration counter is pretty much *always* ready, so it's really easy for generate_svm_nodes to keep executing it; however, its output is always used later on, so every time it runs it occupies a spot on the stack, until we naturally run out of stack space. This patch attempts to stop that behavior by adding an entry barrier.
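The counter pathology can be demonstrated with a toy model (illustrative Python; the ctr/iter node names are made up): a greedy "pick whatever is ready" order keeps every counter's output alive until its iteration runs, while the entry-barrier order keeps the live set constant.

```python
def peak_live(order, deps):
    """Peak count of simultaneously live outputs; an output is freed
    after its last consumer in `order` has been emitted."""
    last_use = {}
    for i, n in enumerate(order):
        for d in deps.get(n, ()):
            last_use[d] = i
    live, peak = 0, 0
    for i, n in enumerate(order):
        live += 1
        live -= sum(1 for d in deps.get(n, ()) if last_use[d] == i)
        peak = max(peak, live)
    return peak

N = 20  # chained iteration groups, each with an always-ready counter
deps = {f"iter{i}": [f"ctr{i}"] + ([f"iter{i-1}"] if i else [])
        for i in range(N)}

# Greedy "counter is always ready" order: all counters first, then the iters,
# so N counter outputs pile up on the stack before any of them is consumed.
greedy = [f"ctr{i}" for i in range(N)] + [f"iter{i}" for i in range(N)]
# Entry-barrier order: each group is entered only when its turn comes,
# so the live set stays constant regardless of N.
barrier = [x for i in range(N) for x in (f"ctr{i}", f"iter{i}")]
```

In this model the greedy order peaks at N live values while the barrier order peaks at 2, which mirrors the over-256-slots-to-18 result reported above.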

What I'm saying is: I like the barrier, I don't like the way it's implemented.

@LazyDodo (LazyDodo), I could see how passing an actual node group will help node compilation, but it's actually just moving complexity from one side to another. Going your way would make it more difficult to perform other graph optimizations, like constant folding or removing unconnected nodes and such.

Surely those parts of the optimization could be switched to the new actual-node-group paradigm, but this would also mean we wouldn't be able to actually instantiate shader groups, because different instances might be optimized in different ways (depending on their inputs, for example).

I'm also still unconvinced we should do such optimizations based on just node tree groups. Surely this will be a good start, but we should implement a much more generic way of dealing with the problem.

That being said, if you have a look at D1676 you'll see the implementation of the execution group concept I was talking about before. Currently it's being filled in based on the group instances, but nobody stops you from implementing some smart clustering algorithm for the planar graphs as well.