Cycles/SVM Stack optimization. #46872

Closed
opened 2015-11-25 18:26:07 +01:00 by Ray molenkamp · 21 comments
Member

SVM currently just picks random available sources in the graph which results in it taking an rather unpredictable path though the graph leading to sub optimal stack usage.

this patch uses nodegroups as a heuristic to generate the topological order for svm, nodegroups will not be entered until all the inputs to the group have been satisfied.
by doing this we do not have to solve the optimal path though the graph in regards to stack usage (NP hard problem) and still get decent results on some graphs.

NodegroupBarrier.diff
mandelbrot_16k_nodes.blend
MandelBrot_97.blend

Results:
MandelBrot_97.blend goes from just over 256 stack slots at peak pressure to 18.
mandelbrot_16k_nodes.blend , not sure how much it would use, but it goes from un-runnable to < 50 slots

Pro's to this method:
Works wonders on large graphs using nodegroups

Con's
Does nothing to help 'flat' graphs, i've gone though most mentions of svm stack issues in the database and most of them have a classic tree with splitting branches layout, we could probably get those running by switching from randomly picking from the available sources to a depth first search approach.

SVM currently just picks random available sources in the graph which results in it taking an rather unpredictable path though the graph leading to sub optimal stack usage. this patch uses nodegroups as a heuristic to generate the topological order for svm, nodegroups will not be entered until all the inputs to the group have been satisfied. by doing this we do not have to solve the optimal path though the graph in regards to stack usage (NP hard problem) and still get decent results on some graphs. [NodegroupBarrier.diff](https://archive.blender.org/developer/F259287/NodegroupBarrier.diff) [mandelbrot_16k_nodes.blend](https://archive.blender.org/developer/F259288/mandelbrot_16k_nodes.blend) [MandelBrot_97.blend](https://archive.blender.org/developer/F255030/MandelBrot_97.blend) Results: MandelBrot_97.blend goes from just over 256 stack slots at peak pressure to 18. mandelbrot_16k_nodes.blend , not sure how much it would use, but it goes from un-runnable to < 50 slots Pro's to this method: Works wonders on large graphs using nodegroups Con's Does nothing to help 'flat' graphs, i've gone though most mentions of svm stack issues in the database and most of them have a classic tree with splitting branches layout, we could probably get those running by switching from randomly picking from the available sources to a depth first search approach.
Author
Member

Changed status to: 'Open'

Changed status to: 'Open'
Author
Member

Added subscriber: @LazyDodo

Added subscriber: @LazyDodo
Member

Added subscriber: @Blendify

Added subscriber: @Blendify
Member

Added subscriber: @Sergey

Added subscriber: @Sergey

Added subscriber: @ThomasDinges

Added subscriber: @ThomasDinges

Interesting. Is there any difference in render performance? GPU might benefit from this?

Interesting. Is there any difference in render performance? GPU might benefit from this?

In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit. Also, unless i'm mistaken, it wouldn't help that much in cases when groups themselves are connected in a tree-looking structure (as an opposite to more linear structure in the attached file).

Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well.

In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit. Also, unless i'm mistaken, it wouldn't help that much in cases when groups themselves are connected in a tree-looking structure (as an opposite to more linear structure in the attached file). Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well.

Added subscriber: @brecht

Added subscriber: @brecht
Author
Member

In #46872#348436, @ThomasDinges wrote:
Interesting. Is there any difference in render performance? GPU might benefit from this?

No change in the graphs i have tested, the workload hasn't changed just the order it gets executed in so wasn't expecting any perf benefits to be honest.

Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well.

I agree that ordering things before compile is a good thing to do, problem is, there is no proven algorithm to quickly solve for minimal stack usage, so we're down to coming up with a good heuristic for large graphs, for smaller graphs you can brute force your way though but then again, smaller graphs are unlikely to run into stack issues to start with.

I thought about using Kernighan–Lin to partiion the graph in smaller workable clusters but, figured why am i trying to coming up with a new partitioning scheme when i'm handing svm a good partition on silver platter (nodegroups) so i tried that first.

Little Background on how i got here, i had no idea how/what to contribute, noticed a bunch of the texture generation algo's (wood etc) were not available in cycles, so figured that be a nice start, then bumped into the ticket by the voronoy guy who got told 'we can't add nodes forever' so plan b: 'look into improving the overall infrastructure so we can do things like that with just nodes' (so we can still have nice things without having to extend the cycles kernels). I'm fairly sure that normal shader graphs will get very little perf benefits from it (although they speedup they got from the find_deps changes will be noticable for them) but i hope it'll work wonders for the procedural guys out there. Started with a mandelbrot cause well it's super easy to implement.

> In #46872#348436, @ThomasDinges wrote: > Interesting. Is there any difference in render performance? GPU might benefit from this? No change in the graphs i have tested, the workload hasn't changed just the order it gets executed in so wasn't expecting any perf benefits to be honest. > Perhaps doing topological sort before compile and compiling nodes in that order will be a good improvement for the case demonstrated here and for production files as well. I agree that ordering things before compile is a good thing to do, problem is, there is no proven algorithm to quickly solve for minimal stack usage, so we're down to coming up with a good heuristic for large graphs, for smaller graphs you can brute force your way though but then again, smaller graphs are unlikely to run into stack issues to start with. I thought about using Kernighan–Lin to partiion the graph in smaller workable clusters but, figured why am i trying to coming up with a new partitioning scheme when i'm handing svm a good partition on silver platter (nodegroups) so i tried that first. Little Background on how i got here, i had no idea how/what to contribute, noticed a bunch of the texture generation algo's (wood etc) were not available in cycles, so figured that be a nice start, then bumped into the ticket by the voronoy guy who got told 'we can't add nodes forever' so plan b: 'look into improving the overall infrastructure so we can do things like that with just nodes' (so we can still have nice things without having to extend the cycles kernels). I'm fairly sure that normal shader graphs will get very little perf benefits from it (although they speedup they got from the find_deps changes will be noticable for them) but i hope it'll work wonders for the procedural guys out there. Started with a mandelbrot cause well it's super easy to implement.

Added subscriber: @SynaGl0w

Added subscriber: @SynaGl0w

In #46872#348440, @Sergey wrote:
In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit.

That may be true for artists working on a final shader for rendering, but for creating procedural textures and other complexities to be used by those artists, node groups are essential. For procedurals alone, this patch is a godsend.

> In #46872#348440, @Sergey wrote: > In my experience working on open movies, node groups aren't that heavily used actually, so for them such optimization wouldn't give much benefit. That may be true for artists working on a final shader for rendering, but for creating procedural textures and other complexities to be used by those artists, node groups are essential. For procedurals alone, this patch is a godsend.

@LazyDodo, i don't really see any difference with and without the patch. Even after applying the patch MandelBrot_97.blend runs out of SVM stack here.

@LazyDodo, i don't really see any difference with and without the patch. Even after applying the patch MandelBrot_97.blend runs out of SVM stack here.
Author
Member

I build this patch before the constant folding stuff hit master, Don't have time right now to figure out why, but if you take out the folding in ShaderGraph::clean it'll work as intended.

In all honesty, the more i look at this, the more i want add_nodes in blender_shader.cpp not the flatten the graph (Have a NodeGroupNode type for svm), it's A LOT easier to solve many tiny problems regarding stack usage, trying to solve the problem for a graph with 1000's of nodes, not really doable in reasonable time.

Also once SVM knows about nodegroups, it'll probably be fairly easy to add a looping construct to svm and get some procedural stuff going without having to revert to osl on the cpu.

This is the direction i plan to explore , unless you guys go 'oh hell no! we'll never merge that!' ?

I build this patch before the constant folding stuff hit master, Don't have time right now to figure out why, but if you take out the folding in ShaderGraph::clean it'll work as intended. In all honesty, the more i look at this, the more i want add_nodes in blender_shader.cpp not the flatten the graph (Have a NodeGroupNode type for svm), it's *A LOT* easier to solve many tiny problems regarding stack usage, trying to solve the problem for a graph with 1000's of nodes, not really doable in reasonable time. Also once SVM knows about nodegroups, it'll probably be fairly easy to add a looping construct to svm and get some procedural stuff going without having to revert to osl on the cpu. This is the direction i plan to explore , unless you guys go 'oh hell no! we'll never merge that!' ?

This could be caused by the re-linking and storing group ID in the socket.. It'll be more clear to store group in the node i think. Could also become more generic execution_group_id which could be set by either sync code based on the node gorup or some other heuristic applied for the flat shader.

Function calls are not magic bullet either, it is just delaying some problems to a new level and adding other things to be solved:

  • Order of function calls (it's straightforward in the attached file due to more or less linear nature of the shader tree, in real files there are much more branches)
  • Requires smart "when to inline policy". You can't always do a function call when entering a node group, it'll slow things down for existing shaders from open movies, for example.

All in all, i'm not convinced complexity in the compiler and kernel to support function calls worth the trouble actually. This doesn't mean i don't think adding exection_group_id wouldn't help in optimizing the graph.

This could be caused by the re-linking and storing group ID in the socket.. It'll be more clear to store group in the node i think. Could also become more generic `execution_group_id` which could be set by either sync code based on the node gorup or some other heuristic applied for the flat shader. Function calls are not magic bullet either, it is just delaying some problems to a new level and adding other things to be solved: - Order of function calls (it's straightforward in the attached file due to more or less linear nature of the shader tree, in real files there are much more branches) - Requires smart "when to inline policy". You can't always do a function call when entering a node group, it'll slow things down for existing shaders from open movies, for example. All in all, i'm not convinced complexity in the compiler and kernel to support function calls worth the trouble actually. This doesn't mean i don't think adding `exection_group_id` wouldn't help in optimizing the graph.
Author
Member

Yeah , while it's a nice proof of concept i don't like the way it works, it complicates things, switching to a depth first search (to deal with the branching tree layout like the one in #45628) or any other smarter graph segmenting method would be nearly impossible with this extra meta data attached to nodes.

I'm not suggesting function calls in SVM. What this patch really does is insert a 'virtual' node that has all the inputs of the nodegroup, so generate_svm_nodes doesn't enter the 'virtual' nodegroup until all inputs are satisfied. What i'm suggesting is stop faking it and just feeding generate_svm_nodes an actual nodegroup node. and when generate_svm_nodes encounters it as ready emit bytecode for the whole block in one go. If someone calls the same nodegroup 20 times in a row (or many more like the mandelbrot) it should still generate 20x the +-same bytecode for svm, not 1 block and a bunch of calls to it.

Doing it this way prevents generate_svm_nodes from executing a partial node group and starting work on some other nodes/groups, leaving a messy stack in the process. Take a look at the mandelbrot_1 nodegroup in mandelbrot_97, the iteration counter is pretty much ALWAYS ready, so it's really easy for generate_svm_nodes to keep executing it, however it's output is always used later on, so every time it runs, it occupies a spot on the stack until we're out of stack space naturally. This patch attempts to stop that behavior by adding an entry barrier.

What i'm saying is: I like the barrier, i don't like the way it's implemented.

Yeah , while it's a nice proof of concept i don't like the way it works, it complicates things, switching to a depth first search (to deal with the branching tree layout like the one in #45628) or any other smarter graph segmenting method would be nearly impossible with this extra meta data attached to nodes. I'm not suggesting function calls in SVM. What this patch really does is insert a 'virtual' node that has all the inputs of the nodegroup, so generate_svm_nodes doesn't enter the 'virtual' nodegroup until all inputs are satisfied. What i'm suggesting is stop faking it and just feeding generate_svm_nodes an actual nodegroup node. and when generate_svm_nodes encounters it as ready emit bytecode for the whole block in one go. If someone calls the same nodegroup 20 times in a row (or many more like the mandelbrot) it should still generate 20x the +-same bytecode for svm, not 1 block and a bunch of calls to it. Doing it this way prevents generate_svm_nodes from executing a partial node group and starting work on some other nodes/groups, leaving a messy stack in the process. Take a look at the mandelbrot_1 nodegroup in mandelbrot_97, the iteration counter is pretty much *ALWAYS* ready, so it's really easy for generate_svm_nodes to keep executing it, however it's output is always used later on, so every time it runs, it occupies a spot on the stack until we're out of stack space naturally. This patch attempts to stop that behavior by adding an entry barrier. What i'm saying is: I like the barrier, i don't like the way it's implemented.

@LazyDodo, could see how passing actual node group will help for node compilation, but it's actually just moving complexity from one side to another. Going your way would mean making it more difficult to perform other graph optimization -- like, constant fold or removing unconnected nodes and such.

Surely those parts of optimization could be switched to new actual-node-group paradigm, but this will also mean we wouldn't be able to actually instantiate shader group because different instanced might be optimized in a different manner (depending on their inputs, for example).

I'm also still inconviced we should do such optimizations based on just node tree groups. Surely this will be a good start, but we should implement much more generic way of dealing with the problem.

That being said, if you'll have a look into D1676 you'll see implementation of execution group concept i was talking before. Currently it's being filled in based on the group instances, but nobody stops you from implementing some smart clasterization algorithm for the planar graphs as well.

@LazyDodo, could see how passing actual node group will help for node compilation, but it's actually just moving complexity from one side to another. Going your way would mean making it more difficult to perform other graph optimization -- like, constant fold or removing unconnected nodes and such. Surely those parts of optimization could be switched to new actual-node-group paradigm, but this will also mean we wouldn't be able to actually instantiate shader group because different instanced might be optimized in a different manner (depending on their inputs, for example). I'm also still inconviced we should do such optimizations based on just node tree groups. Surely this will be a good start, but we should implement much more generic way of dealing with the problem. That being said, if you'll have a look into [D1676](https://archive.blender.org/developer/D1676) you'll see implementation of execution group concept i was talking before. Currently it's being filled in based on the group instances, but nobody stops you from implementing some smart clasterization algorithm for the planar graphs as well.

Added subscriber: @BartekMoniewski

Added subscriber: @BartekMoniewski

Added subscriber: @bartus-4

Added subscriber: @bartus-4

Added subscriber: @dfelinto

Added subscriber: @dfelinto

Changed status from 'Confirmed' to: 'Archived'

Changed status from 'Confirmed' to: 'Archived'
Dalai Felinto self-assigned this 2019-12-23 18:42:36 +01:00

Hi, thanks for your patch.

We are undergoing a Tracker Curfew where we are automatically closing old patches.

If you think the patch is still relevant please update and re-submit it. For new features make sure there is a clear design from the user level perspective.

Hi, thanks for your patch. We are undergoing a [Tracker Curfew ](https://code.blender.org/?p=3861) where we are automatically closing old patches. If you think the patch is still relevant please update and re-submit it. For new features make sure there is a clear design from the user level perspective.
Sign in to join this conversation.
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
8 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: blender/blender#46872
No description provided.