
Compositor improvement plan
Status: Confirmed · Priority: Normal · Visibility: Public · Type: Design

Assigned To
None
Authored By
Sergey Sharybin (sergey)
Mar 6 2020, 12:46 PM

Description

Overview

This is an initial pass at formalizing a proposal for improving the performance and user experience of the Compositor.

There are three aspects this proposal aims to address:

  • Ease of use
  • Performance
  • Memory consumption

Ease of use

Currently there are many settings which need to be set and tweaked to get the best performance: tile size, OpenCL, buffer use, and so on. These settings also have implicit dependencies under the hood: for example, OpenCL needs a big tile size, but a big tile size might make some nodes measurably slower, and a bigger tile size "breaks" the original intention of the compositor design, which was to show tiles appearing as quickly as possible.

In the OpenCL case it is also not clear when it is actually engaged: it fails silently, falling back to the CPU, giving the false impression of GPU-accelerated compute.

Performance

The compositor's performance is not up to modern standards. This is partly due to its scheduler design and partly due to the technical implementation, which makes a virtual call per pixel (ruining all sorts of coherency).

Memory usage

This is something that is entirely out of the artist's control: some operations require memory buffers before/after the node, and those are created automatically. This makes it hard to predict how much memory a node setup requires, and how much extra memory is needed when increasing the final resolution.
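For a sense of scale (my numbers, not from the proposal): a single full-float RGBA frame buffer is already large at common output resolutions, and every implicit intermediate buffer multiplies that cost. A quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope size of one intermediate buffer, assuming the
# compositor stores 32-bit float RGBA (4 channels x 4 bytes per pixel).
def buffer_mb(width, height, channels=4, bytes_per_channel=4):
    return width * height * channels * bytes_per_channel / (1024 * 1024)

uhd = buffer_mb(3840, 2160)       # roughly 126 MB per buffer at UHD
eight_k = buffer_mb(7680, 4320)   # roughly 506 MB per buffer at 8K
```

With a handful of automatically created intermediates, a node tree can easily consume gigabytes at high resolutions, which is exactly what the artist cannot currently predict or control.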

Solution

The proposed end goal is to deliver the final image as fast as possible.

This is different from the tile-based design, where the goal was to have the first tiles appear as quickly as possible, with gradual updates. The downside of that approach is that the overall frame time is higher than delivering the entire frame at once. Additionally, the tile-based nature complicates the task scheduler a lot and requires keeping track of more memory at a time.

It should be possible to transform the current design into the proposed one in incremental steps:

  • [Temporarily] Remove the GPU (OpenCL) support code, which unnecessarily complicates the scheduler and memory manager.
  • Convert all operations to operate in a relative space rather than pixel space (basically, make it possible to change the final render resolution without changing the compositor network setup).
  • Make the compositor operate at a resolution which closely matches the resolution of the current "viewer": there is no need to do full 8K compositing when the final result is viewed as a tiny backdrop on a Full HD monitor.
  • Modify operations to operate on an entire frame (or on a given area).
  • Modify the scheduler to do bottom-to-top scheduling, operating on the entire image.
  • Modify the memory manager to allocate buffers only once they are needed and discard them as soon as possible.
  • Vectorize (SIMD) all operations where possible.
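The scheduler and memory-manager steps above can be sketched together. This is a toy model with hypothetical names, not Blender's actual C++ implementation: it evaluates the graph demand-driven from the output ("bottom-to-top") on whole frames, and frees each buffer as soon as its last consumer has run:

```python
# Toy demand-driven full-frame evaluator with early buffer discard.
# Names and structure are illustrative only, not Blender's actual code.
from collections import Counter

class Node:
    def __init__(self, name, inputs=(), op=None):
        self.name = name
        self.inputs = list(inputs)
        self.op = op or (lambda *bufs: sum(bufs))  # placeholder full-frame op

def evaluate(output_node):
    # Count the consumers of every node so buffers can be freed early.
    refcount = Counter()
    stack, seen = [output_node], set()
    while stack:
        node = stack.pop()
        for inp in node.inputs:
            refcount[inp] += 1
            if inp not in seen:
                seen.add(inp)
                stack.append(inp)

    live = {}   # node -> computed "buffer" (a plain number in this toy)
    peak = [0]

    def compute(node):
        if node in live:
            return live[node]
        bufs = [compute(inp) for inp in node.inputs]
        live[node] = node.op(*bufs)
        peak[0] = max(peak[0], len(live))
        for inp in node.inputs:
            refcount[inp] -= 1
            if refcount[inp] == 0:
                del live[inp]   # last consumer done: discard the buffer
        return live[node]

    return compute(output_node), peak[0]

# A tiny chain: source -> blur -> viewer.
src = Node("image", op=lambda: 5.0)
blur = Node("blur", [src], op=lambda b: b + 1.0)
out = Node("viewer", [blur], op=lambda b: b * 2.0)
result, peak_buffers = evaluate(out)
# At most 2 buffers were alive at any one time for this chain.
```

The key property is that peak memory is bounded by the width of the live set during evaluation, not by the total node count, which is what makes memory usage predictable.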

Look into GPU support with the following requirements:

  • Minimize memory throughput, which implies the following point.
  • Have all operations implemented on the GPU, which again implies the following point.
  • Share the implementation between CPU and GPU as much as possible.
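One way to read the "minimize memory throughput" requirement (my illustration, not part of the task): when operations share one per-pixel implementation, chained operations can be fused so intermediate results never touch a full-frame buffer. A toy sketch:

```python
# Toy illustration of fusing chained per-pixel operations so the
# intermediate result never needs its own full-frame buffer.
# Operation names are made up for the example.

def exposure(p, stops):
    return p * (2.0 ** stops)

def invert(p):
    return 1.0 - p

def run_unfused(pixels, stops):
    # Two passes: writes, then re-reads, a full intermediate buffer.
    tmp = [exposure(p, stops) for p in pixels]
    return [invert(p) for p in tmp]

def run_fused(pixels, stops):
    # One pass: the intermediate value stays local, never materialized.
    return [invert(exposure(p, stops)) for p in pixels]

frame = [0.0, 0.25, 0.5]
assert run_unfused(frame, 1.0) == run_fused(frame, 1.0)  # identical output
```

On a GPU the same idea applies even more strongly: every avoided intermediate buffer is a full read plus write of the frame over the memory bus.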

The steps can be gradual and formulated well enough to happen as code quality days in T73586.

Event Timeline

Sergey Sharybin (sergey) changed the task status from Needs Triage to Confirmed.Mar 6 2020, 12:46 PM
Sergey Sharybin (sergey) created this task.

This is going to be a controversial suggestion given it's a chunky new dependency, however... Halide is a DSL designed for exactly this kind of problem (it takes into account locality, parallelism, vectorization, etc.). It's mature (8+ years old, still actively maintained, and in use by both Google and Adobe) and it supports CPU/GPU/SIMD/threading/Metal/OpenGL out of the box. It is based on LLVM, which generally has a high startup cost, but they mitigated that by allowing kernels to be pre-built at compile time (rather than runtime). It also has a Python API, so you can write kernels in Python (still with all the hardware support mentioned earlier), which could be good for us add-on-wise (but you're back to a small run-time cost at that point).

If nothing else, watch the video at the bottom of their home page to see what it is about. I feel this is a really good match for the compositor and should be considered.
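For readers unfamiliar with Halide: its core idea is separating the algorithm (what is computed) from the schedule (how the loops are ordered and parallelized). The following is a rough pure-Python analogy of that split, NOT Halide's actual API:

```python
# Pure-Python analogy of Halide's algorithm/schedule separation.
# This is not Halide code; it only illustrates the concept.

def blur_x(image, x, y):
    # Algorithm: 3-tap horizontal box blur, clamped at the borders.
    w = len(image[0])
    return (image[y][max(x - 1, 0)] + image[y][x] + image[y][min(x + 1, w - 1)]) / 3

def schedule_serial(fn, image):
    # Schedule 1: plain row-major loops.
    h, w = len(image), len(image[0])
    return [[fn(image, x, y) for x in range(w)] for y in range(h)]

def schedule_threaded(fn, image):
    # Schedule 2: rows go through a thread pool; the algorithm is untouched.
    from concurrent.futures import ThreadPoolExecutor
    h, w = len(image), len(image[0])
    with ThreadPoolExecutor() as ex:
        return list(ex.map(lambda y: [fn(image, x, y) for x in range(w)], range(h)))

img = [[0, 3, 6], [9, 12, 15]]
assert schedule_serial(blur_x, img) == schedule_threaded(blur_x, img)
```

In real Halide the schedule is expressed declaratively (vectorize, parallelize, target GPU, and so on) while the algorithm definition stays byte-for-byte identical, which is the property being praised in this thread.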

@Ray molenkamp (LazyDodo) Thanks for pointing it out. I am aware of Halide and was considering it for the tracker/compositor for a while. Nowadays I'm not that sold on it: with the function nodes it feels like we can fill in the missing gaps and achieve the same level of functionality with building blocks which are native to Blender. But we'll see.

I have experimented with it in the past. Where it really shines is how easily you can change a schedule: add or change data layout, threading, caching, vectorization, or go "now run it on the GPU! now do CUDA! now OpenCL! use DX12!" by changing a line or two of code. It is by far the best performing code I have written with the least amount of effort.

However, not everything they offer is a good match for us. Fusion across the whole node graph, while theoretically awesome, will just never work for us: the run-time cost to schedule and JIT is too high (once done, the performance is great... however, spending several seconds in scheduling and JIT so you can run a graph in under 3 ms ruins any gains to be had). If we use it, it would be best to treat every node as its own block and optimize that at build time, rather than trying to run-time optimize the graph the user puts together.

Function nodes are still too "unknown" for me to make a call on. I think there are pros and cons for both solutions; however, gaining a dependency as large as Halide is definitely hanging out on the con side of things.

If possible:
Update only the nodes that have been changed, plus the later connected nodes in the chain, avoiding the upstream ones.


That assumes you keep the output buffers of each of the nodes, which gets expensive memory-wise real fast if you are compositing at high resolution or have a massive number of nodes.


That's true as well.

To avoid keeping the output buffers of every node, perhaps a cache node would be a good first step, so the user could decide which points in the tree to cache (after a denoise node, for example), similar to the Houdini approach. Additionally, a freeze toggle on each node would help in case the user doesn't want to use the output of a file cache node in a different branch (or a different compositor, if multiple compositors become possible with a compositor node). This way anything below the file cache node would not need to be recalculated each time an upstream parameter changes.
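A minimal sketch of the cache-node idea described above, with hypothetical names (hash-based invalidation of the cached branch plus a freeze flag):

```python
# Toy cache node: recompute the upstream branch only when its parameters
# change; a "freeze" flag pins the cached result regardless.
# All names here are hypothetical, not Blender's API.
import hashlib

class CacheNode:
    def __init__(self, upstream_eval, upstream_params):
        self.upstream_eval = upstream_eval      # callable producing the buffer
        self.upstream_params = upstream_params  # callable returning current params
        self.frozen = False
        self._key = None
        self._cached = None
        self.recomputes = 0

    def output(self):
        if self.frozen and self._cached is not None:
            return self._cached   # pinned: ignore upstream changes
        key = hashlib.sha1(repr(self.upstream_params()).encode()).hexdigest()
        if key != self._key:      # upstream changed: recompute the branch
            self._cached = self.upstream_eval()
            self._key = key
            self.recomputes += 1
        return self._cached

params = {"blur_size": 4}
node = CacheNode(lambda: params["blur_size"] * 10, lambda: params)
assert node.output() == 40 and node.recomputes == 1
assert node.output() == 40 and node.recomputes == 1   # unchanged: served from cache
params["blur_size"] = 5
assert node.output() == 50 and node.recomputes == 2   # upstream change: recompute
node.frozen = True
params["blur_size"] = 6
assert node.output() == 50 and node.recomputes == 2   # frozen: cache kept
```

In a real implementation the hash would cover the whole upstream subtree (node settings, image sources, frame number), but the invalidation logic stays the same shape.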