
New compositor system proposal
Status: Needs Triage · Priority: Normal · Visibility: Public · Tag: DESIGN

Authored By
Manuel Castilla (manzanilla)
Oct 12 2020, 2:55 PM

Description

This is an idea/proposal for a new compositor system implementation. I've been implementing it in a branch called compositor-up and I still consider it experimental. My intention is to keep working on it; I'm not trying to push it into Blender now, but with the perspective of it being integrated in the future. I'm aware it's hard for this to be accepted, as I've done it on a whim without consulting any Blender developers, and if it is accepted a lot of requirements will come. I'm willing to meet any requirements once I've explained the reasoning behind what I did, so everyone can make a better decision.

I'd like to clarify that I haven't taken this idea or solution from anywhere. For better or worse, this solution comes from my own experience developing software in general and Blender itself, though I'm still using some ideas and code from the current compositor and Cycles.

Main Goals

  1. Improve overall performance while being efficient in memory usage (for caching, the user can choose how much RAM to use).
  2. Be able to cache at any point of the node tree and uniquely identify the cached image by the node tree state it was saved from, so that if the node tree changes and later returns to the state the cache was saved in, the cache can be retrieved.
  3. Make it easier to implement image algorithms and provide a way to write code that is compatible with both CPU and GPU execution, abstracting the compute system used for the GPU.

General Overview

This is a class diagram of the main classes

Buffering, writing and reading pixels

  • In the current Blender implementation, for many operations reads are executed pixel by pixel on every read (results are not saved in a buffer). So if several operations read from a previous operation in the tree (several connections to its output socket), each pixel is calculated as many times as there are readers. For complex operations this is not an issue, as they are surrounded by a WriteBufferOperation and a ReadBufferOperation which behave as buffers, but these buffers are not recycled or freed until the end of the execution. So memory usage can be very high for trees with lots of nodes with complex operations. The system is tile-based and executes the whole tree for every single tile, from output to input, in threads, so that it can show the tiles to the user as soon as possible.
  • The new system, instead of executing the tree in tiles, launches the writing jobs per operation and writes the entire operation result before the next operation is executed. The intention is to display the entire image result as soon as possible, to be able to cache operation results, and to have the whole result of an operation available for reading in the next operation, which makes it easier to implement image algorithms. Calculation and writing of pixels are always done on a pre-created/recycled buffer of the full operation size. For CPU writing, the operation is divided into as many rectangles as threads the system can execute for best performance; the user no longer needs to choose a "chunk size". For GPU writing, the work is divided depending on the best work group size for the GPU device. Writing is done exactly once for every operation, no matter how many readers there are. Operation buffers are always recycled once writing and all reads of the operation are finished. If an image algorithm needs extra buffers for its calculations, it can request them from the buffer recycler and must give them back once they are no longer needed. This avoids a lot of buffer allocations and deallocations, which would hurt performance and memory consumption.
  • The operation buffer class is TmpBuffer. This class is used for both CPU buffers and GPU (ComputeDevice) buffers. Depending on whether the operation is computed or not, BufferManager will automatically create operation buffers or map/unmap their memory from host to device and vice versa. When coding image algorithms you don't have to worry about this.
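The recycling behaviour described above can be sketched as follows. This is a toy model, not the branch's actual BufferManager/TmpBuffer code; the class and method names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Minimal sketch of a buffer recycler: returns any free buffer large
// enough for the requested size, or allocates a new one. Hypothetical
// API, for illustration only.
struct Buffer {
  std::vector<float> pixels;
  explicit Buffer(std::size_t n) : pixels(n) {}
};

class BufferRecycler {
  std::vector<std::unique_ptr<Buffer>> free_buffers_;

 public:
  // Take a buffer with room for at least `size` floats.
  std::unique_ptr<Buffer> take(std::size_t size) {
    for (auto it = free_buffers_.begin(); it != free_buffers_.end(); ++it) {
      if ((*it)->pixels.size() >= size) {
        auto buf = std::move(*it);
        free_buffers_.erase(it);
        return buf;
      }
    }
    return std::make_unique<Buffer>(size);  // none big enough: allocate
  }

  // Give a buffer back once all readers of its operation have finished.
  void give_back(std::unique_ptr<Buffer> buf) {
    free_buffers_.push_back(std::move(buf));
  }

  std::size_t free_count() const { return free_buffers_.size(); }
};
```

A linear chain of operations then keeps reusing the same couple of buffers instead of allocating one per node.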

Execution phases

The compositor tree execution is done in two phases:

  • First phase (OperationMode::Optimize): All the operations are executed in the same order as in the second phase, but the image algorithms that write the buffers are not executed. Instead, each operation just calls a ReadsOptimizer that counts all the reads the operation receives. Knowing how many reads an operation will get during the "Exec" phase makes it possible to know exactly when an operation's buffer is no longer needed (once an operation's read count reaches the count calculated during the "Optimize" phase, all its readers have read it and the buffer may be recycled). Furthermore, this phase is used to optimize cache loading: it registers the order in which caches are requested, so that during the Exec phase we can get the current operation's cache and launch a prefetch task for the next cached operation's buffer while other operations are being written. The following sequence diagram shows the application flow every time an operation's getPixels() is called during OperationMode::Optimize.

  • Second phase (OperationMode::Exec): All the operations are run normally, executing all image algorithms and writing the results to buffers. The following sequence diagram shows the application flow every time an operation getPixels() is called during OperationMode::Exec.

Note: The diagrams may not exactly match the function names and all the function calls involved in the real implementation, but they closely represent them. I simplified them for the sake of better understanding the flow of the application.

Using hashes as part of operation tree state IDs (OpKeys)

Using hashes as IDs is in most cases a bad idea because of hash collisions. Still, this whole system is based on that idea, because the benefits it brings in this scenario far outweigh the downside of accepting a minimal chance of an "OpKey collision" (an OpKey contains hashes) that would result in a bad render. An OpKey is a representation of an operation's tree state, taking into account all the parameters and operation types involved in producing the operation's image result. All operations generate their own OpKey when compositor execution is initialized.

Benefits of using hashes for building OpKeys:

  • They're automatically regenerated on every compositor execution, and if nothing has changed with respect to the previous execution they will be the same. There is no need to save them between executions.
  • We can cache at any point of the tree without having to implement UI logic that tells the compositor system "I have modified this part of the tree, so you should invalidate the caches ahead of it", and without having to save the previous node tree graph with all its parameters and diff it against the new graph to see if and where something changed.
  • If 2 or more operations have exactly the same tree with the same parameters from their point up to the root inputs (the same node tree state), they will be written only once, no matter that they are in different branches.
  • It's an easy way to uniquely identify operation results within and between compositor executions. For example, to save disk cache files I can just encode all the OpKey field values in the filename; on a cache request an OpKey is given, and from it the cache filename is known without having to maintain an index.
  • As caches are uniquely identified by values that represent the operation's node tree state at the moment they were saved, they never need to be invalidated (deleted) because of a change in the tree. If the tree ever returns to the same state as when the cache was saved, the cache will be retrieved, no matter what changes happened in between. The following video shows the working implementation; keep an eye on the progress bar:
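The disk-cache filename idea could look roughly like this. This is a sketch: `cache_filename` and the exact encoding are my assumptions; only the OpKey field names come from the proposal:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Sketch: derive a disk-cache filename directly from the OpKey fields,
// so no index file is needed to map keys to files. Hypothetical helper.
struct OpKey {
  int op_width, op_height, op_datatype;
  std::uint64_t op_type_hash, op_state_hash;
  std::uint64_t op_tree_type_hash, op_tree_state_hash;
};

std::string cache_filename(const OpKey &key) {
  std::ostringstream name;
  name << std::hex << key.op_width << '_' << key.op_height << '_'
       << key.op_datatype << '_' << key.op_type_hash << '_'
       << key.op_state_hash << '_' << key.op_tree_type_hash << '_'
       << key.op_tree_state_hash << ".cache";
  return name.str();
}
```

Given an OpKey from a cache request, the filename is fully determined, so lookup is just a file-existence check.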

These are the fields of an OpKey:

Note: The current OpKey implementation does not yet include some of these hash fields, but it will.

  • op_width -> operation width
  • op_height -> operation height
  • op_datatype -> type of image data (number of channels used per pixel)
  • op_type_hash -> return value of C++ typeid().hash_code() for the current operation class type
  • op_state_hash -> combined hash of the current operation's op_type_hash and all its parameter hashes.
  • op_tree_type_hash -> combined hash of the current operation's op_type_hash and those of all its input operations, recursively down to the last input in the tree.
  • op_tree_state_hash -> combined hash of the current operation (op_type_hash + parameter hashes) and those of all its input operations, recursively down to the last input in the tree.
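A minimal sketch of how such combined hashes can be computed recursively. The boost-style `hash_combine` mixer and the `OpNode` structure are illustrative assumptions, not the branch's code:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Boost-style hash combining: order-sensitive mixing of hash values.
inline void hash_combine(std::size_t &seed, std::size_t value) {
  seed ^= value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
}

// Sketch of an operation node; real operations hash their parameters
// via hashParams(). All names here are illustrative.
struct OpNode {
  std::size_t type_hash;                  // e.g. typeid(*this).hash_code()
  std::vector<std::size_t> param_hashes;  // one per relevant parameter
  std::vector<const OpNode *> inputs;     // input operations
};

// op_tree_state_hash: type + parameters + all inputs, recursively.
std::size_t tree_state_hash(const OpNode &op) {
  std::size_t seed = op.type_hash;
  for (std::size_t p : op.param_hashes) {
    hash_combine(seed, p);
  }
  for (const OpNode *in : op.inputs) {
    hash_combine(seed, tree_state_hash(*in));
  }
  return seed;
}
```

Two operations with identical subtrees and parameters get the same tree state hash; changing any parameter anywhere below an operation changes its hash.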

Every operation has an OpKey which uniquely identifies the operation's node tree state within and between compositor executions. For an OpKey collision to happen, all OpKey fields must be the same for 2 operations that should render different image results (because they have different node tree states). This is different from 2 operations in different node tree branches that have the same node tree state and produce the same image result: in that case the 2 operations having the same key is exactly what should happen, and the image result for both operations will be written only once. The following video demonstrates this last case in the working implementation (when duplicating the nodes with the same parameters, rendering should take longer if they were written twice):

For the system to update correctly on any change, every operation must implement the "hashParams" method and call "hashParam(param)" in it for every parameter that, if changed, would change the operation's image result. Otherwise, the system won't update when an unhashed parameter changes, because output views are cached and would have the same OpKey as in the previous execution. In practice you'd notice this sooner or later.
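A sketch of what the hashParams contract looks like. The base-class plumbing here is invented for illustration; only the method names hashParams/hashParam come from the proposal:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Sketch of the hashParams() contract: every parameter that affects the
// image result must be hashed, or stale cached views will be returned.
// The base-class machinery here is illustrative, not the branch's code.
class Operation {
 protected:
  std::size_t params_hash_ = 0;
  template <typename T> void hashParam(const T &param) {
    std::size_t h = std::hash<T>{}(param);
    params_hash_ ^= h + 0x9e3779b97f4a7c15ULL + (params_hash_ << 6) +
                    (params_hash_ >> 2);
  }

 public:
  virtual ~Operation() = default;
  virtual void hashParams() = 0;
  std::size_t paramsHash() const { return params_hash_; }
};

class GammaOperation : public Operation {
  float gamma_ = 1.0f;

 public:
  void setGamma(float g) { gamma_ = g; }
  void hashParams() override {
    hashParam(gamma_);  // forgetting this would leave stale cached views
  }
};
```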

OpKey collision probabilities

I've found this website, which shows a simple table of hash collision probabilities, assuming the hash method is very good:

https://preshing.com/20110504/hash-collision-probabilities/#small-collision-probabilities

Let's say an OpKey had only the "op_tree_state_hash" field, which is what really represents the operation tree state (the other hashes are added to make an OpKey collision much harder to happen):

Example 1: No cache nodes. A compositor tree with 950 nodes and an average of 2 operations per node = 1900 OpKeys. The closest value in the table for 64 bits is 1921 hashes -> 1 bad render per 10 American trillion renders (10 European billion renders).

Example 2: With cache nodes and disk cache. A compositor tree with 950 nodes and an average of 2 operations per node + 600 thousand disk cache files = 601900 OpKeys. The closest value in the table for 64 bits is 607401 hashes -> 1 bad render per 100 million renders.

But for an OpKey collision to actually happen the other hashes would have to match too, so the chances of a collision are much, much lower.
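The figures above follow from the standard birthday-bound approximation p ≈ n(n-1)/2^(b+1) for n random b-bit hashes; a quick sketch to check them:

```cpp
#include <cassert>
#include <cmath>

// Birthday-bound approximation of the collision probability for n
// random b-bit hash values: p ≈ n * (n - 1) / 2^(b + 1).
double collision_probability(double n, int bits) {
  return n * (n - 1.0) / std::ldexp(1.0, bits + 1);
}
```

For 1921 64-bit hashes this gives roughly 1e-13 (one in ten trillion), and for 607401 hashes roughly 1e-8 (one in a hundred million), matching the table.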

Let's consider the worst-case scenario: a user gets a bad render because of an OpKey collision. If they notice the bad render, they would probably select the frame where it occurred and try to re-render it. Some inputs automatically produce different hashes, as the RenderLayers node does because it re-renders the scene, and that alone would fix the frame. If it doesn't, tweaking anything in the compositor tree would fix it. Even without knowing why it happened, it's very likely the user would try these things.

Implementing image algorithms in operations

  • In the current Blender implementation, when implementing an operation you have to take many things into account: the system is tile-based, many operations are implemented with per-pixel methods, and not all operations are buffered. As many image algorithms need more context than just the current pixel or tile, there is no single way operations must be implemented for a given image algorithm; it depends on the needs of the algorithm. When you need more context than the current pixel, many more methods get involved, for example for initializing tile data and declaring the area you need to read from input operations so that they are buffered.
  • In the new system you just need to implement execPixels(). There you first get the whole buffers of all the input operations you need to read by calling getPixels() on them. Then you define a lambda function with the image algorithm, which will receive, one by one and multithreaded, all the rectangles that must be written. For an operation with Compute (OpenCL) support, instead of writing a CPU function you define a kernel method in the same cpp file, which can be executed either by the CPU as C++ code or as an OpenCL kernel on the GPU when that is available and enabled. As in the previous system, you still need to implement the determineResolution method when the operation has a predetermined resolution. During CPU writing you receive a WriteRectContext, which you can use to check the total number of rects and the current pass you are in. This is useful for complicated algorithms where you may need several passes to do precalculations before writing; you can override the method getNPasses() (default is 1) to declare how many passes you want. See ToneMapOperation as an example. If the operation can't be implemented in a kernel and therefore doesn't support computing, you have to override the method isComputed() and return false; by default it is true, as I'm trying to implement as many operations as possible so they can be executed with OpenCL. In this branch a lot of operations that can't be executed with OpenCL in the current Blender compositor already can be. At the end of execPixels() you always have to call either cpuWriteSeek or computeWriteSeek, depending on whether the operation is computed or not, so that the write operation is executed. The call is synchronous.
    • Example of an operation with compute support -> Source Code
    • Example of an operation with only cpu support -> Source Code
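To make the execPixels() pattern concrete, here is a self-contained toy version of it. Only the names execPixels, getPixels, cpuWriteSeek, PixelsRect and WriteRectContext come from the proposal; everything else (signatures, the invert algorithm, the band splitting) is a simplified assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Self-contained sketch of the execPixels() pattern. PixelsRect,
// WriteRectContext and cpuWriteSeek mimic the names used in the
// proposal; the actual signatures in the branch differ.
struct PixelsRect {
  int xmin, xmax, ymin, ymax;
};
struct WriteRectContext {
  int current_pass, n_rects;
};

// Splits the full operation rect into `n_rects` horizontal bands and
// runs the image algorithm on each (the branch does this in threads).
void cpuWriteSeek(
    int width, int height, int n_rects,
    const std::function<void(PixelsRect &, const WriteRectContext &)> &fn) {
  int band = (height + n_rects - 1) / n_rects;
  for (int i = 0; i < n_rects; i++) {
    PixelsRect rect{0, width, i * band, std::min((i + 1) * band, height)};
    WriteRectContext ctx{0, n_rects};
    if (rect.ymin < rect.ymax) fn(rect, ctx);
  }
}

// A toy "invert" operation: reads the whole input buffer (the getPixels
// step), then writes 1 - v rect by rect via the lambda.
std::vector<float> invert_exec_pixels(const std::vector<float> &input,
                                      int width, int height) {
  std::vector<float> output(input.size());
  auto cpu_write = [&](PixelsRect &dst, const WriteRectContext &) {
    for (int y = dst.ymin; y < dst.ymax; y++) {
      for (int x = dst.xmin; x < dst.xmax; x++) {
        output[y * width + x] = 1.0f - input[y * width + x];
      }
    }
  };
  cpuWriteSeek(width, height, /*n_rects=*/4, cpu_write);  // synchronous
  return output;
}
```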

Implementing compatible CPU/GPU code for computing systems (OpenCL right now)

  • As you may have seen in the operations' kernel code, it uses some macros. They are needed to abstract between CPU (C++) code and the compute system code (OpenCL). They also help simplify the implementation, as you always work with image coordinates and don't need to know about buffer offsets or paddings. Another reason is that the image being read may come from a SingleElemOperation and contain a single pixel; the macros are aware of this and always just read that pixel. Thanks to these abstractions, the code so far is 100% shared between C++ and OpenCL for operations that support computing. Taking into account the number of operations being implemented, this helps maintainability a lot.
  • For the kernel abstractions I took code from Cycles. I've been modifying a lot of it as the needs of the compositor became clear, so even if it seems I'm duplicating a lot of code, it's not that much.
  • As everything is heavily abstracted, it should be possible to add other compute systems like CUDA in the future without much work, but I don't think that is a priority.
  • I wrote a tool called "defmerger", similar to "datatoc", so that I can write each kernel in the same cpp file as the operation. I just surround the kernel code with "#define OPENCL_CODE" and "#undef OPENCL_CODE". Later, defmerger looks through all the operation files for the code between OPENCL_CODE tags and merges it into a single file or string. The header "#include "COM_kernel_opencl.h"" is added, and a method I took from Cycles resolves all the includes and preprocessing so that the result is ready to be read by OpenCL. The tool also allows writing code specific to C++ or OpenCL in between kernel code, but I never needed that and it should be avoided. It would look like this:
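A hypothetical example of what such a file might look like. Only the OPENCL_CODE markers come from the proposal; the kernel macro names below are invented for illustration (the real ones live in the branch's kernel headers), so treat this as a schematic, not actual branch code:

```cpp
// In the operation's .cpp file, next to the C++ implementation.

#define OPENCL_CODE  // defmerger extracts everything from here...

// Illustrative macro names: the real branch defines its own READ/WRITE
// macros that expand to C++ or OpenCL depending on the build target.
CCL_KERNEL invertKernel(CCL_WRITE(dst), CCL_READ(color))
{
  READ_DECL(color);
  WRITE_DECL(dst);
  CPU_LOOP_START(dst);
  COPY_COORDS(color, dst_coords);
  READ_IMG(color, color_pix);
  color_pix = 1.0f - color_pix;
  WRITE_IMG(dst, color_pix);
  CPU_LOOP_END;
}

#undef OPENCL_CODE  // ...to here, into the single OpenCL program
```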

Vectorizing

There is now the possibility of using vectors. The vector types implementation is taken from Cycles; I just modified or added what was needed. I've been vectorizing operation code where possible: besides improving performance, it often makes the code simpler and more readable.

Sampling

  • Any kind of sampling is now always done over the result of the operation being read. In the current Blender implementation, because not all operations are buffered, sampling is done over the last buffered operation, which may or may not be the operation being read. This affects the output very little, but it probably does slightly. Sampling an operation behind the one you want to sample, and then executing the read operation's algorithm over that, is not desirable in my opinion. The behavior also wasn't consistent; it depended on which operations were buffered. Now everything behaves consistently, as all operations are buffered.
  • But since all operations are now buffered, simple distort operations that in the current Blender implementation need sampling but are not buffered (scale and rotate only) can no longer be concatenated without sampling being applied for each of them. So every time you use a scale or rotate node, the operation takes effect immediately and is sampled. Depending on the point of view this may be a disadvantage or an advantage, but I think a new/non-expert user would always expect the operation to take effect immediately. It can be very confusing that you scale down and then scale up and get exactly the same image quality, that inserting non-buffered nodes in between changes nothing, but that inserting a buffered node suddenly makes the image appear pixelated (of course the user doesn't know which nodes are buffered, nor what a buffer is). Some users might understand the behaviour, but most of the time it's confusing for the average user.
  • I implemented the transform node as a single operation, so that rotate and scale can be combined in one operation and sampling is done only once -> TransformOperation

Note: The current Blender implementation uses sampling for TranslateOperation too. I'm not using sampling for that operation in this implementation, as I don't think sub-pixel precision is needed there and it would cause more harm than good, especially since this operation is automatically used for centering images when mixing 2 images in an operation.

Changes / New features

  • Any-data sockets: I created a new socket type (green). It indicates to the user that any kind of image data (1 (gray), 3 (purple) or 4 channels (yellow)) can be input and it won't be converted. For output sockets it means the output will be the same type of data as the main input socket. An example is the Scale node: it doesn't matter how many channels the data has, you just want to resize the image.
  • Cache Node: You can place this node anywhere in the tree and the previous operation result is cached; if you modify a node ahead of it, everything is recalculated from this point only. If you modify a parameter or the tree structure behind the Cache Node, it will automatically recalculate everything behind it and cache it again. It behaves as both a memory and a disk cache. First it caches to memory; if the memory cache fills up to the limit set by the user, it discards the least recently used caches as needed, saving them to the disk cache if available (otherwise deleting them). The disk cache behaves the same way, except the oldest caches are always deleted. If the user wants to use only the disk cache, they can enable it in preferences and set the memory limit to 0. There is no compression at the moment; an option will probably be added in the future.
  • Previews and Viewers are now cached: This is necessary because the UI may have glitches, or may call the compositor when no update is really needed, which in fact happens. For example, if you disconnect a socket by pressing without releasing, and connect it again in the same place, the compositor is asked to recalculate everything when it's not necessary. Now, if such a thing happens, the compositor operation hashes are exactly the same, so the cached previews and viewers are returned very fast.
  • Option to change Preview Quality: Previously previews were always 140 pixels; if you zoomed in or increased the size of the nodes you would see pixelated previews. Setting previews to high quality has almost no effect on performance. Right now you may choose:
    • Low Quality = 150 pixels (default)
    • Medium Quality = 300 pixels
    • High Quality = 450 pixels
  • Option to scale inputs down: This option is a fast way to reduce the size of inputs (images, renders, textures, masks, video clips...). Most of the time the user doesn't need them at their original size, only when rendering the final result. So when working and testing different node parameters, instead of zooming out the view, the user may scale down the inputs with this option: it increases performance a lot and at the same time reduces the size of the viewer. It affects the resolution of all nodes from input to output. It needs all node values to be relative to produce the same result as at 1.0 scale, so for nodes with operations that use absolute values or pixel-based effects, scaling down will produce a heavier effect. These issues are meant to be resolved once I implement relative space: all values will be converted to relative space, and scaling down should then produce results closer to full resolution, though never perfect.
  • Video Sequencer Node?: Allows you to use any video sequencer channel as compositor input. It may seem illogical, as the VSE comes after the compositor in the render pipeline, but many people find it very useful, and it's faster than the movie clip node because it benefits from the VSE cache. As of now, though, this can't make it into Blender, because the VSE calls the render pipeline when using scene strips, and that causes a deadlock when the Video Sequencer Node is used. I'm circumventing the issue by calling a method that is not multithreading-ready, but that's obviously not good and causes OpenGL drawing errors in VSE previews. If the VSE, instead of calling the render pipeline on the main thread, launched an asynchronous job and got the results from files, I think it would be possible, and the UI wouldn't hang as much when using scene strips in rendered mode. But I don't know much about this area; I'm just trying to find a solution.
  • Pixelate Node: Previously this node required the user to surround it with scale nodes with inverse values to get a desirable effect. A size option has been added, so there is no need to surround it with scale nodes anymore.
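The Cache Node's eviction order (memory first, least recently used entries spilled to disk or dropped) can be modelled with a small sketch; all names here are illustrative, and real cache entries would of course carry pixel data:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>

// Toy model of the two-level cache policy: newest entries live in RAM;
// when the RAM budget is exceeded the least recently used entry is
// spilled to disk (or dropped when the disk cache is disabled).
class CachePolicy {
  std::list<std::uint64_t> lru_;               // front = most recent
  std::map<std::uint64_t, std::size_t> size_;  // per-entry sizes
  std::size_t ram_used_ = 0, ram_limit_;
  bool disk_enabled_;

 public:
  std::size_t spilled_to_disk = 0, dropped = 0;

  CachePolicy(std::size_t ram_limit, bool disk_enabled)
      : ram_limit_(ram_limit), disk_enabled_(disk_enabled) {}

  void insert(std::uint64_t op_key, std::size_t size) {
    lru_.push_front(op_key);
    size_[op_key] = size;
    ram_used_ += size;
    // Evict least recently used entries until we fit the budget
    // (always keeping the entry just inserted).
    while (ram_used_ > ram_limit_ && lru_.size() > 1) {
      std::uint64_t oldest = lru_.back();
      lru_.pop_back();
      ram_used_ -= size_[oldest];
      size_.erase(oldest);
      if (disk_enabled_) spilled_to_disk++; else dropped++;
    }
  }

  std::size_t ram_used() const { return ram_used_; }
};
```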

Removed options from UI

  • Buffer groups: Not needed anymore, as all operations are now buffered.
  • Chunk size: How operation writing is divided is now implementation-defined (depending on the number of threads the system can execute at full performance, and the best work group size for GPU devices). This is how it should be, since the user shouldn't have to care about these things.
  • Two pass: This option skipped nodes that could take too much time, and low-priority outputs (viewers and previews, I believe), on the first pass, and then executed the whole tree again on a second pass including the slow nodes. I don't think this is needed anymore, mainly because of cache nodes, the input scale option, and generally improved performance. It's better to just show the final image as soon as possible. Keeping this option would impose some handicaps on the implementation, especially for caching.

Performance tests

The tests are done with a 1920 x 1080 image.
Operating system: Windows-10-10.0.18362-SP0 64 Bits
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Graphics card: GeForce GTX 1060 6GB/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38

Test 1 - A bunch of nodes that are not currently buffered in the Blender compositor, but all of which are in compositor-up. With CPU:


  • Blender 2.91 master: 1.15 seconds / 221.62 MB (peak memory)
  • Blender 2.91 compositor-up: 0.36 seconds / 309.31 MB (peak memory)

Even though in the current Blender implementation no operations other than inputs and outputs are buffered, while in compositor-up all operations are, there is no big difference in memory. That's because buffer recycling is actually working; otherwise peak memory would be much worse. No matter how many nodes you add linearly, buffers are recycled all the time. Peak memory depends more on how much branching there is from operation outputs, because the BufferManager has to keep a buffer until all branches have finished reading the operation result. It has to keep it longer, so it may need more memory.

Test 2 - A few nodes, most of which are buffered in the current Blender compositor; these nodes have more complex operations. With CPU:


Result:

  • Blender 2.91 master: 1.66 seconds / 724.65 MB (peak memory)
  • Blender 2.91 compositor-up: 0.72 seconds / 305.82 MB (peak memory)

In compositor-up, memory usage is about the same as in test 1 even with complex operations, as buffers are recycled in the same way. The current Blender implementation keeps the buffers from the start to the end of execution, so the more nodes with complex operations you add, the more memory it consumes.

Test 3 - Same as test 2 but with OpenCL:


Result:

  • Blender 2.91 master: 16.45 seconds / 763.45 MB (peak memory)
  • Blender 2.91 compositor-up: 0.21 seconds / 153.97 MB (peak memory)

About the time: I know very few nodes are implemented with OpenCL in the current Blender implementation, but there is clearly something wrong. It may happen when mixing nodes that can be executed with OpenCL and others that can't. Bokeh blur, I'm sure, is implemented with OpenCL, and I think it works fine if you use only that node.

About the peak RAM in compositor-up: it's lower because GPU memory is now mostly used and recycled in the same way as was previously done for RAM, so much less RAM is needed.

In compositor-up, every time you enable OpenCL the first render takes some time because the OpenCL program has to be loaded, but from there on it should go smoothly. It's on my TODO list to compile it once and save it in a cache.

As you may have noticed, the image results are slightly different; I'm pretty sure it's related to sampling. The same happens in test 2, but it's even less noticeable. As I've already explained, in the current Blender implementation sampling is done over the last buffered operation, which may or may not be the last operation, while in compositor-up it is always the last operation. That could be part of it, but in test 3 there is another factor: I'm using native OpenCL sampling, and the algorithm or precision may be slightly different. I may just always use Blender's sampling implementation for both CPU and GPU in the future to avoid differences. My intention is to do proper render tests per node, compare against the current Blender implementation, and fix differences where possible.

Important TODOs:

  • Fix cache node not working for complex trees when placed in several branches.
  • Implement relative space.
  • Clean up the code, removing prototyping constructs such as GlobalManager, and instead pass managers and the CompositorContext as arguments.
  • Use more blender library methods where appropriate, as for example, for hashing.
  • Make render tests and OpKey collisions tests.

GitHub repository: https://github.com/m-castilla/blender-compositor-up

Binaries: https://github.com/m-castilla/blender-compositor-up/releases

Event Timeline

Thanks for such a detailed presentation of your proposal.

There will be some points to discuss/agree on :)

Think it does make sense to go back to the entire-image-as-fast-as-possible approach. It does simplify a lot of things for both users and developers. On a seemingly related topic, one thing we were always looking forward to is becoming more canvas-based. One of the examples of that is being able to undistort an image, perform modifications, distort it back, and not have any black areas around the "barrel". Is this something that is still possible with the buffers design you've described?

On the optimization step, you're talking about the number of reads. Is it the number of read images, or the number of read pixels? I'm actually a bit lost in this step. Do you have some simple example to visualize how this step works?

The cache keys I'm not sure are the right direction here. Think we already have all the bits needed for a fully reliable system: we have the SessionUUID concept, we have update tagging and the dependency graph. Think a reliable cache system can work the following way:

  • Cache stores outputs of node, keyed by SessionUUID of a node and its output socket.
  • RNA/depsgraph tags nodes when their dependency or setting changes.
  • Compositor "flushes" tags starting from node to the outputs, invalidating cache of the dependent nodes.

The GPU support sounds interesting, but in practice it might easily become more of a maintenance burden. Not sure it should be a priority from day 0, but in the longer term I think it's interesting to get GPU acceleration.

One thing which strikes my attention in the design is the need for the defmerger tool, and the fact that host and device code are so interleaved. It should be possible to have a separation level between host and device, avoid the need for extra tools, and keep the code nice and readable.

Another thing I'm not sure about is how the operation buffers are managed for the GPU. You wouldn't want data transfers back and forth for every node, so you want to keep everything on the GPU as much as possible, ideally with the entire process happening on the GPU, without pushing it out of VRAM somehow. How do you manage this?

The performance tests show a very nice speedup! Just to confirm: it is not caused by the caching, right?

Thanks for looking into it :)

As GPU support is not a priority, we could consider it not part of the proposal. Same for caching/OpKeys and the new features. That would simplify things a lot: just improving CPU performance and memory consumption for the time being. If at some point it's decided to support the things I've talked about in the proposal, we can always come back to it.

About canvas-based:
Yes, the buffer recycler will create or reuse a buffer that has enough memory for the resolution of the operation. The resolution is set in NodeOperation::determineResolution as always; it can be anything. If an operation needs more buffers for intermediate modifications, the operation code can request more buffers from the recycler for any resolution and give them back once finished. The "fit" option of the lens distortion node is an example, right? I've just tried it; it works fine.

About optimization step:
It's the number of image reads (of the whole operation buffer result). An example node tree:

ImageOperation -> GammaOperation -> ViewerOperation

First phase (Optimization): The operations call getPixels() on their input operations to get their output buffers, like this:

ViewerOperation::getPixels() -> GammaOperation::getPixels(this) -> ImageOperation::getPixels(this)

With the "this" argument the calling operation tells the read operation who the reader is. The read operation then reports to the ReadsOptimizer that it has received a read; the ReadsOptimizer counts the read and registers the operation that did it. ReadsOptimizer actions log, in order:

- Register GammaOperation has read ImageOperation and increment ImageOperation received reads
- Register ViewerOperation has read GammaOperation and increment GammaOperation received reads

At the end of the phase it has this information:

- ImageOperation: Read operations = NONE | Received reads: 1
- GammaOperation: Read operations = [ImageOperation] | Received reads: 1
- ViewerOperation: Read operations = [GammaOperation] | Received reads: 0

Second phase (Exec): The reads information is retrieved by BufferManager and used to recycle buffers. Now what getPixels does is try to get the operation output buffer from BufferManager. If it's already written it returns it, otherwise it creates the write function object and BufferManager executes it on the output buffer (BufferManager manages the output buffers). BufferManager action log:

ImageOperation write seek:
- Take buffer from recycler -> (Recycler Buffers: 0) -> new buffer.
- Execute write on buffer and save it as ImageOperation output buffer
- Get ImageOperation read operations -> NONE
- Return output buffer.

GammaOperation write seek:
- Take buffer from recycler -> (Recycler Buffers: 0) -> new buffer
- Execute write on buffer and save it as GammaOperation output buffer
- Get GammaOperation read operations -> [ImageOperation]
- Increment ImageOperation received reads -> Current/Total Reads: 1/1 -> (Total reads reached! Recycle!)
- Give ImageOperation buffer to recycler -> (Recycler Buffers: 1) 
- Return output buffer

ViewerOperation write seek:
- Take buffer from recycler -> (Recycler Buffers: 1) -> get buffer -> (Recycler Buffers: 0)
- Execute write on buffer and save it as ViewerOperation output buffer
- Get ViewerOperation read operations -> [GammaOperation]
- Increment GammaOperation received reads -> Current/Total Reads: 1/1 -> (Total reads reached! Recycle!)
- Give GammaOperation buffer to recycler -> (Recycler Buffers: 1) 
- Return output buffer

There can be thousands of nodes; it'll recycle all the time. If nodes have different resolutions it's the same: the recycler always tries to find a buffer that has enough memory for the resolution, and if none is found it'll create a new one.
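The two-phase scheme walked through above can be reduced to a minimal sketch, with operations represented just by name. This is an assumption-laden toy, not the branch's ReadOptimizer: phase 1 only counts who reads whom, and phase 2 uses those totals to decide the moment an output buffer can go back to the recycler.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy read counter (illustrative names, not the branch's API).
class ReadOptimizer {
 public:
  // Phase 1 (Optimization): register that `reader` reads `read_op`.
  void register_read(const std::string &reader, const std::string &read_op)
  {
    read_ops_[reader].push_back(read_op);
    total_reads_[read_op]++;
  }

  // Phase 2 (Exec): report one executed read; returns true when the read
  // operation has received all its reads, i.e. its buffer can be recycled.
  bool count_exec_read(const std::string &read_op)
  {
    return ++done_reads_[read_op] == total_reads_[read_op];
  }

  // Input operations a given operation reads (its "Read operations" list).
  const std::vector<std::string> &read_ops(const std::string &op)
  {
    return read_ops_[op];
  }

 private:
  std::map<std::string, std::vector<std::string>> read_ops_;
  std::map<std::string, int> total_reads_;
  std::map<std::string, int> done_reads_;
};
```

For the ImageOperation -> GammaOperation -> ViewerOperation example, executing GammaOperation completes ImageOperation's single registered read, which is exactly the "Total reads reached! Recycle!" point in the log above.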

About tests
There is no cache.

The following is not important for the time being. I'm sure you could help me improve it or find a better solution, but we would need a long conversation to understand each other. This is just to answer.


About caching:
What you described is indeed better if it's OK to invalidate the cache. In the proposal I want to keep the caches indefinitely, so that a cache will be retrieved no matter what changes happened anywhere in the tree in between. Only the oldest caches are deleted, when the maximum cache size set in the options is reached. Example:

ImageOperation -> BrightnessOperation -> CacheOperation -> GammaOperation -> ViewerOperation

- Tree is executed once normally. Brightness parameter is 0.0, gamma parameter is 0.0. The cache operation saves a cache in both approaches.
- Change gamma parameter to 0.1:
    · SessionUUID: Cache retrieved
    · OpKey: Cache retrieved
- Change brightness parameter to 0.1:
    · SessionUUID: Previous cache is invalidated, new cache is saved.
    · OpKey: New cache saved. Previous caches are kept.
- Change brightness parameter to 0.0:
    · SessionUUID: Previous cache is invalidated, new cache is saved.
    · OpKey: Cache retrieved
- Change brightness parameter to 0.2:
    · SessionUUID: Previous cache is invalidated, new cache is saved.
    · OpKey: New cache saved. Previous caches are kept.
- Change brightness parameter to 0.1:
    · SessionUUID: Previous cache is invalidated, new cache is saved.
    · OpKey: Cache retrieved
- Delete whole node tree:
    · SessionUUID: Previous cache is invalidated.
    · OpKey: Do nothing.
- Build the exact same tree with same parameters as before:
    · SessionUUID: Has no cache, save new cache.
    · OpKey: Cache retrieved

The OpKey approach works for modifications both before and after the cache node, as nothing is calculated twice. Just by using a cache node at the end of the tree, the whole tree would benefit. As a user I switch between the same values a lot to compare results; with a cache, once something is calculated it's fast to compare values anywhere. It would work for animations too if they were cached. But of course by doing all this I'm adding a lot of code, and I understand if you consider it not worth maintaining. In that case I would do it as you said, but it worries me that if an operation uses a value that is not calculated/obtained from nodes and it affects the image result, the dependency graph won't know about it and the cache won't be invalidated. With the OpKey approach you can hash anything in hashParams(), as in that context you know and can access everything the operation uses. This could happen because there are operations that use other Blender libraries.
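The OpKey behaviour in the table above follows from how such a key could be built: hash the operation type, its parameters and the keys of its inputs, so an identical subtree always maps to the same key no matter when it is rebuilt. A rough sketch under that assumption (make_op_key and the single float parameter are illustrative stand-ins for hashParams(); the hash-combine constant is the common boost-style mixer):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using OpKey = size_t;

// Hypothetical key builder: type + parameters + input keys.
OpKey make_op_key(const std::string &op_type, float param, const std::vector<OpKey> &inputs)
{
  size_t h = std::hash<std::string>{}(op_type);
  // Fold in the operation parameter (stands in for hashParams()).
  h ^= std::hash<float>{}(param) + 0x9e3779b9 + (h << 6) + (h >> 2);
  // Fold in the keys of the input operations, so the key identifies
  // the whole upstream subtree state.
  for (OpKey in : inputs) {
    h ^= std::hash<size_t>{}(in) + 0x9e3779b9 + (h << 6) + (h >> 2);
  }
  return h;
}
```

Changing brightness to 0.1 produces a new key (new cache entry), while changing it back to 0.0 reproduces the original key, so the old entry is found again instead of being invalidated.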

About GPU separation layer:
If you mean separating GPU/CPU code in operations, it would be cleaner and I wouldn't mind doing it, but I would have to maintain two versions of the same image algorithm: one for the CPU in a .cpp file and another for the GPU in a .cl file using OpenCL-specific code. That gives room to introduce differences in the algorithm between CPU and GPU, and it needs bug fixing/debugging for both versions. In my case I haven't found any OpenCL debugger that works for me; "printf" only.
As I've abstracted what is needed from OpenCL, I only have to maintain one version.
The "defmerger" tool allows me to write the code once, in the cpp file only. It looks through all the operation cpp files and merges the code into an OpenCL file at compile time. If the code works for CPU, it automatically works for GPU. The benefit I get by debugging CPU code translates to GPU, even if I'm not really debugging "GPU" code. It's rare that code won't work for both; you just need to take into account that CPU code executes a loop while GPU code doesn't. I rarely need to change code in defmerger or the GPU abstraction. Adding new operations and fixing bugs in them is much more frequent, and all the operations code is much larger and more complex; I cannot afford two versions of it. If in the future we want to replace OpenCL with Vulkan, it may be enough to add new code to the GPU abstraction, without changing any operations code.
It has the downside that I have to use some macros.
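To illustrate the single-source idea behind defmerger (these macros are a toy, not the branch's real ones): the per-pixel body is written once; on CPU a macro wraps it in a loop, while the merged OpenCL version would use the same body with the loop implicit, one work-item per pixel via get_global_id. __OPENCL_VERSION__ is the macro OpenCL compilers predefine, so the C++ build below takes the loop branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy single-source macros: CPU gets an explicit loop, an OpenCL build
// of the same body would get one work-item per index instead.
#ifdef __OPENCL_VERSION__
#  define KERNEL_LOOP_BEGIN(i, n) { size_t i = get_global_id(0);
#  define KERNEL_LOOP_END() }
#else
#  define KERNEL_LOOP_BEGIN(i, n) for (size_t i = 0; i < (n); i++) {
#  define KERNEL_LOOP_END() }
#endif

// One shared algorithm body, written once: gamma of 2 (square each value).
void gamma_square(const float *src, float *dst, size_t n)
{
  KERNEL_LOOP_BEGIN(i, n)
    dst[i] = src[i] * src[i];
  KERNEL_LOOP_END()
}
```

Debugging the CPU build of gamma_square then also validates the logic the GPU side would run, which is the maintenance win described above.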
One thing to improve is that BufferManager and BufferRecycler have to ask a lot about whether a buffer is from CPU or GPU; that needs more thought.

About how the operation buffers are managed for GPU:
GPU buffers are recycled the same way as CPU buffers. But it has to take into account that some operations have a compute (GPU) implementation and others don't. If an operation is computed it'll create/recycle a GPU buffer. If the next operation is computed too, it uses the GPU buffers from previous operations, keeping them in GPU memory. If the next operation doesn't have compute support, it has to map the input buffers to host memory; there is no choice. To be able to execute a tree from start to finish on GPU only, I would have to get rid of all nodes that have no GPU support. There are nodes that are impossible to implement on GPU, for example the denoise node, as it uses a library. I think it's good to have the flexibility to support all kinds of nodes, but to try to implement as many with compute support as possible. From the profiling I've done, the performance impact of "queue job, execute it" several times, instead of queueing all jobs or doing a single job and executing everything together, is minimal. Here a job/task is an operation's write algorithm on a whole image, which is a lot of work. What has a big impact on performance is mapping/copying GPU buffers to host memory and vice versa. I try to avoid that as much as possible and only do it when needed.

As GPU support is not a priority, we could consider it not part of the proposal. The same goes for caching/OpKeys and new features. That would simplify things a lot: just improving CPU performance and memory consumption for the time being. If at any time it's decided to support the things I've talked about in the proposal, we can always come back to it.

Putting the main focus on getting the CPU path to work nicely and fast, with a reasonable amount of memory, seems like a good choice to me.
Further development can then happen on top of it.

About optimization step

Not sure why there is a need for any support from the operation itself to provide information about reads. To me a clear and robust approach would be to handle it from the scheduler side, based on nodes and links in the compositor graph.

The re-usable memory buffers are another complexity where it is not immediately clear to me why it is required. Doing a frame-sized allocation is fast compared to the logic of a node. So to me it is important to free buffers as soon as possible, to keep memory usage low, but attempts to avoid allocation calls don't seem to be that important. One thing which can be done here is to make all buffer (de)allocation happen in a centralized place, so that if allocation ever pops up in a profiler it can be addressed in one single place.

Curious to know if you already made benchmarks like this and saw allocation to be a bottleneck.

About caching

To me depsgraph-based invalidation seems more reliable and clear. While it might not be the fastest in all cases (as your examples showed), it deals with cases like a changed property of a generated image used in the Image input node, or when one changes the distortion model of a Movie Clip. Hashing all those settings from the compositor seems quite a fragile thing to do.

About GPU separation layer

I wasn't really talking about two different codepaths for kernels. Just from the code example it seemed that host and device code are interleaved and controlled by preprocessor directives. This is something oneAPI is trying to solve: allowing easy interfacing of device and host code. But in the case of CUDA/OpenCL the cleanest approach is still to keep kernel code separate from host code. Surely, there could be some ifdefs in the kernel to fine-tune specific cases, but that is more the exception than the rule.

Not sure why there is a need for any support from the operation itself to provide information about reads. To me a clear and robust approach would be to handle it from the scheduler side, based on nodes and links in the compositor graph.

To me it's more robust to do it from the operation, because that is where the reading is actually executed. An operation may decide to read or not read an input operation based on its own logic. It makes sense to me that an operation avoids reading input operations when it doesn't need them. I wouldn't rely on the node graph for anything related to operations execution; it doesn't represent what is actually executed in the end.

The re-usable memory buffers is another complexity which is not immediately clear to me why is it required ....
Curious to know if you already made benchmarks like this and saw allocation to be a bottleneck.

I did benchmarks. I only remember one with a lot of color operations, which are usually very simple (even one line of code in the loop), around 80 nodes: approximately 2 seconds without recycling and 1 second with recycling on CPU. For GPU it's even more important. I think it would be worth it even if all operations were complex. I'll try to do a simple test that you can easily execute and decide for yourself whether it's worth it or not. I'll do it another day and post it here.

To me it's more robust to do it from the operation, because that is where the reading is actually executed. An operation may decide to read or not read an input operation based on its own logic. It makes sense to me that an operation avoids reading input operations when it doesn't need them. I wouldn't rely on the node graph for anything related to operations execution; it doesn't represent what is actually executed in the end.

The read can also be conditional. For example, if the factor of a mix node is controlled by an image, the mix operation might decide to skip reading either of the image inputs based on the mix factor. If you want to check the number of reads reliably in this case, you'd need to fully execute the operation, at which point you can just as well use the output rather than try to optimize something and run a second round of execution.
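The conditional read being discussed can be sketched minimally, with inputs reduced to callables. MixOp and its members are illustrative names; the point is only that whether the second input is read at all depends on the operation's own runtime logic, not on the node graph's links.

```cpp
#include <cassert>
#include <functional>

// Toy mix operation: the factor decides which inputs are actually read.
struct MixOp {
  float factor;
  int reads_done = 0;

  float read_input(const std::function<float()> &input)
  {
    reads_done++;  // count each read actually performed
    return input();
  }

  float execute(const std::function<float()> &a, const std::function<float()> &b)
  {
    if (factor <= 0.0f) {
      return read_input(a);  // input b is never read
    }
    if (factor >= 1.0f) {
      return read_input(b);  // input a is never read
    }
    return (1.0f - factor) * read_input(a) + factor * read_input(b);
  }
};
```

With factor 0, the graph shows two incoming links but only one read happens, which is the mismatch between graph-side and operation-side read counting debated above.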

A more straightforward process usually is:

  • Perform constant folding
  • Remove unconnected nodes
  • Possibly de-duplicate input nodes (Images, render layer, etc)
  • Do single pass of execution

I did benchmarks.

What is the configuration of your machine (CPU, OS, RAM)?

Operating system: Windows-10-10.0.18362-SP0 64 Bits
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Graphics card: GeForce GTX 1060 6GB/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38
RAM: DDR4 16 GB

I did a test for buffer recycling. I can only test it on Windows, and I get similar results to what I said. On Linux I tested on a VM, but I'm not sure the results are reliable; it's just very slow with both approaches, and the difference on Linux seems to be much lower than on Windows -> https://developer.blender.org/P1830

Perform constant folding

What does this part do? I don't understand what it means.

What does this part do? I don't understand what it means.

Constant folding is something from compiler optimization. The long version is on Wikipedia.

In the context of nodes you can think of it as:

If you have a mix node with factor 0 (which is not connected to anything) you can always use the first input as the output (so you never need the second input). Similarly, if the factor is 1 you never need the first input, and can use the second input as the output.

You do such optimizations on the graph itself, by removing node which got "constant folded" and re-routing links. In this example

Image 1 -> Mix with factor 0 -> Blur
Image 2 -----^

becomes

Image 1 -> Blur

Similarly you can remove blur nodes with a blur size of 0, pre-calculate math nodes, things like that.
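The folding rule for the mix example can be sketched as a tiny graph rewrite. This is a toy under simplifying assumptions: MixNode and fold_mix are illustrative names, and the "graph" is reduced to picking which upstream node survives and gets re-routed to the downstream link.

```cpp
#include <cassert>
#include <string>

// Toy mix node for constant folding (illustrative, not Blender's types).
struct MixNode {
  std::string input_a;
  std::string input_b;
  float factor;
  bool factor_connected;  // folding only applies to unconnected factors
};

// Returns the node that should feed the downstream link after folding,
// or "" when the mix cannot be folded away.
std::string fold_mix(const MixNode &mix)
{
  if (mix.factor_connected) {
    return "";  // factor driven by a link: must execute the mix
  }
  if (mix.factor == 0.0f) {
    return mix.input_a;  // second input never needed
  }
  if (mix.factor == 1.0f) {
    return mix.input_b;  // first input never needed
  }
  return "";
}
```

Applied to the example above, `Image 1 -> Mix(factor 0) -> Blur` re-routes to `Image 1 -> Blur` and the Mix node (and the link from Image 2) disappears from the graph before execution.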

We do have constant folding in Cycles (ShaderGraph::constant_fold). Worth a look!

I looked at it. I get the idea now, and I'll do it as you said. Constant folding logic may be added to each node's convertToOperations.

A few questions/suggestions:

  • I think this should be done in a separate branch and merged to master once everything is working and reviewed; we can't have some things working tile-based and others full-size without breaking something. I could still submit small patches to be reviewed in that branch.
  • There are a lot of operations/nodes, so it's going to take a lot of time to review all the patches if I submit, for example, one patch per node with all its operations. I wonder if doing render tests could help -> D6334: Compositor automated testing
  • Now that all operations would be buffered, they take effect immediately. This will make the behaviour more predictable, but some things that work in the current compositor won't anymore. In T57568: Transform Node Messes Up Alpha Channel Manipulation, the user wants to scale up to fill the output resolution, which works only for non-buffered nodes. Now scaling up would always crop within the current operation resolution. I suggest adding an "Adjust Canvas" check option to the scale/transform node. To avoid users scaling to huge resolutions, it could be limited internally to the output resolution set in the scene.
  • What to do with GPU code? Should it be removed? This task said so -> T73586: Code Quality Day Tasks

I've made a plan taking into account everything we've talked about. I still have to register the reads before executing operations; it's not a pass, think of it like the determineResolution method. We cannot predict resolutions just by looping over the operations tree. Likewise, I cannot predict the rendered rects here, and they affect the read counts we need in order to know when we can free output buffers.
The following starts from the current compositor implementation; let's forget about my branch. Let's see if it's OK:

  1. Remove GPU related code from WorkScheduler and CompositorContext. Delete COM_OpenCLDevice and COM_Device files, only need COM_CPUDevice. Delete COM_OpenCLKernels.cl file. Remove all operations OpenCL code.
  2. Remove tile-based, two-passes and buffer groups related code (except operations code):
    1. COM_ExecutionSystem: Remove code related to execution groups. All operations are buffered, so now we can directly execute output operations. First we pass the render/viewer border (if any, otherwise full frame) to output->determineRectsToRender(render_rect) so they can be known on the first operation read and we avoid executing twice (yes, there may be more than one rect, e.g. viewer border + first pixel), then output->registerReads() to know when we can free output buffers, and then execute the algorithms with output->getPixels().
    2. COM_compositor: Remove code related to two-passes.
    3. COM_NodeOperationBuilder: Remove add_complex_operation_buffers() and group_operations() methods and any related code.
  3. Create a BufferManager class for handling creation/recycling/freeing of all MemoryBuffers, especially output buffers, so that they are freed/recycled as soon as their dependent reads are finished. It returns std::unique_ptr<MemoryBuffer> when giving buffer ownership to the caller, otherwise const MemoryBuffer* when the caller should not modify it. For function parameters, BufferManager takes ownership when the parameter is a unique_ptr. BufferManager will set custom destructors accordingly (e.g. for recycling). As a first step it will just create/free. Later I'll implement recycling and do benchmarks with real use cases.
  4. COM_MemoryBuffer:
    1. Remove variables related to chunks, memory proxy, etc. Remove MemoryBufferState concept. There are no chunks now.
    2. Will have public variables (width, height, elem_stride, row_stride...) and operator[] for ease of use because now it's used everywhere. Private constructors, you have to create them with BufferManager.
    3. Remove copy/duplicate methods. Only the buffer manager can create buffers. For copying there will be a utility class.
  5. COM_WorkScheduler:
    1. Same implementation, we're just not scheduling ExecutionGroup chunks now. It'll be used for scheduling any kind of work (std::function<void()>) from a WorkManager class. WorkManager methods split a given operation's work into rects, ranges, etc., executing them multi-threadedly with WorkScheduler.
    2. WorkManager will handle updating the compositor progress bar. It will be notified of operation starts/finishes to do so. Optionally, for more precise progress updates, operations may override ::getNumberOfWorks to tell how many works they execute; the default is 1, which suits most operations.
  6. Create an ExecutionContext class that gives access to BufferManager and WorkManager during operations execution. Copy the isBreaked() and updateDraw() methods from NodeOperation into this class; they make more sense here.
  7. Create a BufferUtil class with methods for pixel operations. It will help with refactoring code. Examples: copyRect(rect, from_membuf, to_membuf), clearRect(rect, membuf).
  8. Convert all operations from tile-based to full-frame. Now initExecution()/deinitExecution() are called on an operation's "first received read"/"end of first received read" instead of tree start/end. The only methods we need to implement are execPixels, determineResolution for custom resolutions, and getInputAreaOfInterest(input_idx) when we need an input rendered rect different from the current operation's (e.g. +1 in each dimension for bilinear sampling if not full frame). This last method is needed to properly support Render Region and Viewer Region, and is much simpler than the previous determineDependingAreaOfInterest: now we don't need to determine children inputs, we just return a rect with the area we need to read from the specific input.
    1. Create a SimpleOperation abstract class that requires extending classes to implement execPixelsMultiRect() instead of execPixels(). execPixelsMultiRect calls are multi-threaded, each passing a split rect that must be written. It will have overridable onExecPixelsStart() and onExecPixelsEnd(), called just before and after rects execution. Most operations will extend this.
    2. Other operations may directly extend NodeOperation. The execPixels() call is single-threaded. In execPixels() you may execute parts of the code multi-threadedly with WorkManager as you see fit. This flexibility is useful for current single-threaded operations that cannot be executed in multiple threads as a whole but where parts of them can, e.g. glare operations.
  9. Apply constant folding logic where possible in each node's convertToOperations:
    1. Add a SocketProxyOperation instead of the real operation when it's not needed and you want to reconnect to the next operation (e.g. MixOperation when the factor is 0 or 1).
    2. Do not connect inputs when we don't need to read them (e.g. MixOperation: do not connect the input with 0 influence).
  10. Deduplicate input nodes in NodeGraph.
  11. Once everything is implemented:
    1. Delete COM_ChunkOrder, COM_ChunkOrderHotspot, COM_ExecutionGroup, COM_WriteBufferOperation, COM_ReadBufferOperation, COM_MemoryProxy, COM_SingleThreadedOperation.
    2. Remove unused methods/variables from the previous implementation that were kept to avoid compilation errors, especially in NodeOperation, SocketReader and CompositorContext.
    3. Remove from UI two-passes, chunk, OpenCL and buffer group options.
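The getInputAreaOfInterest() idea from step 8 can be sketched for the bilinear-sampling case: given the rect an operation must render, return the rect it needs to read from a specific input, expanded by one pixel on each side. The rcti struct mirrors Blender's rect type; the function name is illustrative.

```cpp
#include <cassert>

// Minimal rect type, mirroring Blender's rcti field layout.
struct rcti {
  int xmin, xmax, ymin, ymax;
};

// Hypothetical area-of-interest for an input sampled bilinearly:
// the output rect grown by one pixel in each direction, so samples
// at the border can interpolate with their neighbors.
rcti bilinear_area_of_interest(const rcti &output_rect)
{
  return {output_rect.xmin - 1, output_rect.xmax + 1,
          output_rect.ymin - 1, output_rect.ymax + 1};
}
```

A translate or blur operation would return a differently shaped rect; the point is that each operation answers only for its own inputs, with no need to walk children as determineDependingAreaOfInterest did.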

Some diagrams; they just show the most important things:



Taking a look at all of this quickly, I honestly can't help but think the Blender compositor needs to be completely separated from Blender, and the whole thing just needs to follow some industry commonalities.

There are huge bottleneck problems for Blender when it comes to rendering and comping if you're not doing something straight out of the box from the Blender studio's workflow.

There's so much to discuss here I'm not sure where to start. On top of that, I'm pretty sure Blender devs will reject most of it and pass it off as too much.

I can't help but think that this is a waste of time, the same as the dev work on the VSE.

I understand the need for the VSE and the Compositor being developed 10 or 20 years ago. It doesn't make sense anymore, though, unless the pipeline becomes more integrated with itself and, most importantly, the rest of the world.

From what I know, at the moment the current code doesn't fit with the rest of the world. I'm pretty sure that if we continue on this path, Blender still won't be able to support "Open Effects" (OpenFX), the entry-level standard for compositors. So it might be time to drop it.

Blender IMO would honestly be better off taking some of its devs and having them focus on building better integration with, and improving, Kdenlive and Natron. This would actually be more beneficial for studios and open source than just the Blender studio's typical workflow.

I know this is pretty cut-throat. Honestly, we need to be addressing serious workflow bottleneck issues that cost weeks, if not months, of lost time and money.

Open to chat about it.

@adam earle (adamearle) please use appropriate channels for such an open discussion.
If you have comments about the design itself, please make them in the ticket.

Your comment does not help us or this design. I am open to any discussion as long as we can do it in a respectable way.

I have no idea what to say to you. What I can say is that I genuinely have good intentions.

So honestly, do you just want to tinker with Blender? or do you really want to help the industry? Which is it?

The best thing I can suggest is to organize talks with some fast-operating compositors from the industry: compers that use Nuke, Fusion, After Effects and so on. I mean really sit down, listen, and take notes on things like:

What are they doing every day?
What are the choke points of projects?
What's taking up the most time?
How do you mitigate rendering, animation, lighting, tracking and alignment problems?
How do you have a life outside of work? (that's a joke).

I'm sure that if you do that, things here may just change a little.

If you can work at this level with the compositor, then honestly you can make a huge shift in the world for the better.

Here's the thing, though: it's up to you, mate, and how much you want to be a part of the compositing community.

So are you a tinkerer, or are you a lifer like me? The rest is up to you and your research. Good luck, mate. I honestly don't know what else to say to you.

@adam earle (adamearle) thanks for the feedback. I think it's better to continue this discussion on devtalk. As you can see in this nice post it's more helpful to have concrete complaints or suggestions.
For example, why is the current state bad? What are you trying to achieve and what are your complaints using the compositor: missing a specific node? Blender too slow? A button in the wrong place? etc.