This is an idea/proposal for a new compositor system implementation. I've been implementing it in a branch called compositor-up. I still consider it experimental. My intention is to keep working on it; I'm not trying to push it into Blender now, but with the perspective of it being integrated in the future. I'm aware it's hard for this to be accepted, as I've done it on a whim without consulting any Blender developers, and if it is accepted a lot of requirements will come. I'm willing to fulfill any requirements once I've explained the reasoning behind what I did, so everyone can decide better.
I'd like to clarify that I haven't taken this idea or solution from anywhere. For good or for bad, this solution comes from my own experience developing software in general and Blender itself, though I'm still using some ideas and code from the current compositor and Cycles.
- Improve overall performance while being efficient in memory usage (for caching, the user can choose how much RAM to use).
- Be able to cache at any point of the node tree and uniquely identify the cached image by the node tree state it was saved in. So if the node tree changes and at any time gets back to the state it had when the cache was saved, the cache can be retrieved.
- Make it easier to implement image algorithms and provide a way to write code that is compatible with both CPU and GPU execution, abstracting the compute system used for the GPU.
This is a class diagram of the main classes:
Buffering, writing and reading pixels
- In the current Blender implementation, for many operations writing and reading calls are executed pixel by pixel on every read (results are not saved to a buffer). So if there are many operations reading from a previous operation in the tree (several connections to its output socket), each pixel is calculated as many times as there are readers. For complex operations this is not an issue, as they are surrounded by a WriteBufferOperation and a ReadBufferOperation which behave as buffers, but these buffers are not recycled or freed until the end of the execution. So memory usage can be very high for trees with many nodes containing complex operations. The system is tile-based and executes the whole tree for every single tile, from output to input, in threads, so that it can show the tiles to the user as soon as possible.
- In this new system, instead of executing the tree in tiles, the writing jobs are launched per operation, and the entire operation result is written before the next operation is executed. The intention is to display the entire image result as soon as possible, to be able to cache operation results, and to have the whole operation result available for reading in the next operation, so that it is easier to implement image algorithms. Calculation and writing of pixels are always done on a pre-created/recycled full-operation-size buffer. For CPU writing, the operation is divided into as many rectangles as threads the system can execute for best performance; there is no need for the user to choose a "chunk size" anymore. For GPU writing, the work is divided depending on the best work group size for the GPU device. Writing is done once and only once for every operation, no matter how many readers there are. Operation buffers are always recycled once writing and reading of the operation have finished. If an image algorithm needs to create buffers for calculations, the buffer recycler may be used to request them, and once they are not needed they must be given back to the recycler. This avoids a lot of buffer allocations and deallocations, which could affect performance and memory consumption.
- The operation buffer class is TmpBuffer. This class is used for both CPU buffers and GPU (ComputeDevice) buffers. Depending on whether the operation is computed or not, BufferManager will automatically create operation buffers or map/unmap their memory from host to device and vice versa. When you are coding image algorithms you don't have to worry about this.
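As a rough illustration of the recycling idea described above, here is a minimal sketch. The class and method names here are hypothetical, not the actual TmpBuffer/BufferManager API: buffers are pooled by size and handed back out instead of being reallocated.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Hypothetical buffer recycler: pools float buffers by size so repeated
// operations reuse memory instead of allocating and freeing constantly.
class BufferRecycler {
 public:
  // Return a buffer with room for `n_floats`, reusing a pooled one if possible.
  std::unique_ptr<std::vector<float>> take(size_t n_floats)
  {
    auto &pool = pools_[n_floats];
    if (!pool.empty()) {
      auto buf = std::move(pool.back());
      pool.pop_back();
      return buf;
    }
    return std::make_unique<std::vector<float>>(n_floats);
  }

  // Give the buffer back once all readers of the operation have finished.
  void give_back(std::unique_ptr<std::vector<float>> buf)
  {
    pools_[buf->size()].push_back(std::move(buf));
  }

  // Number of idle buffers currently pooled for a given size.
  size_t pooled_count(size_t n_floats) const
  {
    auto it = pools_.find(n_floats);
    return it == pools_.end() ? 0 : it->second.size();
  }

 private:
  std::map<size_t, std::vector<std::unique_ptr<std::vector<float>>>> pools_;
};
```

The point of the sketch is only the lifecycle: take, write, read, give back, and the next operation of the same size reuses the exact same memory.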
The compositor tree execution is done in two phases:
- First phase (OperationMode::Optimize): All operations are executed in the same order as in the second phase, but the image algorithms that write buffers are not executed. They just call a ReadsOptimizer that counts all the reads each operation receives. Knowing how many reads an operation will have during the "Exec" phase makes it possible to know exactly when operation buffers are not needed anymore (when an operation's read count reaches the count calculated during the "Optimize" phase, all its readers have already read it and the buffer may be recycled). Furthermore, it's used to optimize cache loading, because it registers the order in which caches have been requested, so that during the Exec phase we can get the current operation's cache and launch a prefetch task for the next cached operation buffer while other operations are being written. The following sequence diagram shows the application flow every time an operation's getPixels() is called during OperationMode::Optimize.
- Second phase (OperationMode::Exec): All operations run normally, executing all image algorithms and writing the results to buffers. The following sequence diagram shows the application flow every time an operation's getPixels() is called during OperationMode::Exec.
Note: The diagrams may not exactly match the function names and all the function calls involved in the real implementation, but they closely represent them. I simplified them for the sake of better understanding the flow of the application.
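The read-counting idea of the two phases can be sketched as follows. These are hypothetical names, not the real ReadsOptimizer/BufferManager code: the first phase only counts reads per operation, and the second phase decrements that count, signalling that the buffer can be recycled when the last expected reader finishes.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of two-phase read accounting. In Optimize mode
// getPixels() only registers a read; in Exec mode each read consumes one
// of the counted reads, and the buffer is recyclable at zero.
struct ReadsPlan {
  std::map<std::string, int> total_reads;    // filled during Optimize
  std::map<std::string, int> pending_reads;  // consumed during Exec

  // Optimize phase: count one more reader of this operation's result.
  void optimize_read(const std::string &op_key)
  {
    total_reads[op_key]++;
  }

  // Switch to the Exec phase with the counts gathered so far.
  void start_exec()
  {
    pending_reads = total_reads;
  }

  // Exec phase: returns true when this was the last expected read,
  // i.e. the operation's buffer may be given back to the recycler.
  bool exec_read(const std::string &op_key)
  {
    return --pending_reads[op_key] == 0;
  }
};
```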
Using hashes as part of operations tree state IDs (OpKeys)
Using hashes as IDs is in most cases a bad idea because of hash collisions. Still, this whole system is based on that idea, because the benefits it brings in this scenario are by far greater than the downside of accepting a minimal chance of an "OpKey collision" (an OpKey contains hashes) that would result in a bad render. An OpKey contains a representation of an operation tree state, taking into account all the parameters and operation types involved in producing the operation's image result. All operations generate their own OpKey on compositor execution initialization.
Benefits of using hashes for building OpKeys:
- They're automatically regenerated on every compositor execution, and if nothing has changed with respect to the previous execution they will be the same. There's no need to save them between executions.
- Being able to cache at any point of the tree without having to implement UI logic that tells the compositor system "I have modified this part of the tree, so you should invalidate the caches ahead of it". Nor having to save the previous node tree graph with all its parameters and check it against the new node tree graph to see whether there has been a change and where.
- If 2 or more operations have exactly the same tree with the same parameters from their point back to the root input (the same node tree state), they will be written only once, no matter that they are in different branches.
- It's an easy way to uniquely identify operation results within the current compositor execution and between executions. For example, to save disk cache files I may just base-encode all the OpKey field values in the filename; on a cache request an OpKey is given, and from it the cache filename can be derived without having to create an index.
- As caches are uniquely identified by values that represent the operation's node tree state at the moment of saving, they never need to be invalidated (deleted) because of a change in the tree. If the tree ever gets back to the same state as when the cache was saved, the cache will be retrieved, no matter the changes in between. The following video shows the working implementation; keep an eye on the progress bar:
These are the fields of an OpKey:
Note: The current OpKey implementation does not include some of these hash fields yet, but it will.
- op_width -> operation width
- op_height -> operation height
- op_datatype -> type of image data (number of channels used per pixel)
- op_type_hash -> return value of the C++ typeid().hash_code() for the current operation class type
- op_state_hash -> combined hash of the current operation's op_type_hash and all its parameter hashes
- op_tree_type_hash -> combined hash of the current operation's op_type_hash and those of all its input operations, recursively down to the last input in the tree
- op_tree_state_hash -> combined hash of the current operation's state (op_type_hash + parameter hashes) and the states of all its input operations, recursively down to the last input in the tree
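A minimal sketch of how op_tree_state_hash can be built (illustrative code, not the branch's actual implementation), combining each operation's own state hash with its inputs' tree hashes recursively using a boost-style hash combine:

```cpp
#include <cstdint>
#include <vector>

// Boost-style 64-bit hash combine (illustrative; the branch may use a
// different mixing function).
inline uint64_t hash_combine(uint64_t seed, uint64_t v)
{
  return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Hypothetical operation node: its own state hash (type + parameters)
// plus its input operations.
struct Op {
  uint64_t state_hash;
  std::vector<const Op *> inputs;

  // Recursively combine this operation's state with its whole input tree.
  uint64_t tree_state_hash() const
  {
    uint64_t h = state_hash;
    for (const Op *in : inputs) {
      h = hash_combine(h, in->tree_state_hash());
    }
    return h;
  }
};
```

Two operations with identical subtrees and parameters end up with the same tree state hash, which is exactly the property that lets the system write such results only once.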
Every operation has an OpKey which uniquely identifies the operation's node tree state within the current compositor execution and between executions. For an OpKey collision to happen, all OpKey fields must be the same for 2 operations that should render different image results (because they have different node tree states). This is different from 2 operations in different node tree branches that have the same node tree state and produce the same image result; in that case, the fact that the 2 operations have the same key is exactly what should happen, and the image result for both operations will be written only once. The following video demonstrates this last case in the working implementation (when duplicating nodes with the same parameters, rendering would take more time if they were written twice):
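The "no index needed" property for disk cache filenames can be sketched like this. The encoding below (hex fields joined with underscores) is purely illustrative, not the branch's actual base encoding; the point is that the filename is recomputable from any OpKey, so no lookup table is required.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical OpKey struct with the fields listed in this post.
struct OpKey {
  int op_width, op_height, op_datatype;
  uint64_t op_type_hash, op_state_hash;
  uint64_t op_tree_type_hash, op_tree_state_hash;
};

// Derive a cache filename directly from the key fields (illustrative
// hex + '_' encoding): same key -> same filename, no index needed.
std::string cache_filename(const OpKey &k)
{
  std::ostringstream os;
  os << std::hex << k.op_width << '_' << k.op_height << '_' << k.op_datatype
     << '_' << k.op_type_hash << '_' << k.op_state_hash << '_'
     << k.op_tree_type_hash << '_' << k.op_tree_state_hash << ".cache";
  return os.str();
}
```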
For the system to update correctly on any change, it's required to implement the "hashParams" method in all operations and, inside it, call "hashParam(param)" on every parameter that, if changed, would change the operation's image result. Otherwise the system won't update when that unhashed parameter changes, because output views are cached and they would have the same OpKey as in the previous execution. So you'd easily notice it sooner or later.
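A sketch of the hashParams contract described above. This is hypothetical base-class code (assuming a boost-style hash combine), not the branch's real API; it only shows why every result-affecting parameter must be fed to hashParam().

```cpp
#include <cstdint>
#include <functional>

// Hypothetical base class: accumulates a combined hash of all parameters
// that were explicitly registered via hashParam().
struct HashedOperation {
  uint64_t params_hash = 0;

  template<typename T> void hashParam(const T &param)
  {
    // Boost-style combine of each parameter into the running hash.
    params_hash ^= std::hash<T>{}(param) + 0x9e3779b97f4a7c15ULL +
                   (params_hash << 6) + (params_hash >> 2);
  }
};

// Example operation: both scale factors change the image result,
// so both must be hashed; forgetting one would leave stale cached views.
struct ScaleOperation : HashedOperation {
  float scale_x = 1.0f, scale_y = 1.0f;

  void hashParams()
  {
    hashParam(scale_x);
    hashParam(scale_y);
  }
};
```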
OpKey collisions probabilities
I've found this website that shows a simple table of hash collision probabilities, assuming the hash method is very good:
Let's say the OpKey had only the "op_tree_state_hash" field, which is what really represents the operation tree state (the other hashes are added to make an OpKey collision much harder to happen):
Example 1: No cache nodes. A compositor tree with 950 nodes and an average of 2 operations per node = 1900 OpKeys. The closest value in the table for 64 bits is 1921 hashes -> 1 bad render per 10 trillion (10^13) renders.
Example 2: With cache nodes and disk cache. A compositor tree with 950 nodes and an average of 2 operations per node + 600 thousand disk cache files = 601900 OpKeys. The closest value in the table for 64 bits is 607401 hashes -> 1 bad render per 100 million renders.
But for an OpKey collision to happen, the other hashes would have to be the same too, so the chances of a collision are much, much lower.
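The table values above follow the standard birthday-problem approximation for a 64-bit hash: for n keys, the collision probability is roughly

```latex
p_{\text{collision}} \approx \frac{n(n-1)}{2 \cdot 2^{64}} \approx \frac{n^{2}}{2^{65}}
```

Plugging in n = 1921 gives p ≈ 10^-13 (1 in 10 trillion), and n ≈ 6 × 10^5 gives p ≈ 10^-8 (1 in 100 million), matching the examples above.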
Let's consider the worst-case scenario: the user has gotten a bad render because of an OpKey collision. If he notices the bad render, he would probably select the frame where it occurred and try to re-render it. Some inputs automatically produce different hashes, as the RenderLayers node does because it re-renders the scene; that would fix the frame. If that doesn't fix it, he may tweak anything in the compositor tree and it's fixed. Even without knowing why it happened, it's very likely he would try these things.
Implementing image algorithms in operations
- In the current Blender implementation, when implementing an operation you have to take many things into account because the system is tile-based: many operations may be implemented with per-pixel methods, and not all operations are buffered. As many image algorithms need more context than just the current pixel or tile, there isn't a ubiquitous way in which operations must be implemented for any given image algorithm; it depends on the needs of the algorithm. So when you need more context than just the current pixel, many more methods are involved, for example for initializing tile data and declaring the area you need to read from input operations so that they are buffered.
- In this new system you just need to implement execPixels(), where you first get the whole buffers of all the input operations you need to read by calling getPixels() on them. Then you may define a lambda function with the image algorithm; it will receive, one by one, all the rectangles that must be written, on multiple threads. For an operation that has Compute (OpenCL) support, instead of writing a CPU function you define a kernel method in the same cpp file, which can be executed either by the CPU as C++ code or as an OpenCL kernel on the GPU when it is available and enabled. As in the previous system, you still need to implement the determineResolution() method when the operation has a predetermined resolution. On CPU writes you receive a WriteRectContext which you can use to check the total number of rects and the current pass you are in. This is useful for complicated algorithms in which you may need several passes to do precalculations before writing; you may override the method getNPasses() (the default is 1) to tell how many passes you want. See ToneMapOperation as an example. If the operation doesn't support computing because it can't be implemented in a kernel, you have to override the method isComputed() and return false; by default it is true, as I'm trying to implement as many operations as possible so they can be executed with OpenCL. In this branch, a lot of operations are already implemented with OpenCL support that in the current Blender compositor aren't. At the end of execPixels() you always have to call either cpuWriteSeek() or computeWriteSeek(), depending on whether the operation is computed or not, so that the write is executed. The call is synchronous.
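To make the shape of execPixels() concrete, here is a simplified sketch. The names cpuWriteSeek and WriteRectContext come from the description above, but the signatures are approximations of mine, and the "scheduler" here calls the lambda single-threaded for illustration only.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Rect { int xmin, ymin, xmax, ymax; };
// Pass/rect info handed to the CPU write function (fields approximated).
struct WriteRectContext { int n_rects; int n_passes; int current_pass; };

// Stand-in for the multithreaded write scheduler: here it just invokes
// the image algorithm once per rect, sequentially.
void cpuWriteSeek(std::vector<float> &dst, const std::vector<Rect> &rects,
                  const std::function<void(std::vector<float> &, const Rect &,
                                           const WriteRectContext &)> &cpu_func)
{
  WriteRectContext ctx{(int)rects.size(), 1, 0};
  for (const Rect &r : rects) {
    cpu_func(dst, r, ctx);
  }
}

// Example "invert" operation body in the style described above: inputs
// are whole buffers, the algorithm is a lambda over rects, and the
// synchronous write call is always the last step.
void execPixels(const std::vector<float> &input, std::vector<float> &output,
                int width, int height)
{
  auto cpu_write = [&](std::vector<float> &dst, const Rect &rect,
                       const WriteRectContext & /*ctx*/) {
    for (int y = rect.ymin; y < rect.ymax; y++) {
      for (int x = rect.xmin; x < rect.xmax; x++) {
        dst[y * width + x] = 1.0f - input[y * width + x];
      }
    }
  };
  // The real system picks one rect per available thread; two bands here.
  std::vector<Rect> rects = {{0, 0, width, height / 2},
                             {0, height / 2, width, height}};
  cpuWriteSeek(output, rects, cpu_write);
}
```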
Implementing compatible CPU/GPU code for computing systems (OpenCL right now)
- As you may have seen in the operations' kernel code, it uses some macros. They are needed to abstract between CPU (C++) code and the compute system code (OpenCL). They help simplify the implementation too, as you always work with image coordinates and don't need to know about buffer offsets or padding. Another reason is that the image being read may come from a SingleElemOperation and contain a single pixel; the macros are aware of this and always just read that pixel. Thanks to these abstractions, so far the code is 100% shared between C++ and OpenCL for operations that support computing. Taking into account the number of operations being implemented, it helps maintainability a lot.
- For the kernel abstractions I took code from Cycles. I've been modifying a lot of it as I discovered the needs of the compositor, so even if it seems I'm duplicating a lot of code, it's not that much.
- As everything is heavily abstracted, it should be possible to add other compute systems in the future, such as CUDA, without much work, but I don't think that is a priority.
- I made a tool called "defmerger", similar to "datatoc", so that I can write each kernel in the same cpp file as the operation. I just surround the kernel code with "#define OPENCL_CODE" and "#undef OPENCL_CODE". Later, defmerger looks through all the operation files for the code between the OPENCL_CODE tags and merges it into a single file or string. The header "#include "COM_kernel_opencl.h"" is added, and it uses a method I took from Cycles that resolves all the includes and preprocessing so that it's ready to be read by OpenCL. The tool also allows writing code specific to only C++ or only OpenCL in between kernel code, but I never needed that and it should be avoided. It would be like this:
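An illustrative sketch of a defmerger-wrapped kernel (only the OPENCL_CODE markers come from the description above; the kernel name, parameters and helper macros are placeholders, not the branch's real ones):

```cpp
/* In COM_SomeOperation.cpp, next to the operation's C++ code: */

#define OPENCL_CODE
/* defmerger extracts everything between the OPENCL_CODE markers into the
 * merged OpenCL program; the exact same text also compiles as C++. */
ccl_kernel someOperationKernel(/* read/write image parameters */)
{
  /* Kernel body shared between C++ and OpenCL, written with the
   * coordinate-based read/write macros described above. */
}
#undef OPENCL_CODE
```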
- Now there is the possibility of using vectors. The vector types implementation is taken from Cycles; I just modified or added what was needed. I've been vectorizing operation code where possible; besides improving performance, it often makes the code simpler and more readable.
- Now, any kind of sampling is always done over the result of the operation being read. In the current Blender implementation, because not all operations are buffered, sampling is done over the last buffered operation, which may or may not be the operation being read. It affects the output result very little, but it probably does slightly. Sampling an operation behind the one you want to sample, and then executing the algorithm of the operation being read on top of that, is not desirable, I think. And the behavior wasn't consistent; it depended on which operations were buffered. Now everything behaves more consistently, as all operations are buffered.
- But as all operations are now buffered, simple distort operations that in the current Blender implementation need sampling but are not buffered (scale and rotate only) cannot be concatenated one after another without sampling being applied for each one of them. So every time you use a scale or rotate node, the operation takes effect immediately and is sampled. Depending on the point of view this might be a disadvantage or an advantage, but I think a new/non-expert user would always expect the operation to take effect immediately. It can be very confusing that you scale down and then scale up and get the exact same image quality; if you insert non-buffered nodes in between it stays the same, but if you insert a buffered node the image suddenly appears pixelated (and of course the user doesn't know which nodes are buffered, nor what a buffer is). Some users might understand the behavior, but most of the time it is confusing for the average user.
- I implemented the Transform node as a single operation, so that rotate and scale are combined in one operation and sampling is only done once -> TransformOperation
Note: The current Blender implementation uses sampling for TranslateOperation too. I'm not using sampling for that operation in this implementation, as I don't think float precision is needed there; it would cause more harm than good, especially since this operation is automatically used for centering images when mixing 2 images in an operation.
Changes / New features
- Any-data sockets: I created a new socket type (green). It just indicates to the user that he can input any kind of image data (1 (gray), 3 (purple) or 4 channels (yellow)) and it won't be converted. For output sockets it means the output will be the same type of data as the main input socket. An example is the Scale node: it doesn't matter how many channels the data has, you just want to resize the image.
- Cache Node: You can place this node anywhere in the tree and the previous operation result is cached; if you modify a node after this node, everything is recalculated from this point only. If you modify a parameter or the tree structure behind the Cache Node, it will automatically recalculate everything behind it and cache it again. It behaves as both memory and disk cache. First it caches to memory; if the memory cache fills up to the limit set by the user, it discards the least recently used caches as needed, saving them to the disk cache if available (otherwise deleting them). The disk cache behaves in the same way, except the oldest caches are always deleted. If the user wants to use only disk cache, he may just enable it in the preferences and set the memory limit to 0. There is no compression at the moment; an option will probably be added in the future.
- Previews and viewers are now cached: This is necessary because the UI may have glitches or call the compositor execution when no update is really needed, which in fact happens. For example, if you disconnect a socket by pressing without releasing and connect it again in the same place, the compositor recalculates everything when it's not necessary. Now, if such a thing happens, the compositor operation hashes are exactly the same, so the cached previews and viewers are returned very quickly.
- Option to change Preview Quality: Previously, previews were always 140 pixels; if you zoomed in or increased the size of the nodes you would see pixelated previews. Setting previews to high quality has almost no effect on performance. Right now you may choose:
- Low Quality = 150 pixels (default)
- Medium Quality = 300 pixels
- High Quality = 450 pixels
- Option to scale inputs down: This option is a fast way to reduce the size of inputs (images, renders, textures, masks, video clips...). Most of the time the user doesn't need them at their original size, only when rendering the final result. So when working and testing different parameters in the nodes, instead of zooming out the view, the user may try scaling down the inputs with this option, because it will increase performance a lot and at the same time reduce the size of the viewer. It affects the resolution of all the nodes from input to output. It needs all the node values to work with relative values to produce the same result as at 1.0 scale. So for nodes with operations that use absolute values or pixel-based effects, scaling down will produce a stronger effect. These issues are meant to be resolved once I implement relative space: all values will be converted to relative space, and scaling down should then produce results closer to full resolution, though it will never be perfect.
- Video Sequencer Node?: Allows you to use any video sequencer channel as a compositor input. It may seem illogical, as the VSE comes after the compositor in the render pipeline, but many people find it very useful, and it's faster than the Movie Clip node because it benefits from the VSE cache. As of now, though, it's not possible for this to make it into Blender, because the VSE calls the render pipeline when using scene strips, and that causes a deadlock when the Video Sequencer Node is used. I'm circumventing the issue by calling a method that is not multithreading-ready, but that's obviously not good; it causes OpenGL drawing errors in VSE previews. If the VSE, instead of calling the render pipeline on the main thread, launched an asynchronous job and got the results from files, I think it would be possible, and the UI wouldn't hang as much when using scene strips in rendered mode. But I don't know much about this area; I'm just trying to find a solution.
- Pixelate Node: Previously this node required the user to surround it with scale nodes with inverse values to get a desirable effect. Now a size option has been added; there's no need to surround it with scale nodes anymore.
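The Cache Node eviction policy described above (memory first, least recently used entries demoted to disk, or dropped when disk cache is disabled) can be sketched as follows. This is hypothetical code, not the branch's implementation:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache manager: an LRU list with a byte limit; overflowing
// entries are demoted to disk (if enabled) or simply dropped. Setting
// mem_limit to 0 with disk enabled sends everything straight to disk,
// matching the "only disk cache" configuration described above.
struct CacheManager {
  size_t mem_limit;
  size_t mem_used = 0;
  bool disk_enabled;
  std::list<std::pair<std::string, size_t>> lru;  // front = most recent
  std::vector<std::string> disk;                  // keys "saved to disk"

  void put(const std::string &key, size_t bytes)
  {
    lru.emplace_front(key, bytes);
    mem_used += bytes;
    while (mem_used > mem_limit && !lru.empty()) {
      auto [old_key, old_bytes] = lru.back();  // least recently used
      lru.pop_back();
      mem_used -= old_bytes;
      if (disk_enabled) {
        disk.push_back(old_key);  // demote oldest cache to disk
      }
    }
  }
};
```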
Removed options from UI
- Buffer groups: Not needed anymore, as all operations are now buffered.
- Chunk size: How operation writing is divided is now implementation-defined (depending on the number of threads the system can execute at full performance and the best work group size for GPU devices). This is how it should be, since the user shouldn't have to care about these things.
- Two pass: On the first pass this option skipped the execution of nodes that could take too much time and of low-priority outputs (viewers and previews, I guess), and then it executed the whole tree again on a second pass including the slow nodes. I don't think this is needed anymore, mainly because of cache nodes, the inputs scale option, and generally improved performance. It's better to just show the final image as soon as possible. Keeping this option would imply some handicaps for the implementation, especially for caching.
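The thread-based split that replaces the chunk size option can be sketched like this (hypothetical code, not the branch's scheduler): the operation's image is divided into one row band per available hardware thread, with remainder rows spread over the first bands.

```cpp
#include <algorithm>
#include <vector>

struct Rect { int ymin, ymax; };

// Split `height` rows into up to `n_threads` contiguous bands, one per
// worker; bands differ in height by at most one row.
std::vector<Rect> split_rows(int height, unsigned n_threads)
{
  n_threads = std::max(1u, std::min(n_threads, (unsigned)height));
  std::vector<Rect> rects;
  int band = height / (int)n_threads;
  int extra = height % (int)n_threads;
  int y = 0;
  for (unsigned i = 0; i < n_threads; i++) {
    int h = band + (i < (unsigned)extra ? 1 : 0);
    rects.push_back({y, y + h});
    y += h;
  }
  return rects;
}
```

In the real system the thread count would come from something like std::thread::hardware_concurrency(), so the user never has to pick a value.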
The tests were done with a 1920 x 1080 image.
Operating system: Windows-10-10.0.18362-SP0 64 Bits
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Graphics card: GeForce GTX 1060 6GB/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38
Test 1 - A bunch of nodes that are not currently buffered in the Blender compositor, but in compositor-up all of them are. With CPU:
- Blender 2.91 master: 1.15 seconds / 221.62 MB (peak memory)
- Blender 2.91 compositor-up: 0.36 seconds / 309.31 MB (peak memory)
Even though in the current Blender implementation no operations are buffered other than inputs and outputs, while in compositor-up all operations are buffered, there is no big difference in memory. That's because buffer recycling is actually working; otherwise peak memory would be much worse. No matter how many nodes you add linearly, buffers are recycled all the time. Peak memory depends more on how much branching there is from operation outputs, because the BufferManager has to keep a buffer until all branches have finished reading the operation result. So it has to keep it longer, and may need more memory.
Test 2 - A few nodes, most of them buffered in the current Blender compositor; these nodes have more complex operations. With CPU:
- Blender 2.91 master: 1.66 seconds / 724.65 MB (peak memory)
- Blender 2.91 compositor-up: 0.72 seconds / 305.82 MB (peak memory)
In compositor-up, memory usage is about the same as in test 1 even for complex operations, as buffers are recycled in the same way. The current Blender implementation keeps the buffers from start to end of the execution, so the more nodes you add with complex operations, the more memory it consumes.
Test 3 - Same as test 2 but with OpenCL:
- Blender 2.91 master: 16.45 seconds / 763.45 MB (peak memory)
- Blender 2.91 compositor-up: 0.21 seconds / 153.97 MB (peak memory)
About the time: I know very few nodes are implemented with OpenCL in the current Blender implementation, but there is clearly something wrong. It may happen when mixing nodes that can be executed with OpenCL with others that can't. Bokeh blur, I'm sure, is implemented with OpenCL, and I think it works fine if you use only that node.
About the peak RAM in compositor-up: it's lower because GPU memory is now mostly used and recycled in the same way as was previously done for RAM, so much less RAM is needed.
In compositor-up, every time you enable OpenCL it will take some time to do the first render because it has to load the OpenCL program, but from there on it should go smoothly. It's on my TODO list to compile the program once and save it in a cache.
As you may have noticed, the image results are slightly different; I'm pretty sure it's related to sampling. The same happens in test 2, but it's even less noticeable there. As I've already explained, in the current Blender implementation sampling is done over the last buffered operation, which may or may not be the operation being read, while in compositor-up it is always the operation being read. That could be part of it, but in test 3 there is another factor: I'm using native OpenCL sampling, and the algorithm or precision may be slightly different. I may just always use the Blender sampling implementation for both CPU and GPU in the future to avoid differences. My intention is to do proper render tests per node, compare them with the current Blender implementation, and fix differences where possible.
- Fix the cache node not working for complex trees when placed in several branches.
- Implement relative space.
- Clean up the code, removing prototyping stuff such as GlobalManager, and pass managers and CompositorContext as arguments instead.
- Use more Blender library methods where appropriate, for example for hashing.
- Add render tests and OpKey collision tests.
GitHub repository: https://github.com/m-castilla/blender-compositor-up