I have been looking for ways to implement some simple Blender Video Sequence Editor (VSE) speed improvements, and as a work-in-progress proof of concept I created a patch that adds a new Blender module: Blender Processing Library (a.k.a. BPL for now, but feel free to suggest something better).
It is a new module designed to process data efficiently (at the moment 2D images or 1D buffers). In general I would expect it to be used for filtering images (blurs and other effects) and for other bulk processing (color conversions etc.). The usual stuff. As a first step I implemented a Gaussian blur with it and compared it to the current VSE Gaussian blur implementation. One idea would be to start moving other VSE effects into this processing lib. Some basic benchmarks:
- Render: 45-frame 1280x720 image sequence (PNGs; floating-point formats are slower)
- Blur radius: 50x50 (I used a largish radius so the relative speed differences are a bit easier to see)
- VSE cache set to 1 megabyte in the settings to avoid using it
- Average blur time per frame in seconds, timed using PIL_check_seconds_timer() (check the console output):
CPU: Intel i7-4790K 4.0GHz (quad core)
GPU: NVIDIA GTX 970
VSE original: 0.19s
BPL CPU: 0.12s (roughly 30% faster)
BPL OpenCL: 0.01s (19 times faster)
CPU: AMD E2-1800 1.7GHz (dual core, super cheap laptop, just idling almost maxes out the CPU meter...)
GPU: AMD Radeon 7340 (integrated)
VSE original: 2.5s
BPL CPU: 2.8s (roughly 10% slower)
BPL OpenCL: 0.18s (14 times faster)
OpenCL clearly beats both CPU versions by a huge margin, but overall I'm satisfied with the BPL CPU performance in this case, especially as it is running generic C++/OpenCL kernels against a hand-crafted C blur (in seqeffects.c). Both the BPL CPU version and the original Blender blur use the Blender threading API to process image data in parallel, but in the BPL case it can be abstracted away better. The BPL CPU version is also super slow in debug builds because template inlining and other compiler optimizations are off.

In practice only the OpenCL version can calculate large-radius blurs during playback (24 FPS) without stuttering. The heavier the operation, the bigger the win for OpenCL. Forcing the sequencer to work with floats (or having float images in the sequence) slows things down even with OpenCL. It seems like float memory copying and/or conversion overhead on the CPU is starting to show, as increasing the blur radius has very little effect on OpenCL speed but bogs down both CPU versions. I also tried blurring some 4K-resolution images, and the OpenCL version was usually around 8 to 10 times faster than the original blur. With a smaller blur radius like 10 instead of 50, OpenCL is usually about 5-6 times faster than the original.
To test the patch, apply the .diff to the master branch. I use Windows and Visual Studio 2017, but the patch should work on other platforms too. Some things might need at least a C++11-compatible compiler; is this okay? Not sure if it compiles on VS 2013, hopefully it does... After building, there should be a new setting in the user preferences System tab: "Processing device". "Legacy" is the old, unmodified behaviour; "CPU" and "OpenCL" are the new ones. Only the "Gaussian Blur" effect in the VSE supports this setting at the moment. Some performance timings are printed to the console after each blur operation.
I listed some of my thoughts and ramblings below; feel free to comment on any of them.
- Are there any bigger plans already for VSE (or overall) that might conflict with this?
- The idea is to write the algorithm's pixel kernel only once and run the same code on CPU or OpenCL, silently falling back to CPU if the GPU is not supported or fails (out of device memory etc.). Currently the only filter implemented is the Gaussian blur, but it is pretty easy to add more. As a note, there are at least four image Gaussian blur implementations at the moment: the normal compositor version, the compositor SSE2 version, the compositor OpenCL version and the threaded version in the sequencer. The glow effect has some sort of blur too, so maybe five Gaussian blurs? There are also several gauss table generation implementations (one with a TODO comment mentioning this fact). In an ideal world these would all use this new processing lib (modifying the compositor would be a lot trickier) and we could remove at least the blur implementations in "seqeffects.c".
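To make the single-source idea concrete, here is a minimal sketch of how the same kernel text can compile as plain C++ for the CPU fallback (the shim names and the `halve` kernel are made up for illustration, not the actual BPL code; OpenCL would launch one work-item per element instead of the loop):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// OpenCL-to-C++ compatibility shims so the kernel below is valid C++ too.
#define __kernel
#define __global
typedef uint8_t uchar;
static int bpl_global_id = 0;                 // CPU stand-in for the work-item id
static int get_global_id(int) { return bpl_global_id; }

// The pixel kernel, written once in the OpenCL/C subset.
__kernel void halve(__global uchar *dst, __global const uchar *src) {
    int i = get_global_id(0);
    dst[i] = (uchar)(src[i] / 2);
}

// CPU execution: iterate the global-id range ourselves.
inline std::vector<uchar> run_halve_cpu(const std::vector<uchar> &src) {
    std::vector<uchar> dst(src.size(), 0);
    for (bpl_global_id = 0; bpl_global_id < (int)src.size(); ++bpl_global_id)
        halve(dst.data(), src.data());
    return dst;
}
```

The OpenCL path would hand the same kernel text (via stringification or a shared source file) to the OpenCL compiler at runtime.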
- The C++ CPU version essentially implements an optimized metaprogramming "environment" that transforms OpenCL kernels into optimized templated code. It also supports SSE2 if available at compile time (SSE2 seems to be required by the official Blender system requirements; is there any need for a fallback version?). This code generation can cause some code bloat, but it is possible to adjust the balance between performance and code size. The code can also be hard to read and modify for a beginner, but it is not that many lines in total, and it's relatively easy to write new kernels if you don't care about the gory implementation details hidden under the hood. Try not to have a heart attack when you see the BPL_EXECUTE_IMAGE_KERNEL macro; it can be simplified a lot if we don't require absolute top performance :)
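The core trick behind the templated CPU path can be sketched in a few lines: passing the per-pixel kernel as a template parameter lets the compiler inline it into the inner loop (names here are illustrative, not the actual BPL entry points; this also shows why debug builds, with inlining off, are so much slower):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Apply a per-pixel kernel over a buffer; the kernel call can be fully
// inlined because its concrete type is known at compile time.
template <typename Kernel>
void for_each_pixel(std::vector<float> &pixels, Kernel kernel) {
    for (std::size_t i = 0; i < pixels.size(); ++i)
        pixels[i] = kernel(pixels[i]);
}

// Example operation built on top of the generic driver.
inline void scale_and_bias(std::vector<float> &pixels, float s, float b) {
    for_each_pixel(pixels, [s, b](float v) { return v * s + b; });
}
```

Each distinct kernel type instantiates its own copy of the loop, which is where the code-bloat-versus-speed trade-off comes from.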
- Uses OpenCL instead of OpenGL. Large images with complicated OpenGL shaders can easily hit hardware limits. Plus, OpenCL code easily passes as normal C code, so we can use a subset that works on both GPU and CPU. There are also APIs that can be used to share data between OpenCL and OpenGL.
- API to easily chain operations; many filters are basically just a series of simple operations run one after another. A key to good performance, especially with GPUs, is to keep the data in device memory. The library uses two memory buffers that can ping-pong data inside the GPU and read it back only when needed (usually when all operations have been performed). For now it is easy to replace individual filters and still get a big performance boost, but in a perfect world effects could be chained together if we only care about the final result. A simple code example:
```c
// Blur an 8-bit, 4-channel (RGBA) image
ProcessOperation *op = BPL_process_image_8bit(image, width, height, 4); // Start operation
BPL_op_image_blur(op, 10, 10); // 10x10 blur
...                            // Chain other filters if wanted
BPL_op_image_copy(op, image);  // Read blurred image data back. Could put it into another buffer, too.
BPL_end_operation(op);         // End operation and free any temporary data
```
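The ping-pong scheme mentioned above can be sketched like this: each operation reads from one buffer and writes to the other, then the roles swap, so intermediate results never leave device memory (this is a CPU-side illustration with made-up names, not the BPL internals):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two buffers that swap roles after every operation.
struct PingPong {
    std::vector<float> a, b;
    std::vector<float> *src = &a, *dst = &b;

    explicit PingPong(std::vector<float> initial) : a(std::move(initial)) {
        b.resize(a.size());
    }

    template <typename Op>
    void apply(Op op) {
        for (std::size_t i = 0; i < src->size(); ++i)
            (*dst)[i] = op((*src)[i]);
        std::swap(src, dst);  // the output of this op becomes the next input
    }

    // Only read this back (i.e. copy from the device) when the chain is done.
    const std::vector<float> &result() const { return *src; }
};
```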
- Flexibility. Supports 32-bit floating-point and 8-bit images with 1 to 4 color channels (from R to RGBA). Might it be easier to internally use only floats and convert back as needed? Or would that be too bad for performance / memory use?
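For reference, the conversion that a floats-only internal format would pay on every frame is the usual normalize/denormalize round trip (a generic sketch, not BPL code), plus 4x the memory per channel:

```cpp
#include <cassert>
#include <cstdint>

// 8-bit byte -> normalized float in [0, 1].
inline float byte_to_float(uint8_t v) { return v * (1.0f / 255.0f); }

// Normalized float -> 8-bit byte, clamped and rounded to nearest.
inline uint8_t float_to_byte(float v) {
    if (v <= 0.0f) return 0;
    if (v >= 1.0f) return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}
```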
- There is some amount of OpenCL / C++ compatibility stuff in there to run the kernels on the CPU; could it maybe use Cycles' "util_types.h" instead, or something similar? Or is it bad to depend on semi-external libraries like that? Maybe add CUDA support at some point, like in Cycles? I haven't checked yet how the code sharing is implemented there.
- Similar to the compositor, BPL is internally written in C++ but exposes only a public C-API. "BPL_processing.h" has the basic API and it should be the only file to be used by other parts of Blender.
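The C-API-over-C++ pattern mentioned above typically looks like this: the public header declares an opaque handle and extern "C" functions, while the implementation casts the handle to the real C++ class (all names below are illustrative, not the actual contents of "BPL_processing.h"):

```cpp
#include <cassert>
#include <string>

// Internal C++ implementation, invisible to C callers.
struct ProcessOperationImpl {
    std::string device_name;
};

extern "C" {
// Opaque handle as it would appear in the public C header.
typedef struct BPL_ProcessOperation BPL_ProcessOperation;

BPL_ProcessOperation *BPL_op_create(void) {
    return (BPL_ProcessOperation *)new ProcessOperationImpl{"CPU"};
}
void BPL_op_free(BPL_ProcessOperation *op) {
    delete (ProcessOperationImpl *)op;
}
const char *BPL_op_device(const BPL_ProcessOperation *op) {
    return ((const ProcessOperationImpl *)op)->device_name.c_str();
}
}
```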
- Thread safety, especially when using shared OpenCL resources.
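One straightforward way to handle this is to serialize access to the shared OpenCL context/queue through a mutex owned by the device wrapper, so concurrent effect strips can't enqueue into the same command queue at once (SharedDevice is a sketch with made-up names, not the BPL design):

```cpp
#include <cassert>
#include <mutex>

// Wrapper around a shared device queue; all enqueues go through one lock.
struct SharedDevice {
    std::mutex queue_mutex;
    int enqueued = 0;  // stand-in for real queued work

    void enqueue_operation() {
        std::lock_guard<std::mutex> lock(queue_mutex);  // released on return
        ++enqueued;
    }
};
```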
- The patch contains small amounts of code copy-pasted from other Blender modules, because I have been avoiding touching any files I don't have to; in the future these should all be merged.
- This is a rough work-in-progress patch, but if I keep working on it, does it have a realistic chance of getting into the Blender codebase at some point? I am willing to keep maintaining it.