
Improving Video Sequencer performance with C++ and OpenCL
Open, Normal, Public

Tokens
"Love" token, awarded by Fulk33, miclack, tintwotin, Darkfie9825, and pauloup.
Assigned To
None
Authored By
Heikki Salo (hsalo), Nov 22 2017

Description

I have been seeking ways to implement some simple Blender video sequencer editor (VSE) speed improvements and as a WIP proof of concept I created a patch that adds a new Blender module: Blender Processing Library (aka. BPL for now, but feel free to suggest something better).

It is a new module designed to efficiently process data (at the moment 2D-images or 1D-buffers). In general I would expect it to be used for filtering images (blurs and other effects) and perform other bulk processing (color conversions etc.). The usual stuff. As a first step I implemented a gaussian blur with this and compared it to the current VSE gaussian blur implementation. One idea would be to start moving other VSE effects into this processing lib. Some basic benchmarks:

Render: 45-frame 1280x720 image sequence (PNGs; floating point formats are slower)
Blur radius: 50x50 (used a largish radius so relative speed differences are a bit easier to see)
VSE cache set to 1 megabyte in settings to avoid using it
Average blur time per frame in seconds, timed using PIL_check_seconds_timer() (check the console output):

CPU: Intel i7-4790K 4.0GHz (quad core)
GPU: NVIDIA GTX 970
VSE original: 0.19s
BPL CPU: 0.12s (roughly 30% faster)
BPL OpenCL: 0.01s (19 times faster)

CPU: AMD E2-1800 1.7GHz (dual core, super cheap laptop, just idling almost maxes out the CPU meter...)
GPU: AMD Radeon 7340 (integrated)
VSE original: 2.5s
BPL CPU: 2.8s (roughly 10% slower)
BPL OpenCL: 0.18s (14 times faster)
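
The multipliers and percentages above follow from the per-frame times. A minimal sketch of that arithmetic; the function names are illustrative only and are not part of the patch:

```cpp
#include <cassert>

// How many times faster the new time is compared to the baseline.
inline double speedup_factor(double baseline_s, double new_s) {
    assert(baseline_s > 0.0 && new_s > 0.0);
    return baseline_s / new_s;
}

// Fraction of per-frame time saved relative to the baseline.
// A negative result means the new version is slower.
inline double time_saved_fraction(double baseline_s, double new_s) {
    assert(baseline_s > 0.0);
    return (baseline_s - new_s) / baseline_s;
}
```

With the rounded i7 times, speedup_factor(0.19, 0.01) gives 19 and time_saved_fraction(0.19, 0.12) gives about 0.37. For the E2-1800, time_saved_fraction(2.5, 2.8) gives -0.12 and speedup_factor(2.5, 0.18) gives about 14. The published times are rounded, so these are only ballpark figures.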

OpenCL clearly beats both CPU versions by a huge margin, but overall I'm satisfied with the BPL CPU performance in this case, especially as it is running generic C++/OpenCL kernels against a hand-crafted C blur (in seqeffects.c). Both the BPL CPU version and the original Blender blur use the Blender threading API to process image data in parallel, but this can be abstracted away better in the BPL case. The BPL CPU version is also very slow in debug builds because template inlining and other compiler optimizations are off.

In practice only the OpenCL version can calculate large-radius blurs during playback (24 FPS) without stuttering. The heavier the operation, the bigger the win for OpenCL. Forcing the sequencer to work with floats (or having float images in the sequencer) slows things down even with OpenCL. It seems that float memory copying and/or conversion overhead on the CPU is starting to show, as increasing the blur radius has very little effect on OpenCL speed but bogs down both CPU versions. I also tried blurring some 4K images, and the OpenCL version was usually around 8 to 10 times faster than the original blur. With a smaller blur radius such as 10 instead of 50, OpenCL is usually about 5-6 times faster than the original.
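
To illustrate why a larger radius costs more on the CPU, here is a minimal sketch of a Gaussian weight table and one horizontal pass over a single-channel float image. This is not the BPL or seqeffects.c code: the function names, the sigma = radius / 3 choice and the clamp-to-edge handling are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized 1D Gaussian weights for taps -radius..+radius.
// sigma = radius / 3 is an assumption chosen for illustration.
std::vector<float> make_gauss_table(int radius) {
    assert(radius >= 0);
    const float sigma = radius > 0 ? radius / 3.0f : 1.0f;
    std::vector<float> weights(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        weights[i + radius] = std::exp(-(i * i) / (2.0f * sigma * sigma));
        sum += weights[i + radius];
    }
    for (float &w : weights) {
        w /= sum;  // make the weights sum to 1 so brightness is preserved
    }
    return weights;
}

// One horizontal pass; src and dst must be different vectors.
// Each output pixel reads 2 * radius + 1 taps, so cost grows with the radius.
void blur_horizontal(const std::vector<float> &src, std::vector<float> &dst,
                     int width, int height, int radius) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const std::vector<float> weights = make_gauss_table(radius);
    dst.resize(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int i = -radius; i <= radius; ++i) {
                const int sx = std::min(std::max(x + i, 0), width - 1);  // clamp to edge
                acc += weights[i + radius] * src[static_cast<std::size_t>(y) * width + sx];
            }
            dst[static_cast<std::size_t>(y) * width + x] = acc;
        }
    }
}
```

A vertical pass would be the same loop with the tap offset applied to y instead of x. Running the two passes back to back gives a separable 2D blur.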

To test the patch, apply the .diff to the master branch. I use Windows and Visual Studio 2017, but the patch should work on other platforms too. Some things might need at least a C++11-compatible compiler; is this okay? I'm not sure if it compiles on VS 2013, hopefully it does... After building, there should be a new setting in the user preferences System tab: "Processing device". Legacy is the old, unmodified behaviour. CPU and OpenCL are the new ones. Only the "Gaussian Blur" effect in the VSE supports this setting at the moment. Some performance timings are printed to the console after each blur operation.

I listed some of my thoughts and ramblings below, feel free to comment on any of them.

  1. Are there any bigger plans already for VSE (or overall) that might conflict with this?
  2. The idea is to write the algorithm pixel kernel only once and run the same code on CPU or OpenCL. Silently fall back to CPU if the GPU is not supported or fails (out of device memory etc.). Currently the only filter implemented is the gaussian blur, but it is pretty easy to add more. As a note there are at least four image gaussian blur implementations at the moment: compositor normal, compositor SSE2 version, compositor OpenCL and the threaded version in the sequencer. The glow effect has some sort of blur too, so maybe 5 gaussian blurs? Also there are several gauss table generation implementations (one with a TODO comment mentioning this fact). In an ideal world these would all use this new processing lib (modifying compositor would be a lot trickier) and we could remove at least the blur implementations in "seqeffects.c".
  3. The C++ CPU version essentially implements an optimized metaprogramming "environment" that transforms OpenCL kernels into optimized templated code. It also supports SSE2 if available at compile time (SSE2 seems to be required by the official Blender system requirements, is there any need to have a fallback version?). This code generation can cause some code bloat, but it is possible to adjust the balance between performance and code size. The code can also be hard to read and modify for a beginner, but it is not that many lines in total and it's relatively easy to write new kernels if you don't care about the gory implementation details hidden under the hood. Try not to have a heart attack when you see the BPL_EXECUTE_IMAGE_KERNEL macro, it can be simplified a lot if we don't require absolute top performance :)
  4. Uses OpenCL instead of OpenGL. Large images with complicated OpenGL shaders can easily hit hardware limits. Plus OpenCL code passes as normal C code easily, so we can use a subset that works on both GPU and CPU. Also there are some APIs that can be used to share data between OpenCL and OpenGL.
  5. API to easily chain operations; many filters are basically just a series of simple operations run one after another. A key to good performance, especially with GPUs, is to keep the data in device memory. The library uses two memory buffers that can be used to ping-pong data inside the GPU and read it back only when needed (usually when all operations have been performed). For now it is easy to replace individual filters and still get a big performance boost, but in a perfect world effects could be chained together if we only care about the final result. A simple code example:
     // Blur an 8-bit, 4-channel (RGBA) image
     ProcessOperation *op = BPL_process_image_8bit(image, width, height, 4); // Start operation
     BPL_op_image_blur(op, 10, 10); // 10x10 blur
     ... // Chain other filters if wanted
     BPL_op_image_copy(op, image); // Read blurred image data back. Could put it into another buffer, too.
     BPL_end_operation(op); // End operation and free any temporary data
  6. Flexibility. Supports 32-bit floating point and 8-bit images with 1 to 4 color channels (from R to RGBA). Might it be easier to internally use only floats and convert back as needed? Or would it be too bad for performance / memory use?
  7. There is some amount of OpenCL / C++ compatibility stuff in there to run the kernels on CPU; could maybe use Cycles "util_types.h" instead or something? Or is it bad to depend on semi-external libraries like that? Maybe add CUDA support at some point like in Cycles? I haven't checked yet how the code sharing is implemented in there.
  8. Similar to the compositor, BPL is internally written in C++ but exposes only a public C API. "BPL_processing.h" has the basic API and it should be the only file to be used by other parts of Blender.
  9. Thread safety, especially when using shared OpenCL resources.
  10. The patch contains small amounts of copy-pasted code from other Blender modules because I have been avoiding touching any files that I don't have to; in the future they should all be merged.
  11. This is a rough work-in-progress patch, but if I keep working on this, does it have a realistic chance of getting into the Blender codebase at some point? I am willing to keep maintaining it.
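
The "write the kernel once, fall back to CPU silently" idea from the list above can be sketched in plain C++. This is only an illustration under stated assumptions, not the BPL implementation: InvertKernel, run_kernel and the gpu_ok flag are hypothetical, and the "GPU" path is a placeholder loop rather than a real OpenCL launch.

```cpp
#include <cassert>
#include <vector>

// A per-pixel kernel written once as a plain function object (hypothetical).
struct InvertKernel {
    float operator()(float v) const { return 1.0f - v; }
};

enum class Device { CPU, GPU };

// CPU backend: a simple loop over the data.
template <typename Kernel>
void run_on_cpu(const Kernel &kernel, std::vector<float> &data) {
    for (float &v : data) {
        v = kernel(v);
    }
}

// Stand-in for a GPU backend. Returns false when the device cannot be used
// (for example no device, or out of device memory); gpu_ok simulates that.
template <typename Kernel>
bool try_run_on_gpu(const Kernel &kernel, std::vector<float> &data, bool gpu_ok) {
    if (!gpu_ok) {
        return false;
    }
    for (float &v : data) {
        v = kernel(v);  // placeholder for a real device launch
    }
    return true;
}

// Try the preferred device and silently fall back to the CPU on failure.
// Returns the device that actually ran the kernel.
template <typename Kernel>
Device run_kernel(const Kernel &kernel, std::vector<float> &data,
                  Device preferred, bool gpu_ok) {
    if (preferred == Device::GPU && try_run_on_gpu(kernel, data, gpu_ok)) {
        return Device::GPU;
    }
    run_on_cpu(kernel, data);
    return Device::CPU;
}
```

Calling run_kernel(InvertKernel{}, pixels, Device::GPU, false) still produces the inverted pixels and reports Device::CPU, so the caller never has to handle the failure itself.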

Details

Type
Patch

Event Timeline

Woow, 19 times faster!! That deserves attention

I recommend you post the diff via Submit Code,
so the review can be done on the patch itself (with inline annotations).

I also suggest you tag some of the Module Owners to review the patch ;)

Sergey Sharybin (sergey) triaged this task as Normal priority. Nov 22 2017, 3:35 PM

Nice to see activity in this field! It would be nice to have some common fundamentals used by both compositor and sequencer.

Here is some feedback based on reading the description (I didn't have time to look into the code in detail).

  • There was a really similar project back in 2013. See the soc-2013-vse branch in SVN.
  • Chaining filters together will not work for real-life files; you need to allow more complex dependencies, support multiple inputs for filters and such. As a limiting factor here, you probably don't want to have ALL buffers on the device all the time, and you don't want to have roundtrips of data between the device and host sides.
  • In order to have real-time playback, it is really not enough to simply use OpenCL/whatever. The system needs to be aware of both lower-resolution display (when viewing a hi-res result in a zoomed-out manner) and the case where part of a really high-res image is zoomed in a lot.
  • Color space conversion is quite tricky, since you need to fully support the OCIO pipeline, which only has OpenGL bindings. Maybe it's possible to write OpenCL bindings with the already existing API, though.
  • I'm not sure it's the right decision to start with GPU support. Surely, it's nice to keep that in mind, but this approach has limitations, and even on the bare CPU side there is so much which can be done. With AVX2 already existing on lots of artists' CPUs, you can vectorize all the math really well. Surely, that would require some run-time detection of the microarchitecture. The advantage is that this way you've got easier memory management, and potentially even out-of-core computation (which could become essential with VR and such).
  • As a continuation of the previous topic, did you check the Cycles sources? It has all the math functions and such abstracted from the compute device, and works for all of CUDA/OpenCL/CPU. Totally worth checking this, to avoid possibly duplicating work.
  • Another related thing to this discussion is how to mix CPU only code with code which potentially can run on GPU (keep in mind, some algorithms might not be suitable for GPU).
  • For simple alpha-over and color correction of 32 bit float values it could be faster to run everything on CPU (pushing data to the device will kill all the benefits). So how to keep things in a way that simply works for artists?
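
The last point can be made concrete with a rough cost comparison. The sketch below is an assumption-laden illustration, not something from the patch: OffloadEstimate and its fields are hypothetical, the timings would have to be measured or estimated by the caller, and the model simply counts one upload plus one readback of the image.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical inputs for the decision; all values are supplied by the caller.
struct OffloadEstimate {
    double cpu_seconds;       // time to run the operation on the CPU
    double gpu_seconds;       // device compute time, excluding transfers
    double bytes_per_second;  // host <-> device transfer bandwidth
};

// Size of one image in bytes.
inline std::size_t image_bytes(int width, int height, int channels,
                               std::size_t bytes_per_channel) {
    assert(width > 0 && height > 0 && channels > 0);
    return static_cast<std::size_t>(width) * height * channels * bytes_per_channel;
}

// True when upload + compute + readback is expected to beat the CPU.
inline bool gpu_is_worthwhile(const OffloadEstimate &e, std::size_t bytes) {
    assert(e.bytes_per_second > 0.0);
    const double transfer_seconds = 2.0 * static_cast<double>(bytes) / e.bytes_per_second;
    return transfer_seconds + e.gpu_seconds < e.cpu_seconds;
}
```

For a cheap operation such as alpha-over, cpu_seconds is small, so the transfer term alone can exceed it and the function returns false. For a heavy blur the compute saving dominates and it returns true.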

So while it is a nice code experiment (even though it's not surprising that a single blur node with a big radius is faster on the GPU), I don't see a clear design description here of how it all fits together.

@Germano Cavalcante (mano-wii) Thanks. Blur is probably the best-case scenario for OpenCL; many other effects probably won't gain as much speedup. Maybe glow would be another one to benefit a lot. In the wiki, the sequencer doesn't seem to have a module owner at all, but Sergey did already notice this.

@Sergey Sharybin (sergey) Good feedback, thanks. I'll check out that VSE branch and Cycles in more depth. I guess the basic, fundamental idea would be to just get some faster effects in VSE; we don't have to aim for the moon immediately. Of course code complexity vs speed is one tradeoff here, so in the end it's up to the preferences of the Blender maintainers who would have to deal with this code.

I will probably keep tinkering with this in any case, I often use the video editor myself and the slowness of effects was always a bit annoying when compared to many other video editors. Sometimes it can be hard to gauge how the effects and their timing actually looks in the final result because of the big drop in FPS. So if nothing else, at least I have a personal build of Blender with some quick video effects and code available for anyone else who wants to try it :)

I'll add @Campbell Barton (campbellbarton) as subscriber (?) to this task in case he wants to see this, he seems to be another VSE developer member.

Anything that can increase VSE performance is a huge blessing. Thank you!

Great project! Are you aware of the Movit-lib which generates video effects on the GPU? https://git.sesse.net/?p=movit;a=summary

@Peter Fog (tintwotin) Looks interesting, although it's using OpenGL instead of OpenCL. Fortunately many algorithms are (often) easy to port between the two in case someone wants to.

Some minor progress. I picked the next very low-hanging performance fruit and tried to optimize the glow-effect. It was not multithreaded at all like the Gaussian Blur was, so its baseline performance was not optimal to begin with. The glow blur function also had a worrying comment: "Watch out though, it tends to misbehaven with large blur values on a small bitmap. Avoid avoid avoid.". Not sure if I faithfully implemented the bugs, too... At least visual results seem to be identical after some quick tests.

Sergey was right about needing more complex inputs, so I revamped the BPL API quite a bit and implemented image premultiply, unpremultiply, highlight isolation and color blend operations (only addition at the moment, but others are easy to add). As a bonus both gaussian blur effect and glow can now use the same BPL blur implementation.
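
For readers unfamiliar with these operations, here is a minimal per-pixel sketch for float RGBA. It is not the BPL code: the RGBA struct and function names are hypothetical, and the highlight rule (threshold on the channel sum, then boost and clamp) is an assumed formulation chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>

struct RGBA {
    float r, g, b, a;
};

// Straight alpha -> premultiplied alpha.
inline RGBA premultiply(RGBA p) {
    return {p.r * p.a, p.g * p.a, p.b * p.a, p.a};
}

// Premultiplied alpha -> straight alpha. Pixels with zero alpha are
// returned unchanged to avoid dividing by zero.
inline RGBA unpremultiply(RGBA p) {
    if (p.a == 0.0f) {
        return p;
    }
    return {p.r / p.a, p.g / p.a, p.b / p.a, p.a};
}

// Keep only pixels whose channel sum exceeds the threshold, scaled by
// boost and limited to clamp (assumed rule, see the note above).
inline RGBA isolate_highlights(RGBA p, float threshold, float boost, float clamp) {
    const float intensity = p.r + p.g + p.b - threshold;
    if (intensity <= 0.0f) {
        return {0.0f, 0.0f, 0.0f, 0.0f};
    }
    const float scale = boost * intensity;
    return {std::min(clamp, p.r * scale), std::min(clamp, p.g * scale),
            std::min(clamp, p.b * scale), std::min(clamp, p.a * scale)};
}
```

Blurring is normally done on premultiplied data so that transparent pixels do not bleed their colour into neighbours, which is why the conversion brackets the blur.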

Here are some very rough "benchmarks" versus the default Glow implementation. Many effect settings can affect these, but these are the rough averages in a "normal" use case. The test machine was an Intel i7 4.0GHz quad core, render resolution 1280x720.

8-bit RGBA:
BPL CPU: practically identical speed compared to original
BPL OpenCL: roughly 20-23 times faster

32-bit floats:
BPL CPU: roughly 20% faster (thanks to the OpenCL compatibility layer SSE2 support, slower without it)
BPL OpenCL: roughly 10-12 times faster (relative performance halves, I wonder how half floats would fare...)

As a minor detail, the old VSE glow effect always converts images to floats if they are not already, but BPL can operate on either format. Of course clipping and other things may become big issues with 8-bit values. Still, it could speed things up a lot if we don't force the use of floats... OpenCL also makes it possible to stream data asynchronously to and from the GPU, so with multiple input images/strips it might be possible to upload them concurrently (assuming there is enough bandwidth to make it worthwhile). Maybe a prepass during every frame could first start uploading the relevant images to the GPU concurrently and only then start operating on them. That could really kick the memory usage up, but this is a video editor after all...
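
The clipping concern with 8-bit values can be shown with a small conversion sketch. The helper names are hypothetical and the rounding choice (round to nearest) is an assumption, not necessarily what Blender or BPL does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// 8-bit channel value -> float in [0, 1].
inline float byte_to_float(std::uint8_t v) {
    return v / 255.0f;
}

// Float -> 8-bit channel value. Anything outside [0, 1] is clipped,
// so boosted highlights above 1.0 lose their extra range.
inline std::uint8_t float_to_byte(float v) {
    const float clipped = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(clipped * 255.0f + 0.5f);  // round to nearest
}
```

A boosted highlight of 1.7 and a plain white of 1.0 both become 255 after float_to_byte, so any later operation sees them as identical. That is the kind of information a float pipeline keeps and an 8-bit one drops.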

Those two effects (Gaussian blur and Glow) were probably the two easiest things to speed up individually. The harder part will be to first see what is slow in the sequencer (file reading, decoding, strip operations, effects, drawing? etc.) and possibly try to optimize the whole pipeline or at least some slowest parts of it. It certainly doesn't look too easy, but there might be some easy pickings.

Here is the Glow effect using the updated BPL API in case anyone is curious, but things will probably still change. It uses three buffers in total and moves the data between them.

ProcessOperation *op = BPL_create_image_operation();
int a = BPL_op_image_alloc_from_data(op, ibuf1->x, ibuf1->y, format, ibuf1->channels, src);
int b = BPL_op_image_alloc_empty(op, ibuf1->x, ibuf1->y, override_format, ibuf1->channels);
int c = BPL_op_image_alloc_empty(op, ibuf1->x, ibuf1->y, override_format, ibuf1->channels);

BPL_op_image_premultiply(op, a, b); //Premultiply image a to b. (TODO Only use with bytes)
BPL_op_image_isolate_highlights(op, b, c, glow->fMini * 3.0f, glow->fBoost * facf0, glow->fClamp);
BPL_op_image_blur_horizontal(op, c, b, blur_distance);
BPL_op_image_blur_vertical(op, b, c, blur_distance);

int from = c;
if (!glow->bNoComp) {
    BPL_op_image_blend(op, c, a, b, BPL_BLEND_ADD, 1.0f); //Add highlights: (c + a) -> b, fully blended (1.0f)
    from = b;
}
BPL_op_image_unpremultiply(op, from, a);
BPL_op_image_read(op, a, dst); //Read processed image back
BPL_end_operation(op);
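
The handle-based style used above (integer buffer ids, each step reading one buffer and writing another) can be sketched as follows. This is a CPU-only illustration with hypothetical names; the real BPL operation object would own device buffers rather than std::vector storage.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Owns a set of buffers addressed by integer handles (hypothetical design).
class Operation {
public:
    // Copy host data into a new buffer and return its handle.
    int alloc_from_data(const std::vector<float> &data) {
        buffers_.push_back(data);
        return static_cast<int>(buffers_.size()) - 1;
    }

    // Create a zero-filled buffer and return its handle.
    int alloc_empty(std::size_t size) {
        buffers_.emplace_back(size, 0.0f);
        return static_cast<int>(buffers_.size()) - 1;
    }

    // Apply a per-element function, reading from src and writing to dst.
    void apply(int src, int dst, const std::function<float(float)> &fn) {
        assert(valid(src) && valid(dst) && src != dst);
        assert(buffers_[src].size() == buffers_[dst].size());
        for (std::size_t i = 0; i < buffers_[src].size(); ++i) {
            buffers_[dst][i] = fn(buffers_[src][i]);
        }
    }

    // Read a buffer back; in a GPU version this is the only host transfer.
    const std::vector<float> &read(int handle) const {
        assert(valid(handle));
        return buffers_[handle];
    }

private:
    bool valid(int handle) const {
        return handle >= 0 && static_cast<std::size_t>(handle) < buffers_.size();
    }

    std::vector<std::vector<float>> buffers_;
};
```

Chaining apply(a, b, ...) followed by apply(b, a, ...) ping-pongs the data between two buffers, and only the final read() would need to touch host memory in a device-backed version.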

OCIO version 2.0 will have CL support.

@Heikki Salo (hsalo) if you are interested, can you message me with your email and I'll add you to the Blender VSE Slack channel to discuss things?

Perhaps we can get a bit of a collective document together and forge the VSE into a contemporary NLE.

Thank you so much for this work! I love the concept of a unified and optimized image processing library used wherever needed in Blender.

Just curious: Have you considered, while you're on the topic of VSE performance, some kind of more user-friendly caching ('P' to prefetch) mechanism for the VSE - something like you'd find in After Effects (and to a lesser extent the motion tracker - 'P' to prefetch)? Simply because smooth playback (even of unedited footage) in general seems to be one of the main issues with 2D work (editing, or compositing for that matter) in Blender.

Anecdotally, Blender has never seemed very aggressive about memory caching. I may be mistaken?

What do you think about using Halide (http://halide-lang.org/) as an abstraction layer for the hardware vectorization for BPL? I don't have any experience with it yet but it seems like a nice way to target CPUs, CUDA, OpenCL in a joint way.