Improving Video Sequencer performance with C++ and OpenCL #53374
Reference: blender/blender#53374
I have been looking for ways to implement some simple speed improvements for the Blender Video Sequence Editor (VSE), and as a WIP proof of concept I created a patch that adds a new Blender module: Blender Processing Library (aka BPL for now, but feel free to suggest something better).
vse_bpl_blur.diff
It is a new module designed to efficiently process data (at the moment 2D images or 1D buffers). In general I would expect it to be used for filtering images (blurs and other effects) and for other bulk processing (color conversions etc.). The usual stuff. As a first step I implemented a Gaussian blur with this and compared it to the current VSE Gaussian blur implementation. One idea would be to start moving other VSE effects into this processing lib. Some basic benchmarks:
Render: 45 frame 1280x720 image sequence (PNGs, floating point formats are slower)
Blur radius: 50x50 (used a largish radius so relative speed differences are a bit easier to see)
VSE cache set to 1 megabyte in settings to avoid using it
Average blur time per frame in seconds, timed using PIL_check_seconds_timer() (check the console output):
CPU: Intel i7-4790K 4.0GHz (quad core)
GPU: NVIDIA GTX 970
VSE original: 0.19s
BPL CPU: 0.12s (roughly 30% faster)
BPL OpenCL: 0.01s (19 times faster)
CPU: AMD E2-1800 1.7GHz (dual core, super cheap laptop, just idling almost maxes out the CPU meter...)
GPU: AMD Radeon 7340 (integrated)
VSE original: 2.5s
BPL CPU: 2.8s (roughly 10% slower)
BPL OpenCL: 0.18s (14 times faster)
OpenCL clearly beats both CPU versions by a huge margin, but overall I'm satisfied with the BPL CPU performance in this case, especially as it is running generic C++/OpenCL kernels against a hand-crafted C blur (in seqeffects.c). Both the BPL CPU version and the original Blender blur use the Blender threading API to process image data in parallel, but this can be abstracted away better in the BPL case. The BPL CPU version is also very slow in debug builds because template inlining and other compiler optimizations are off.
In practice only the OpenCL version can calculate large-radius blurs during playback (24 FPS) without stuttering. The heavier the operation, the bigger the win for OpenCL. Forcing the sequencer to work with floats (or having float images in the sequencer) slows things down even with OpenCL. It seems that float memory copying and/or conversion overhead on the CPU is starting to show, as increasing the blur radius has very little effect on OpenCL speed but bogs down both CPU versions.
I also tried blurring some 4k resolution images, and the OpenCL version was usually around 8 to 10 times faster than the original blur. With a smaller blur radius such as 10 instead of 50, OpenCL is usually about 5-6 times faster than the original.
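For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of a separable Gaussian blur on a single-channel float image. It is not the code from the patch or from seqeffects.c: the function names, the `sigma = radius / 3` choice and the edge clamping are all assumptions made for illustration. It also shows why the CPU cost grows with the radius: every output pixel needs `2 * radius + 1` taps per pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized 1D Gaussian weights for taps -radius..+radius.
// sigma = radius / 3 is an assumption of this sketch, not taken from the patch.
std::vector<float> make_gauss_table(int radius)
{
    assert(radius >= 0);
    const float sigma = radius > 0 ? radius / 3.0f : 1.0f;
    std::vector<float> table(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        const float w = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
        table[i + radius] = w;
        sum += w;
    }
    for (float &w : table) {
        w /= sum;  // weights add up to 1 so overall brightness is preserved
    }
    return table;
}

// One 1D pass along X or Y over a single-channel float image.
// Taps that fall outside the image are clamped to the nearest edge pixel.
std::vector<float> blur_pass(const std::vector<float> &src, int width, int height,
                             const std::vector<float> &table, bool horizontal)
{
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const int radius = static_cast<int>(table.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = horizontal ? std::min(std::max(x + k, 0), width - 1) : x;
                const int sy = horizontal ? y : std::min(std::max(y + k, 0), height - 1);
                acc += table[k + radius] * src[static_cast<std::size_t>(sy) * width + sx];
            }
            dst[static_cast<std::size_t>(y) * width + x] = acc;
        }
    }
    return dst;
}

// Full blur: a horizontal pass followed by a vertical pass with the same table.
std::vector<float> gaussian_blur(const std::vector<float> &src, int width, int height, int radius)
{
    const std::vector<float> table = make_gauss_table(radius);
    return blur_pass(blur_pass(src, width, height, table, true), width, height, table, false);
}
```

Splitting the blur into two 1D passes keeps the cost per pixel proportional to the radius rather than to the radius squared, which is the usual reason for writing it this way.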
To test the patch, apply the .diff to the master branch. I use Windows and Visual Studio 2017, but the patch should work on other platforms too. Some things might need at least a C++11-compatible compiler, is this okay? Not sure if it compiles on VS 2013, hopefully it does... After building, there should be a new setting in the user preferences System tab: "Processing device". Legacy is the old, unmodified behaviour. CPU and OpenCL are the new ones. Only the "Gaussian Blur" effect in the VSE supports this setting at the moment. Some performance timings are printed to the console after each blur operation.
I listed some of my thoughts and ramblings below, feel free to comment on any of them.
Are there any bigger plans already for VSE (or overall) that might conflict with this?
The idea is to write the algorithm pixel kernel only once and run the same code on CPU or OpenCL. Silently fall back to CPU if the GPU is not supported or fails (out of device memory etc.). Currently the only filter implemented is the gaussian blur, but it is pretty easy to add more. As a note there are at least four image gaussian blur implementations at the moment: compositor normal, compositor SSE2 version, compositor OpenCL and the threaded version in the sequencer. The glow effect has some sort of blur too, so maybe 5 gaussian blurs? Also there are several gauss table generation implementations (one with a TODO comment mentioning this fact). In an ideal world these would all use this new processing lib (modifying compositor would be a lot trickier) and we could remove at least the blur implementations in "seqeffects.c".
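The "write the kernel once, fall back silently" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `Backend`, `RunFn`, `invert_pixel` and `process_with_fallback` are hypothetical names, and the device backend is modelled as a plain callable rather than a real OpenCL queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

enum class Backend { CPU, Device };

// A device backend processes the buffer in place and returns false when it
// cannot run (unsupported GPU, out of device memory, ...). Hypothetical type.
using RunFn = std::function<bool(std::vector<float> &)>;

// The per-pixel kernel, written once and shared by every backend.
inline float invert_pixel(float v)
{
    return 1.0f - v;
}

// CPU path: applies the shared kernel to every pixel.
void run_on_cpu(std::vector<float> &pixels)
{
    for (float &p : pixels) {
        p = invert_pixel(p);
    }
}

// Tries the device first and silently falls back to the CPU.
// Returns the backend that actually produced the result.
Backend process_with_fallback(std::vector<float> &pixels, const RunFn &device)
{
    if (device) {
        // Work on a copy so a device that fails midway cannot corrupt the input.
        std::vector<float> scratch = pixels;
        if (device(scratch)) {
            pixels.swap(scratch);
            return Backend::Device;
        }
    }
    run_on_cpu(pixels);
    return Backend::CPU;
}
```

The caller never has to check for errors: it gets a correct result either way and can log the returned backend if it wants to know which path ran.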
The C++ CPU version essentially implements an optimized metaprogramming "environment" that transforms OpenCL kernels into optimized templated code. It also supports SSE2 if available during compile time (SSE2 seems to be required by the official Blender system requirements, is there any need to have a fallback version?). This code generation can cause some code bloat, but it is possible to adjust the balance between performance and code size. The code can also be hard to read and modify for a beginner, but it is not that many lines in total and it's relatively easy to write new kernels if you don't care about the gory implementation details hidden under the hood. Try not to have a heart attack when you see BPL_EXECUTE_IMAGE_KERNEL macro, it can be simplified a lot if we don't require absolute top performance :)
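To give an idea of what such templated kernel execution can look like, here is a small sketch. It is not the `BPL_EXECUTE_IMAGE_KERNEL` macro or any code from the patch; `ScaleKernel`, `execute_range` and `scale_image` are hypothetical names. The kernel is written like an OpenCL work-item (it only touches the pixel it is given), and the runtime channel count is turned into a compile-time constant so the inner loop can be unrolled and inlined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel in the style of an OpenCL work-item: it receives one pixel index and
// only touches that pixel. The channel count is a compile-time constant.
template <int Channels>
struct ScaleKernel {
    const float *src;
    float *dst;
    float factor;

    void operator()(std::size_t pixel) const
    {
        for (int c = 0; c < Channels; ++c) {
            dst[pixel * Channels + c] = src[pixel * Channels + c] * factor;
        }
    }
};

// Generic executor: runs any kernel over a pixel range. A threaded version
// would hand sub-ranges of [begin, end) to worker threads.
template <typename Kernel>
void execute_range(const Kernel &kernel, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i) {
        kernel(i);
    }
}

// Maps the runtime channel count onto one of four template instantiations.
void scale_image(const std::vector<float> &src, std::vector<float> &dst, int channels, float factor)
{
    assert(channels >= 1 && channels <= 4);
    assert(src.size() % static_cast<std::size_t>(channels) == 0);
    dst.resize(src.size());
    const std::size_t pixels = src.size() / static_cast<std::size_t>(channels);
    switch (channels) {
        case 1: execute_range(ScaleKernel<1>{src.data(), dst.data(), factor}, 0, pixels); break;
        case 2: execute_range(ScaleKernel<2>{src.data(), dst.data(), factor}, 0, pixels); break;
        case 3: execute_range(ScaleKernel<3>{src.data(), dst.data(), factor}, 0, pixels); break;
        case 4: execute_range(ScaleKernel<4>{src.data(), dst.data(), factor}, 0, pixels); break;
    }
}
```

The `switch` is where the code-size trade-off shows up: each case is a separate instantiation of the whole kernel, so supporting more formats or kernel variants this way multiplies the generated code.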
Uses OpenCL instead of OpenGL. Large images with complicated OpenGL shaders can easily hit hardware limits. Plus OpenCL code passes as normal C code easily, so we can use a subset that works on both GPU and CPU. Also there are some APIs that can be used to share data between OpenCL and OpenGL.
API to easily chain operations: many filters are basically just a series of simple operations run one after another. A key to good performance, especially with GPUs, is to keep the data in device memory. The library uses two memory buffers that can be used to ping-pong data inside the GPU and read it back only when needed (usually when all operations have been performed). For now it is easy to replace individual filters and still get a big performance boost, but in a perfect world effects could be chained together if we only care about the final result. A simple code example:
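The two-buffer ping-pong chaining described above can be illustrated with the following minimal sketch. It does not use the BPL API: `PingPongChain`, `then` and `result` are hypothetical names, and plain `std::vector` buffers stand in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Chains operations over two buffers that swap roles after every step.
// With a GPU backend both buffers would live in device memory and only
// result() would trigger a read-back. Hypothetical class, not the BPL API.
class PingPongChain {
public:
    // An operation reads from `in` and must write every element of `out`.
    using Op = std::function<void(const std::vector<float> &in, std::vector<float> &out)>;

    explicit PingPongChain(std::vector<float> input)
        : front_(std::move(input)), back_(front_.size())
    {
    }

    // Runs one operation front -> back, then swaps so the output becomes
    // the input of the next operation.
    PingPongChain &then(const Op &op)
    {
        op(front_, back_);
        std::swap(front_, back_);
        return *this;
    }

    // The only point where data leaves the chain.
    const std::vector<float> &result() const
    {
        return front_;
    }

private:
    std::vector<float> front_;  // holds the current data
    std::vector<float> back_;   // scratch target for the next operation
};
```

A caller would write something like `PingPongChain(pixels).then(premultiply).then(blur).then(unpremultiply).result()`, where the three operations are whatever callables the application provides. No intermediate result is copied out between the steps.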
Flexibility. Supports 32-bit floating point and 8-bit images with 1 to 4 color channels (from R to RGBA). Might be easier to internally use only floats and convert back as needed? Or would it be too bad for performance / memory use?
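As a rough illustration of what "use only floats internally and convert back" would involve, here is a sketch of the two conversions. The function names are hypothetical and this is not taken from the patch. The float copy needs four times the memory of the 8-bit source, and both directions touch every channel value once, which is the overhead the question above is about.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// 8-bit channel values (0..255) to floats in 0..1.
// The result uses sizeof(float) = 4 times the memory of the source.
std::vector<float> bytes_to_floats(const std::vector<std::uint8_t> &src)
{
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = src[i] / 255.0f;
    }
    return dst;
}

// Floats back to 8-bit with clamping and round-to-nearest.
// Values outside 0..1 are clipped, so any HDR information is lost here.
std::vector<std::uint8_t> floats_to_bytes(const std::vector<float> &src)
{
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        assert(!std::isnan(src[i]));  // NaN has no meaningful 8-bit value
        const float clamped = std::min(std::max(src[i], 0.0f), 1.0f);
        dst[i] = static_cast<std::uint8_t>(clamped * 255.0f + 0.5f);
    }
    return dst;
}
```

A round trip through these two functions returns the original 8-bit values, so the conversion itself is lossless for unmodified pixels; the cost is purely time and memory.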
There is some amount of OpenCL / C++ compatibility stuff in there to run the kernels on CPU, could maybe use Cycles "util_types.h" instead or something? Or is it bad to depend on semi-external libraries like that? Maybe add CUDA support at some point like in Cycles? I haven't checked yet how the code sharing is implemented in there.
Similar to the compositor, BPL is internally written in C++ but exposes only a public C-API. "BPL_processing.h" has the basic API and it should be the only file to be used by other parts of Blender.
Thread safety, especially when using shared OpenCL resources.
The patch contains small amounts of copy-pasted code from other Blender modules because I have been avoiding touching any files that I don't have to, in the future they all should be merged.
This is a rough work-in-progress patch, but if I keep working on this does this have a realistic chance of getting into Blender codebase at some point? I am willing to keep maintaining it.
Changed status to: 'Open'
Added subscriber: @hsalo
Added subscriber: @candreacchio
Added subscriber: @mano-wii
Woow, 19 times faster!! That deserves attention
I recommend you post the diff in the section to Submit Code
So the review will be done on the patch (with in-place notations)
I also suggest you tag some of the Module Owners to review the patch ;)
Added subscriber: @Sergey
Nice to see activity in this field! Would be nice to have some common fundamentals used by both compositor and sequencer.
Here goes some feedback based on reading the description (didn't have time to look into code in details).
So while it is a nice code experiment (even though it's not surprising that a single blur node with a big radius will be faster on GPU), I am missing a clear design description of how it all fits together.
Added subscriber: @ideasman42
@mano-wii Thanks. Blur is probably the best case scenario for OpenCL, many other effects probably won't gain as much speedup. Maybe glow would be another one to benefit a lot. In the wiki the sequencer doesn't seem to have a module owner at all, but Sergey did already notice this.
@Sergey Good feedback, thanks. I'll check out that VSN-branch and Cycles in more depth. I guess the basic, fundamental idea would be to just get some faster effects in VSE, we don't have to aim for the moon immediately. Of course code complexity vs speed is one tradeoff here, so in the end it's up to the preferences of the Blender maintainers who would have to deal with this code.
I will probably keep tinkering with this in any case, I often use the video editor myself and the slowness of effects was always a bit annoying when compared to many other video editors. Sometimes it can be hard to gauge how the effects and their timing actually looks in the final result because of the big drop in FPS. So if nothing else, at least I have a personal build of Blender with some quick video effects and code available for anyone else who wants to try it :)
I'll add @ideasman42 as subscriber (?) to this task in case he wants to see this, he seems to be another VSE developer member.
Added subscriber: @Blendify
Added subscriber: @TobiaszunfaKaron
Anything that can increase VSE performance is a huge blessing. Thank you!
Added subscriber: @tintwotin
Great project! Are you aware of the Movit-lib which generates video effects on the GPU? https://git.sesse.net/?p=movit;a=summary
@tintwotin Looks interesting, although it's using OpenGL instead of OpenCL. Fortunately many algorithms are (often) easy to port between the two in case someone wants to.
Some minor progress. I picked the next very low-hanging performance fruit and tried to optimize the glow-effect. It was not multithreaded at all like the Gaussian Blur was, so its baseline performance was not optimal to begin with. The glow blur function also had a worrying comment: "Watch out though, it tends to misbehaven with large blur values on a small bitmap. Avoid avoid avoid.". Not sure if I faithfully implemented the bugs, too... At least visual results seem to be identical after some quick tests.
Sergey was right about needing more complex inputs, so I revamped the BPL API quite a bit and implemented image premultiply, unpremultiply, highlight isolation and color blend operations (only addition at the moment, but others are easy to add). As a bonus both gaussian blur effect and glow can now use the same BPL blur implementation.
Here are some very rough "benchmarks" versus the default Glow implementation. Many effect settings can affect these, but these are the rough averages in a "normal" use case. Test machine was an Intel i7 4.0GHz quad core, render resolution 1280x720.
8-bit RGBA:
BPL CPU: practically identical speed compared to original
BPL OpenCL: roughly 20-23 times faster
32-bit floats:
BPL CPU: roughly 20% faster (thanks to the OpenCL compatibility layer SSE2 support, slower without it)
BPL OpenCL: roughly 10-12 times faster (relative performance halves, I wonder how half floats would fare...)
As a minor detail old VSE glow effect always converts images to floats if they are not already, but BPL can operate on either format. Of course clipping and other things may become big issues with 8-bit values. Still, it could speed things up a lot if we don't force the use of floats... OpenCL also makes it possible to stream data asynchronously to and from the GPU, so with multiple input images/strips it might be possible to upload them concurrently (assuming there is enough bandwidth to make it worthwhile). Maybe a prepass during every frame that first starts uploading relevant images to GPU concurrently and only then starts operating on them. Could really kick the memory usage up, but this is a video editor after all...
Those two effects (Gaussian blur and Glow) were probably the two easiest things to speed up individually. The harder part will be to first see what is slow in the sequencer (file reading, decoding, strip operations, effects, drawing? etc.) and possibly try to optimize the whole pipeline or at least some slowest parts of it. It certainly doesn't look too easy, but there might be some easy pickings.
Here is the Glow effect using the updated BPL API in case anyone is curious, but things will probably still change. It uses three buffers in total and moves the data between them.
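For readers who want a feel for how the operations listed earlier (premultiply, highlight isolation, blur, additive blend, unpremultiply) can be combined over three buffers, here is a minimal sketch. It is not the code from the patch and does not use the BPL API: every name is hypothetical, the order of the steps is an assumption, it works on a single row of pixels to stay short, and a simple box blur stands in for the Gaussian blur.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rgba { float r, g, b, a; };
using Row = std::vector<Rgba>;  // one row of pixels keeps the sketch short

void premultiply(const Row &in, Row &out)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = {in[i].r * in[i].a, in[i].g * in[i].a, in[i].b * in[i].a, in[i].a};
    }
}

void unpremultiply(const Row &in, Row &out)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float inv = in[i].a > 0.0f ? 1.0f / in[i].a : 0.0f;
        out[i] = {in[i].r * inv, in[i].g * inv, in[i].b * inv, in[i].a};
    }
}

// Keeps only the part of each color channel that exceeds the threshold.
void isolate_highlights(const Row &in, Row &out, float threshold)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = {std::max(in[i].r - threshold, 0.0f), std::max(in[i].g - threshold, 0.0f),
                  std::max(in[i].b - threshold, 0.0f), in[i].a};
    }
}

// Box blur with edge clamping; a stand-in for the Gaussian blur.
void box_blur(const Row &in, Row &out, int radius)
{
    const int n = static_cast<int>(in.size());
    const float inv = 1.0f / static_cast<float>(2 * radius + 1);
    for (int i = 0; i < n; ++i) {
        Rgba acc = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int k = -radius; k <= radius; ++k) {
            const Rgba &p = in[std::min(std::max(i + k, 0), n - 1)];
            acc.r += p.r; acc.g += p.g; acc.b += p.b; acc.a += p.a;
        }
        out[i] = {acc.r * inv, acc.g * inv, acc.b * inv, acc.a * inv};
    }
}

// Additive blend of the glow onto the base; the base alpha is kept.
void add_blend(const Row &base, const Row &glow, Row &out)
{
    for (std::size_t i = 0; i < base.size(); ++i) {
        out[i] = {base[i].r + glow[i].r, base[i].g + glow[i].g, base[i].b + glow[i].b, base[i].a};
    }
}

// The whole effect, moving data between three buffers.
Row glow_row(const Row &input, float threshold, int radius)
{
    assert(!input.empty() && radius >= 0);
    Row base(input.size()), work_a(input.size()), work_b(input.size());
    premultiply(input, base);                     // input  -> base
    isolate_highlights(base, work_a, threshold);  // base   -> work_a
    box_blur(work_a, work_b, radius);             // work_a -> work_b
    add_blend(base, work_b, work_a);              // base + work_b -> work_a
    unpremultiply(work_a, work_b);                // work_a -> work_b
    return work_b;
}
```

Pixels below the threshold that are far from any highlight come out unchanged, while pixels next to a bright one pick up some of its blurred energy.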
Added subscriber: @troy_s
OCIO version 2.0 will have CL support.
@hsalo if you are interested, can you message me with your email and I'll add you to the Blender VSE Slack channel to discuss things?
Perhaps we can get a bit of a collective document together and forge the VSE into a contemporary NLE.
Added subscriber: @Darkfie9825
Thank you so much for this work! I love the concept of a unified and optimized image processing library used wherever needed in Blender.
Just curious: Have you considered, while you're on the topic of VSE performance, some kind of more user-friendly caching ('P' to prefetch) mechanism for the VSE - something like you'd find in After Effects (and to a lesser extent the motion tracker - 'P' to prefetch)? Simply because smooth playback (even of unedited footage) in general seems to be one of the main issues with 2D work (editing, or compositing for that matter) in Blender.
Anecdotally, Blender has never seemed very aggressive about memory caching. I may be mistaken?
Added subscriber: @NahuelBelich
Added subscriber: @tmb
What do you think about using Halide (http://halide-lang.org/) as an abstraction layer for the hardware vectorization for BPL? I don't have any experience with it yet but it seems like a nice way to target CPUs, CUDA, OpenCL in a joint way.
Added subscriber: @michaellackner
Added subscriber: @kjy-4
Added subscriber: @iss
Changed status from 'Open' to: 'Archived'
moving as TODO to #64682 (Video Sequence Editor (Sequencer) Module)