FFmpeg: Improve multi-threading settings
Accepted, Public

Authored by Sergey Sharybin (sergey) on Aug 19 2020, 4:27 PM.

Details

Summary

Allow using all system threads for frame encoding/decoding. This is very
straightforward: the value of zero basically disables threading.

Change the threading policy to slice when decoding frames. The reason is
that decoding happens frame-by-frame, so an inter-frame threading
policy will not bring any speedup.

The change of threading policy to slice is less obvious and is based on
a benchmark of the demo files from T78986. This gives the best performance so
far.
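
As a rough illustration of the two settings described above, here is a minimal sketch. It deliberately does not include libavcodec: `CodecThreadSettings` and `codec_use_slice_threading` are hypothetical names standing in for the `thread_count` and `thread_type` fields of `AVCodecContext`, and the enum values mirror `FF_THREAD_FRAME` and `FF_THREAD_SLICE`. It is not the code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two AVCodecContext fields discussed here.
 * Real code would set these on an AVCodecContext from libavcodec. */
typedef struct CodecThreadSettings {
  int thread_count;
  int thread_type;
} CodecThreadSettings;

/* Same numeric values as FF_THREAD_FRAME and FF_THREAD_SLICE in libavcodec. */
enum {
  THREAD_TYPE_FRAME = 1, /* Decode more than one frame at once. */
  THREAD_TYPE_SLICE = 2, /* Decode several slices of one frame at once. */
};

/* Use an explicit thread count and slice threading.
 * `system_threads` is whatever the host reports; values below 1 fall back
 * to a single thread (an assumption of this sketch). */
static void codec_use_slice_threading(CodecThreadSettings *c, int system_threads)
{
  assert(c != NULL);
  c->thread_count = (system_threads > 0) ? system_threads : 1;
  c->thread_type = THREAD_TYPE_SLICE;
}
```

With real FFmpeg these fields would have to be set before the codec is opened.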

Rendering the following file went down from 190 sec to 160 sec (roughly 16% less time).

https://storage.googleapis.com/institute-storage/vse_simplified_example.zip

This change makes both reading and writing faster. The animation render
is just the easiest way to get actual time metrics.

Diff Detail

Repository
rB Blender
Branch
ffmpeg_threads (branched from master)
Build Status
Buildable 9642
Build 9642: arc lint + arc unit

Event Timeline

Sergey Sharybin (sergey) requested review of this revision. Aug 19 2020, 4:27 PM
Sergey Sharybin (sergey) created this revision.
This revision is now accepted and ready to land. Aug 19 2020, 4:53 PM

Could this also affect playback performance of cached video? Sounds strange, but here's before the patch and after:

(Could proxy generation also benefit from this finding?)

Playback of cached videos does not use the changed codepaths. If there is a difference in playback speed before/after this patch with cached videos, something else is involved (CPU performance governor, turbo-boosting, or something like this). On Linux I always force the governor to performance before making such tests, to make sure dynamic frequency scaling does not affect the results.

Proxy generation from a video file should also benefit from this change. Basically, any area which needs to read frames from movie. Proxy encoding is currently done in a different codepath, so it is not affected by this change.

Basically, any area which needs to read frames from movie.

I guess the difference shown in the gif could be a result of "Prefetch Frames" being on, so it is still reading from disk (but ahead).

When talking about proxies, which are in MJPEG: I think Libav has an option for hardware-accelerated decoding of MJPEG, but not encoding, which maybe could be enabled? Whereas H.264 (in an .avi container) has hardware-accelerated decoding and encoding, specials for NVIDIA cards, etc.

I don't want to sound pessimistic, but I tested this with https://storage.googleapis.com/institute-storage/vse_simplified_example.zip

And results are:
no patch 36fps ~13% CPU usage 5:10 render time
with patch 36fps ~13% CPU usage 5:10 render time

So I see no improvement here. I have 8 CPU cores, so I guess this should produce a measurable difference.

There are different bottlenecks in the playback. I did see some playback improvements, but my setup is different.
The effect of the patch is visible when doing transcoding, aka animation rendering (F12).

I did do transcoding, in both cases resulting in the exact same render time. Playback speed is just a "control" data point. With different playback speeds, the whole test would be useless.
Even if I make sure I feed the encoder with a lot of data quickly, I see no difference.

Though I am not sure that the statement "the value of zero basically disables threading." is correct (c->thread_count). This was implemented in D4031, and the claim there is that a value of 0 is automatic. The FFmpeg docs don't say anything in particular either. I didn't look in the code to confirm. But when I set c->thread_count to 1, I get 30% worse performance.

-threads 0 is supposedly the "optimal" number of threads, which doesn't equal the max number of threads. Here's an analysis of the -threads setting (for streaming):
https://streaminglearningcenter.com/blogs/ffmpeg-command-threads-how-it-affects-quality-and-performance.html

-threads 0 is supposedly the "optimal"

Clearly, this is not the case on my system, and (based on D8659) on Peter's in some cases as well.

Since there are no downsides to being explicit about the number of threads and the threading model, I think we should do it from the Blender side, and not rely on some magic number from FFmpeg.
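
For illustration only, one way to obtain an explicit thread count on a POSIX-like system is to ask for the number of online logical CPUs. This is a sketch under assumptions: `online_cpu_count` is a hypothetical helper name, `_SC_NPROCESSORS_ONLN` is a common extension (available on Linux and macOS) rather than something every platform provides, and Blender would use its own utility for this.

```c
#include <unistd.h>

/* Number of online logical CPUs, or 1 if the value cannot be determined.
 * The result can then be passed to the codec as an explicit thread count
 * instead of leaving the choice to the library's default. */
static int online_cpu_count(void)
{
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  return (n > 0) ? (int)n : 1;
}
```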