Page MenuHome

Speed up saving when rendering movies / Profiling walk-though

Authored by Ray molenkamp (LazyDodo) on Fri, Jun 26, 3:31 AM.
"Heartbreak" token, awarded by tintwotin."Love" token, awarded by Alaska."Love" token, awarded by EAW."Love" token, awarded by hlorus."Mountain of Wealth" token, awarded by EitanSomething."Party Time" token, awarded by harley."Love" token, awarded by jorsh.



While this diff solves an actual issue, it is written more as a guide on how to analyze performance issues.

It's a common complaint that rendering is slow and that saving takes too much time. T77848 made an attempt at profiling but kinda missed the mark, more on that later.

For testing the most basic scene we'll use a 60 frame rotating cube, written to h264in a MP4 container.

Recording the profile

For recording the profile UiForETW was used and WPA was used for analysis (UiForETW will install this if it is not installed yet)

For profiling we'd like repeatable results, what is not repeatable? the speed I need to click around in the UI to get things to render, so for this profiling session we will be doing a command line render of the file listed above.

Start recording in UiForETW and run the following command from the command line

blender --factory-startup -b "eevee_cube_mp4_test_1.blend" -o "//eevee_cube_mp4_test_1" -a

once the render is done, click the "Save Trace buffers" select the trace just created hit F2 and give it a name you may still recognize a few days from now, then double click to open up WPA.

Getting around in WPA

For WPA to give useful information, it'll need a way to go from addresses to function names, click Trace->Load Symbols to load this symbol information, this may take a "bit" the first time, it will tell you you can keep using WPA while it loads, do not believe its lies, wait until the symbol load is complete or WPA will just get stuck and you'll end up closing it and having to start over.

For this analysis we will be using the "CPU Usage (sampled)' view , the way this works is , during the recording of the profile data, 1000+ times a second the cores will be stopped and be queried for "whatcha doing?" and their stacks will be collected. the sampled view will allow us to perform various queries on this data, you can safely close the other windows.

Hold up!

There is this "Cpu Usage (Precise) window" that sounds useful, why are we getting rid of that? I like precise data! precise data sounds like it's good!

Ah yes, easy mistake to make, while it indeed does says precise it doesn't really tell you what it is precise about, while recording the trace, every-time a thread was allowed to run a copy of the stack was made when it started running and when it stopped.

The "CPU Usage (Precise)" window will give you a way to query this data, it will tell you *VERY* precise *WHEN* you are running, but since it only collects stacks at the beginning and end, it's rather useless to figure out *WHAT* you are running, while it is useful (to see it in action see D6267) it's not what we'll be using today.

Customizing the view

Once you have closed all other views switch the remaining view to "Flame by process, stack"

Well that is not the most exciting thing ever, this is because by default the graph is not visible, click on this icon to switch to the mixed graph + table view

Better but.. it's so tiny... turns out the profiler didn't just profile blender, it profiled "the system" which is great since there is a lot of data that may be relevant but for now it's a bit much, right click the blender executable and select "filter to selection" to only look at the data for blender.

allright, better but still, thats a lot of stuff (also this is where T77848 went wrong) by default this view will combine all threads, however given the OCIO code is rather well multi threaded it will end up counting much of this data double (or 4, 8 or however many cores you have) in T77848 it *DID* spend 243s in color management, but much of that was simultaneously, the user did not wait that long, so it be better to look at just the main thread, since that is closer to actual time the user will experience.

Click the little gear icon on top of the window, check the checkbox for "Thread ID" and drag is up right under "Process"

ah now we have it nicely spit by thread, since we only care about the main thread right now, look for the one that has "main" in it, right click it and filter to selection.

at this point you may end up with a view like this

one of the spikes will be so tall, it makes the whole graph hard to read, in this case, this is the python init code, we do not care for it , lets get rid of it! click near the bottom of the spike in the graph, and the table view should bring you somewhere near WM_INIT

Somewhere near it should be a line for BPY_Python_start right click it and select 'Filter OUT selection' (the out bit is important)

Allright, this we can work with!

note: that since a flame graph only tells you the proportions of where time is spend, the location of a function horizontally means nothing, and is generally rather random, your graph may have the order horizontally switched, and that's fine!

actually looking at the data

Alright 24.6 seconds is needed to render this of which 10.6 seconds is spend rendering and 13.97 seconds is spend saving, not great, not terrible, but without a doubt there is room for improvement.

Splitting it out further

(most of the mentioned cannot be seen in the screenshot above, you're gonna have to take my word for it)

Encoding a movie sounds and feels expensive, however this profile tells us of that 13.97s spend saving, only 1.63s is spend in ffmpeg, so the cost of encoding is *NOT* the most expensive process at hand here.

IMB_colormanagement_imbuf_for_write however is soaking up a solid 12.08 seconds, that's not super great.

What is taking so long?

A fair ammount of time is spend in processorApply_predivide lets see what that's doing

void OCIOImpl::processorApply_predivide(OCIO_ConstProcessorRcPtr *processor,
                                        OCIO_PackedImageDesc *img_)
  try {
    PackedImageDesc *img = (PackedImageDesc *)img_;
    int channels = img->getNumChannels();

    if (channels == 4) {
      float *pixels = img->getData();

      int width = img->getWidth();
      int height = img->getHeight();

      for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
          float *pixel = pixels + 4 * (y * width + x);

          processorApplyRGBA_predivide(processor, pixel);
    else {
      (*(ConstProcessorRcPtr *)processor)->apply(*img);
  catch (Exception &exception) {

So this will be doing a single OCIO call per pixel, yikes the call overhead on this cannot be good.

Can we optimize this?

We probably can, better is looking if we can avoid this function all together, let see how we ended up here from the profile above, we can see that IMB_colormanagement_imbuf_for_write seems to be driving this process, lets have a look!


just taking a peek inside, we find this comment which seems relevant

/* If we're saving from RGBA to RGB buffer then it's not
 * so much useful to just ignore alpha -- it leads to bad
 * artifacts especially when saving byte images.
 * What we do here is we're overlaying our image on top of
 * background color (which is currently black).
 * This is quite much the same as what Gimp does and it
 * seems to be what artists expects from saving.
 * Do a conversion here, so image format writers could
 * happily assume all the alpha tricks were made already.
 * helps keep things locally here, not spreading it to
 * all possible image writers we've got.

followed by a call to IMB_alpha_under_color_float which will effectively blend in a black background color and set the alpha to 1.0, from this point forward we are *GUARANTEED* that Alpha will *ALWAYS* be 1.0, so taking the expensive 1 OCIO call per pixel path later to on deal with the alpha properly is on the "less optimal side of things"

So then what?

Well we are guaranteed A = 1.0 for the whole image, it may as well not exist! If it doesn't exist the RGB path will probably be taken which will hopefully operate on a full buffer rather than a pixel at at time.


Lets make a new function that'll take a RGBA buffer like IMB_alpha_under_color_floatbut will drop the Alpha rather than setting it to 1.0, given the buffer this function gets has been sized for W * H * 4 elements (RGBA) storing only W*H*3 (RGB) in it should not be a problem, and given this buffer is short lived, wasting 25% of the storage space can be forgiven since its still cheaper than allocating a new buffer and copying it over and freeing the old one. (the buffers are unlikely to sit close to each-other in memory, making it easier on the cache to re-use the existing buffer while doing the conversion, that being said, i'm unsure if this will make it through review unscathed)

Testing the idea

lets render the same scene again and see whats up

Alright we're down to 19.6 seconds, not as much as i'd hoped but 30% improvement in save time is still pretty good, and worth this small profiling excercise

before24.610.613.97 (100%)
after19.610.68.89 (64%)

Diff Detail

rB Blender
tmp_save_speedup (branched from master)
Build Status
Buildable 8737
Build 8737: arc lint + arc unit

Event Timeline

Ray molenkamp (LazyDodo) requested review of this revision.Fri, Jun 26, 3:31 AM
Ray molenkamp (LazyDodo) created this revision.
Ray molenkamp (LazyDodo) retitled this revision from Speed up saving when rending movies. to Speed up saving when rending movies / Profiling walk-though .
Ray molenkamp (LazyDodo) retitled this revision from Speed up saving when rending movies / Profiling walk-though to Speed up saving when rendering movies / Profiling walk-though .Fri, Jun 26, 6:00 AM
Ray molenkamp (LazyDodo) edited the summary of this revision. (Show Details)
Ray molenkamp (LazyDodo) edited the summary of this revision. (Show Details)
Ray molenkamp (LazyDodo) retitled this revision from Speed up saving when rendering movies / Profiling walk-though to Speed up saving when rendering movies / Profiling walk-though.
Ray molenkamp (LazyDodo) edited the summary of this revision. (Show Details)
  • fix wrong variable in IMB_alpha_under_color_float_v4v3