Cycles: Add sample-based runtime profiler that measures time spent in various parts of the CPU kernel
Needs ReviewPublic

Authored by Lukas Stockner (lukasstockner97) on Sat, Nov 3, 2:20 AM.



The actual info that ends up printed is still WIP, but in general I think having this information makes a lot of sense, especially since I don't measure any performance hit.

Right now, it shows a breakdown of the time spent in various parts of the kernel, total time spent on all objects and shaders, and relative cost of shaders and objects (how much time they took vs. how often they were hit).

Diff Detail

rB Blender
Build Status
Buildable 2456
Build 2456: arc lint + arc unit

Here's an example output of the BMW scene:

LazyDodo (LazyDodo) rescinded a token.
LazyDodo (LazyDodo) awarded a token.
Brecht Van Lommel (brecht) requested changes to this revision.Sat, Nov 3, 1:32 PM

This looks great, that output is super useful.

Can we avoid creating a profiling thread for every render thread, and have just one? And also keep the same profiling thread for the entire render session? Right now it seems to creating and destroying threads continuously for viewport rendering.

The way I imagined this to work is that it would just be a single thread gathering the stats for all threads, without needing to merge stats.


Do we really need to use a set for this? If the events are next to each other in the enum we can just check if it's within a range.


Code style: { on new line.


Unused variable.

This revision now requires changes to proceed.Sat, Nov 3, 1:32 PM
Brecht Van Lommel (brecht) requested changes to this revision.Sun, Nov 4, 7:12 PM
  • Can we disable the stats threads when the stats are not needed? Particularly for viewport and icon preview renders.
  • For GPU rendering the stats are still printed with lots of NaNs. Instead it should print a line about stats not being supported for GPU render.
  • Under Object Statistics, I see the same name printed many times when there are instances (for example the fishy cat test scene). These should be merged.
  • What is "Relative Cost" exactly? Is it a percentage, and if so, can we display the "%" symbol next to the number?

Other than that this looks ready to me.


Remove debug print.

This revision now requires changes to proceed.Sun, Nov 4, 7:12 PM

Addressed feedback.

The relative cost measures (time spent in a shader)/(times a shader has been hit) relative to the overall time per hit of all shaders. Therefore, it's a measure of how complex the shader is. I guess it could be displayed as percent, but I'm not sure, I'll think about it.

Also, I added profiling output to the Denoiser. Technically this one could be profiled on CUDA/OpenCL as well since it records the events on the host side, but I don't think adding the additional code for handling the special case of only profiling one type is worth it.

Brecht Van Lommel (brecht) requested changes to this revision.Wed, Nov 7, 10:46 AM

Minor comments.


For more code sharing with e.g. Cycles standalone, would be good if these 3 lines were moved into a session->collect_statistics() function.


Put the is_cpu && background part of the test to disable profiling here, so it's not Blender specific and doesn't cause accidental performance impact for other Cycles integrations.


This has O(n^2) complexity, which is a problem when there are many objects. It should use a hash lookup.


Capitalize: "Non-local Means".

This revision now requires changes to proceed.Wed, Nov 7, 10:46 AM
Lukas Stockner (lukasstockner97) marked 3 inline comments as done.Thu, Nov 8, 4:28 AM
Lukas Stockner (lukasstockner97) added inline comments.

CPU makes sense but I'm not sure about the background part - after all, params.profiling is false by default, so other Cycles integrations shouldn't be affected unless they explicitly support it in their code. Enabling it for GPU never makes sense, but someone might want to enable it for viewport rendering so I don't see why we should force it to be disabled here when it's already disabled for Blender and by default for others.


I tried that at first but didn't really find a way to implement it nicely.

I had another go at it and the result is not too bad now, so I think it's worth it for the performance improvement.

Lukas Stockner (lukasstockner97) marked an inline comment as done.

Addressed feedback.

Hey, I was working on something like that too :D

Mine is based on PBRT's profiler, which instead of launching extra threads uses SIGPROF to do the sampling:

By limiting its sampler to 64 states, PBRT can implement it as a simple atomic integer with little overhead.

Looks good to me now.

SIGPROF is lower overhead I guess, avoiding a thread context switch? I wouldn't mind switching to it when supported (so not on Windows). The current state could be compressed to 64 bits, by using some of the bits from the shader int. Scenes don't have that many shaders anyway.

These kinds of optimizations can be done after this is in master I think.

This revision is now accepted and ready to land.Thu, Nov 8, 2:07 PM

Why not to move this to a device info?

1398 ↗(On Diff #12444)
TODO(name): Sentence, which starts from a capital and ends with a fullstop,

This is not an id i guess, but more like an index. If this is controlled by object, make it private. And in any case, consider adding a comment whether it's "any unique integer value" or is coming from a continuous indices of objects in the ObjectManager.

P.S. From reading further, the real name is index_in_device (adjust to something shorter if possible), and comment should also state that device update is taking care of this field, and that nobody is supposed to initialize it in order to get Cycles integration to work.


do_profiling or is_profiling ?


const ?




num_hits ?


Comment should state what those are indexed by, and what are the stored value are. How often is not descriptive, is it frequency in herz? Total number of events?


Indexed by ... ?


do_stop_worker ?

Sergey Sharybin (sergey) requested changes to this revision.Thu, Nov 8, 2:35 PM
This revision now requires changes to proceed.Thu, Nov 8, 2:35 PM
Lukas Stockner (lukasstockner97) marked 10 inline comments as done.Fri, Nov 9, 2:57 AM

SIGPROF is interesting, but since it's a process-level signal, it might be a bad idea to set a handler in Cycles which is supposed to be used as a library after all...

I'm not sure what the advantage of having a single 64-bit state would be? I've added a comment in the code that goes into a bit of detail, but I'll add some more here:

  • The reads/writes that the code currently uses are guaranteed to be atomic on x86/x64 (and ARM afaik) even without explicit atomics/locking
  • Even on other exotic architectures where this might not be the case it would only show up as individual samples in the wrong event (since the profiler thread checks the values it just read, corrupt event/ids will be ignored), probably not even noticeable in practice
  • There is no guaranteed memory order, but that doesn't matter since there is no real interdependence between the states (if e.g. an additional field changes its meaning based on the current state, it would matter, but currently that's not the case)
  • Packing the current state into a compact format would add complexity to the state updates, and I'm trying to keep those as small as possible to minimize the performance impact (since they also run when the profiler isn't used)

Not really, it refers to the sum of the hits of this entry and all its sub-entries. The naming here is a bit confusing in general (for example, these actually are samples, not hits, I mixed up the labels), the new version should be clearer.

Lukas Stockner (lukasstockner97) marked an inline comment as done.

Addressed feedback.