
Cycles: GPU Performance
Confirmed, Normal · Public · TO DO

Assigned To
None
Authored By
Brecht Van Lommel (brecht)
Apr 26 2021, 5:14 PM

Description

Memory usage
  • Auto detect good integrator state size depending on GPU (hardcoded to 1 million now)
  • Reduce size of IntegratorState
    • Don't store float3 as float4
    • Don't allocate memory for unused features (volumes, sss, denoising, light passes)
    • Dynamic volume stack size depending on scene contents (D12925)
    • Overlap SSS parameters with other memory
    • Compress some floats as half float
    • Reduce bits for integers where possible
    • Can diffuse/glossy/transmission bounces be limited to 255?
  • SoA
    • Individual arrays for XYZ components of float3?
    • Pack together 8bit/16bit values into 32bit?
  • Reduce size of ShaderData
    • Compute some differentials on the fly?
    • Read some data directly from ray/intersection?
    • Don't copy matrix if motion blur is disabled
  • Reduce kernel local memory
    • Dynamically allocate closures based on used shaders
    • Dynamically allocate SVM stack based on used shaders
    • Check if SVM stack is allocated multiple times
    • Check on deduplicating ShaderData instances
    • Check for other unknown sources of memory usage
    • Use either shade_surface or shade_surface_raytrace for reserving memory
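A minimal sketch of the SoA and bit-packing ideas above. All names here (PathStateAoS, PathStateSoA, pack_bounces, unpack_bounce) are hypothetical and not taken from the Cycles source, and the sketch assumes every bounce counter can be limited to 255.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical array-of-structures layout: the float3 throughput is padded
// to four floats, and every bounce counter uses a full 32 bits.
struct PathStateAoS {
  float throughput[4];  // xyz plus one unused padding float
  uint32_t bounce;
  uint32_t diffuse_bounce;
  uint32_t glossy_bounce;
  uint32_t transmission_bounce;
};

// Hypothetical structure-of-arrays layout: one array per float3 component
// (no padding float) and four 8-bit counters packed into one 32-bit word.
struct PathStateSoA {
  std::vector<float> throughput_x, throughput_y, throughput_z;
  std::vector<uint32_t> packed_bounces;

  explicit PathStateSoA(size_t num_paths)
      : throughput_x(num_paths),
        throughput_y(num_paths),
        throughput_z(num_paths),
        packed_bounces(num_paths) {}
};

// Pack four counters into 32 bits. Assumes each value fits in 8 bits;
// larger values are truncated by the mask.
inline uint32_t pack_bounces(uint32_t bounce, uint32_t diffuse,
                             uint32_t glossy, uint32_t transmission) {
  return (bounce & 0xFFu) | ((diffuse & 0xFFu) << 8) |
         ((glossy & 0xFFu) << 16) | ((transmission & 0xFFu) << 24);
}

// Extract counter `index` (0 = bounce, 1 = diffuse, 2 = glossy,
// 3 = transmission) from a packed word.
inline uint32_t unpack_bounce(uint32_t packed, int index) {
  return (packed >> (8 * index)) & 0xFFu;
}

// Per-path footprint of the two layouts for the fields above.
constexpr size_t bytes_per_path_aos() { return sizeof(PathStateAoS); }
constexpr size_t bytes_per_path_soa() {
  return 3 * sizeof(float) + sizeof(uint32_t);
}
```

For these fields alone, the hypothetical SoA layout halves the per-path footprint. Whether that carries over to the real IntegratorState depends on which fields it actually holds.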
Kernel Size
  • Replace the megakernel used for the last few samples? It only helps about 10% with viewport render and barely with batch render, but it makes OptiX runtime compilation slow.
  • Make svm_eval_nodes a templated function and specialize it for
    • Background
    • Lights
    • Shadows
    • Shader-raytracing
    • Volumes
  • Avoid shader ray-tracing for AO pass (D12900)
  • Verify if more specialization is possible in svm_eval_nodes (seems not)
  • Deduplicate shader evaluation call in shade_background
Scheduling
  • Make shadow paths fully independent so they do not block main path kernels (D12889)
  • Accurately count number of available paths for scheduling additional work tiles
  • Compact shadow paths similar to main paths (D12944)
  • Consider adding scrambling distance in advanced options (D12318)
  • Compact path states for coherence
  • Tweak tile size for better coherence
  • Tweak work tiles or pixel order to improve coherence (many small tiles, Z-order, ..)
  • Try other shader sorting techniques (but will be more expensive than bucket sort)
    • Take into account object ID
    • Take into account random number for BSDF/BSSRDF selection?
  • Overlapped kernel execution
    • Use multiple independent GPU queues? (so far was found to be 15% slower)
    • Use multiple GPU queues to schedule different kernels?
  • Optimize active index, sorting and prefix sum kernels
    • Parallelize prefix_sum
    • Build active/sorted index array for a specific kernel based on another array indicating active paths for all kernels, especially when the number of paths is small
  • Try pushing (part of) integrator state in queues rather than persistent location, for more coherent memory access
    • Check potential performance benefit by coalescing state
    • Shadow paths
    • Main path

  • Cancelling renders and updates can be slow due to the occupancy heuristic that schedules more samples. Find a way to reduce this problem.
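For reference, a serial CPU sketch of what the active index, bucket sort and prefix sum steps compute together. The function name build_sorted_active_indices and the convention that a negative shader id marks an inactive path are assumptions for illustration only; the actual GPU kernels are parallel and organized differently.

```cpp
#include <vector>

// Hypothetical helper: returns the indices of all active paths, grouped by
// shader id (bucket sort). queued_shader[i] < 0 marks path i as inactive.
// Precondition: every active id is in the range [0, num_shaders).
std::vector<int> build_sorted_active_indices(
    const std::vector<int>& queued_shader, int num_shaders) {
  // 1. Count active paths per shader, shifted by one slot so that the
  //    prefix sum below directly yields bucket start offsets.
  std::vector<int> offsets(num_shaders + 1, 0);
  for (int shader : queued_shader) {
    if (shader >= 0) {
      offsets[shader + 1]++;
    }
  }

  // 2. Prefix sum: afterwards offsets[s] is the start of bucket s and
  //    offsets[num_shaders] is the total number of active paths.
  for (int s = 0; s < num_shaders; s++) {
    offsets[s + 1] += offsets[s];
  }

  // 3. Scatter each active path index into its bucket. Paths with the same
  //    shader keep their original relative order.
  std::vector<int> sorted(offsets[num_shaders]);
  std::vector<int> cursor(offsets.begin(), offsets.end() - 1);
  for (int i = 0; i < static_cast<int>(queued_shader.size()); i++) {
    const int shader = queued_shader[i];
    if (shader >= 0) {
      sorted[cursor[shader]++] = i;
    }
  }
  return sorted;
}
```

The serial loop in step 2 is the part that "Parallelize prefix_sum" refers to; on the GPU it would be replaced by a parallel scan.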

Display
  • For final render, let Cycles draw the image instead of copying pixels to Blender
    • Render pass support
Render Algorithms
  • Use native OptiX curve primitive for thick hair
  • Transparent shadows: can we terminate OptiX rays earlier when enough hits are found? (D12524)
  • Transparent shadows: tune max hits for performance / memory usage
  • Detect constant transparent shadows for triangles and avoid recording intersection and evaluating shader entirely
  • Detect transparent shadows that are purely an image texture lookup and perform it in the hit kernel
  • For volume stack init, implement volume all intersection that writes directly into the stack
Tuning
  • Automatically increase integrator state size depending on available memory
  • Tweak kernel compilation parameters (num threads per block, max registers)
    • Different parameters per kernel?

Event Timeline


Current progress on trying to eliminate the megakernel is in P2111, still not where I want it to be.

Compacting the state array seems not all that helpful.

What I did notice while working on that is that in the pvt_flat scene, the number of active paths often drops to a low number but is not refilled quickly. Reducing the tile size to prevent that not only avoids the performance regression, but actually speeds up rendering. However, this slows down other scenes.

                              compact+tilesize/4    tilesize/4     compact       no-megakernel      megakernel
bmw27.blend                   8.34171               8.07403        7.54844       7.5659             7.71559                             
classroom.blend               11.2431               10.9255        10.6584       10.6141            10.8755                             
pabellon.blend                5.91434               5.96131        5.70281       5.73752            5.87296                             
monster.blend                 6.65674               6.52214        6.94078       7.00113            8.08601                             
barbershop_interior.blend     8.43154               7.90963        8.20633       8.16851            8.75592                             
junkshop.blend                10.2858               10.4201        10.5334       10.5217            11.1836                             
pvt_flat.blend                10.116                10.2103        12.115        12.4971            11.0911

There must be something that can be done to get closer to the best of both.

Looking at the kernel execution times of bmw27, it's clear that optimizing init_from_camera for multiple work tiles would help, but it's only a part of the performance gap. There's something else going on here that is harder to pin down.

compact+tilesize/4
6.71538s: integrator_shade_surface integrator_sorted_paths_array prefix_sum
1.47519s: integrator_intersect_closest integrator_queued_paths_array
0.53159s: integrator_intersect_shadow integrator_queued_shadow_paths_array
0.38923s: integrator_shade_shadow integrator_queued_shadow_paths_array
0.32022s: integrator_init_from_camera integrator_terminated_paths_array
0.16891s: integrator_shade_background integrator_queued_paths_array

no-megakernel
6.16877s: integrator_shade_surface integrator_sorted_paths_array prefix_sum
1.17981s: integrator_intersect_closest integrator_queued_paths_array
0.41735s: integrator_shade_shadow integrator_queued_shadow_paths_array
0.36235s: integrator_intersect_shadow integrator_queued_shadow_paths_array
0.21522s: integrator_shade_background integrator_queued_paths_array
0.17480s: integrator_init_from_camera integrator_terminated_paths_array

Note about differentials: PBRT-v4 does not even pass differentials along with rays, but simply computes them from the camera information. This gives incorrect results through reflections and refractions, but may be close enough in practice.


Trying to figure out which parts of shade_surface_raytrace kernel are using most local memory:

zero max closures:                -64%
zero SVM stack size:              -11%
remove svm_eval_nodes:            -32%
remove direct light + AO pass:    -8%
remove voronoi node:              -1%
remove all texture nodes:         -2%
remove all closure nodes:         -2%
remove bevel + AO nodes:          -14%
remove all nodes but one:         -20%

So roughly:

  • 65% closures
  • 15% bevel + AO nodes
  • 10% SVM stack size
  • 5% other nodes
  • 5% other (including shader data)
Brecht Van Lommel (brecht) renamed this task from Cycles X - GPU Performance to Cycles: GPU Performance. Oct 28 2021, 2:54 PM