DRW: Refactor to support draw call batching
ClosedPublic

Authored by Clément Foucault (fclem) on Jun 3 2019, 1:24 PM.

Details

Summary

This refactor improves the CPU/memory efficiency of the draw structures and lowers
the driver overhead of issuing many drawcalls.

  • Model Matrix is now part of big UBOs that contain 1024 matrices.
  • Object Infos follow the same improvement.
  • Matrices are indexed by gl_BaseInstanceARB or a fallback uniform.
  • All these resources are using a single 32bit identifier (DRWResourceHandle).
  • DRWUniform & DRWCall are allocated in chunks to improve cache coherence & memory usage.
  • DRWUniform now supports up to vec4_copy.
  • Draw calls are allocated in chunks of 128 calls.
  • Draw calls are sorted by GPUBatches inside a chunk, to improve the batching process.
  • Draw calls are batched together if their resource ids are consecutive.
  • Draw calls are now batched into command lists to leverage Multi Draw Indirect which boosts performance significantly.
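
As a rough illustration of the "consecutive resource ids" rule above, here is a minimal sketch in C. The SketchCall and SketchCommand types and the sketch_merge_calls function are hypothetical stand-ins invented for this example, not the actual DRW structures; the sketch assumes that the first resource id of a run can serve as the base instance used to index the matrix UBO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; not the real DRW types. */
typedef struct SketchCall {
  const void *batch;    /* Geometry to draw (stands in for a GPUBatch pointer). */
  uint32_t resource_id; /* Index of the object's matrices inside a UBO chunk. */
} SketchCall;

typedef struct SketchCommand {
  const void *batch;
  uint32_t base_instance;  /* First resource id of the run. */
  uint32_t instance_count; /* Number of merged calls. */
} SketchCommand;

/* Merge runs of calls that share a batch and have consecutive resource ids.
 * `cmds` must have room for `call_len` entries. Returns the command count. */
size_t sketch_merge_calls(const SketchCall *calls, size_t call_len, SketchCommand *cmds)
{
  assert(call_len == 0 || (calls != NULL && cmds != NULL));
  size_t cmd_len = 0;
  for (size_t i = 0; i < call_len; i++) {
    if (cmd_len > 0) {
      SketchCommand *last = &cmds[cmd_len - 1];
      if (last->batch == calls[i].batch &&
          last->base_instance + last->instance_count == calls[i].resource_id) {
        /* Same geometry and the next id in sequence: extend the current run. */
        last->instance_count++;
        continue;
      }
    }
    /* Otherwise start a new command. */
    cmds[cmd_len].batch = calls[i].batch;
    cmds[cmd_len].base_instance = calls[i].resource_id;
    cmds[cmd_len].instance_count = 1;
    cmd_len++;
  }
  return cmd_len;
}
```

In this sketch, a gap in the ids or a change of batch starts a new command, which is why the order of calls inside a chunk affects how many commands are emitted.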

This has a great impact on CPU usage when using lots of instances. Even if the biggest
bottleneck in these situations is the depsgraph iteration, the driver overhead when doing
thousands of drawcalls is still high.

This only improves situations where the CPU is the bottleneck: small geometry, lots of
instances.

The next step is to sort the drawcalls inside a DRWCallChunk to improve the batching process
when the instancing order is fairly random.
Done
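
A minimal sketch of what such a per-chunk sort could look like, assuming calls are ordered first by batch and then by resource id so that mergeable calls end up adjacent. SketchSortCall, batch_key and sketch_sort_chunk are hypothetical names made up for this example; the real implementation may use a different key or sorting method.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for illustration only; not the real DRWCall. */
typedef struct SketchSortCall {
  uintptr_t batch_key;  /* Any stable per-batch key, e.g. the batch address. */
  uint32_t resource_id; /* Index of the object's matrices inside a UBO chunk. */
} SketchSortCall;

/* Order by batch first, then by resource id. */
static int sketch_call_cmp(const void *a, const void *b)
{
  const SketchSortCall *ca = (const SketchSortCall *)a;
  const SketchSortCall *cb = (const SketchSortCall *)b;
  if (ca->batch_key != cb->batch_key) {
    return (ca->batch_key < cb->batch_key) ? -1 : 1;
  }
  if (ca->resource_id != cb->resource_id) {
    return (ca->resource_id < cb->resource_id) ? -1 : 1;
  }
  return 0;
}

/* Sort the calls of one chunk in place. */
void sketch_sort_chunk(SketchSortCall *calls, size_t call_len)
{
  assert(call_len == 0 || calls != NULL);
  qsort(calls, call_len, sizeof(*calls), sketch_call_cmp);
}
```

Because the sort is limited to one chunk, its cost stays bounded by the chunk size regardless of how many objects the scene contains.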

Test scenes:

Diff Detail

Repository
rB Blender

Event Timeline

  • Object Mode: Outlines: Rewrite id pass generation
  • DRW: Add builtin uniform to get full DRWResourceHandle from shader
  • Object Mode: Add back lightprobe selection outlines
  • DRW: Use int instead of uint for DRWCall
  • DRW: Remove common_view_lib uniform default values
  • Edit Curve: Fix curve normals
  • Cleanup: GPUBatch: rename arguments
  • DRW: Make workaround for drivers with broken gl_InstanceID
  • Workbench: Use resource_id instead of own index
  • Workbench: Remove object_id and optimize material hash generation
  • DRW: Add draw call sorting
  • GPU: Add API to use multidrawindirect using GPUbatch
  • DRW: Use new GPUDrawList to speedup instancing
  • Workbench: Simplify / Speedup Material Hash

General stabilization and rebase on top of master

@Clément Foucault (fclem), it would help if you could push this as a temporary branch, so I can see the individual commits.

Done! the branch is tmp-drw-callbatching.

Some perf numbers using the perf_instancing_meshes.blend :

AMD Workstation | 29.0 -> 44.5 fps
Laptop Nvidia   | 24.0 -> 27.0 fps
Laptop Intel    |  4.7 ->  2.7 fps

The issue with Intel is related to the ModelMatrix moving to UBO storage. Apparently it is much slower to fetch, but I'm not sure if this is due to a bad driver or a GPU design thing.
More testing on Intel GPUs would be welcome.

High poly performance doesn't seem to be affected.

It seems quite strange that this is still slow when this is the preferred way of rendering a bunch of objects. I suspect it may be a thread divergence issue, since the high poly perf is not affected.

Brecht Van Lommel (brecht) requested changes to this revision.Fri, Aug 23, 4:32 PM

I guess the Intel performance regression is something to fix before this can be committed? Or is it mainly in an artificial case, and typical setups are not affected much?

The tmp-drw-callbatching branch is passing the Eevee tests, but there are some failures in the workbench tests. These do not happen when I go back to the last commit from master in that branch (0e1d4de).

I spent a while going through these commits and trying to understand them at a high level. I didn't see anything obviously bad, but I'd be lying if I said I understood everything or checked the details. Let me know if you want me to review in more detail.

source/blender/draw/intern/draw_manager_data.c
1267–1275

To be solved still?

source/blender/draw/intern/draw_manager_exec.c
1097

To be solved still?

1208

0 -> 0x0

This revision now requires changes to proceed.Fri, Aug 23, 4:32 PM
  • DRW: Fix regression when rendering using index ranges
  • DRW: Resource Handle: Use manual bitmasks and bitsifts
  • Workbench: Fix volumetric rendering
  • GPencil: Fix crash during render
  • DRW: Add DRW_shgroup_clear_framebuffer
  • GPencil: Replace stencil hack by clear commands
Clément Foucault (fclem) marked 3 inline comments as done.Thu, Sep 5, 4:00 PM

The regressions were fixed, but the tests hair_instancer_uv and visibility_particles were actually broken in 2.80 (scale affecting normals). I will update the test references after committing.

source/blender/draw/intern/draw_manager_exec.c
1097

Yes, but this is for compatibility with shaders from gpu_shader_*. More work is needed to get rid of all their usage in DRW.

Clément Foucault (fclem) marked an inline comment as done.Fri, Sep 6, 12:50 AM

About the Intel performance regression: I would really like some help testing whether the difference is noticeable in normal scenes on other drivers/GPUs.

Like I said, it does not appear to have a perf impact on high poly meshes (where drawcalls have a lot of polygons).

Keeping a deprecated code path for intel is quite annoying so I would prefer not to.

The grease pencil test is failing for me after this, with Cycles/Eevee/Workbench. The red stroke is missing.

Regarding performance:

  • Can you perhaps test the Eevee demo files on Intel? If those are ok I guess this is fine.
  • Is it possible to bisect the commits in the branch to find out which change may have caused the Intel performance regression?
  • Maybe @Germano Cavalcante (mano-wii) can help debugging or has some ideas for what might be causing it?

On the Intel laptop I'm facing a crash that can be roughly resolved like this:

diff --git a/source/blender/draw/modes/shaders/object_outline_detect_frag.glsl b/source/blender/draw/modes/shaders/object_outline_detect_frag.glsl
index f016bb7ead4..1c3bace3f63 100644
--- a/source/blender/draw/modes/shaders/object_outline_detect_frag.glsl
+++ b/source/blender/draw/modes/shaders/object_outline_detect_frag.glsl
@@ -36,7 +36,7 @@ void main()
 {
   ivec2 texel = ivec2(gl_FragCoord.xy);
 
-#ifdef GPU_ARB_texture_gather
+#if 0
   vec2 texel_size = 1.0 / vec2(textureSize(outlineId, 0).xy);
   vec2 uv = ceil(gl_FragCoord.xy) * texel_size;
 
diff --git a/source/blender/draw/modes/shaders/object_outline_prepass_frag.glsl b/source/blender/draw/modes/shaders/object_outline_prepass_frag.glsl
index 0d64a070429..5d6c4881b5b 100644
--- a/source/blender/draw/modes/shaders/object_outline_prepass_frag.glsl
+++ b/source/blender/draw/modes/shaders/object_outline_prepass_frag.glsl
@@ -8,8 +8,9 @@ flat in int objectId;
 out uint outId;
 
 /* Replace top 2 bits (of the 16bit output) by outlineId.
- * This leaves 16K different IDs to create outlines between objects. */
-#define SHIFT (32u - (16u - 2u))
+ * This leaves 16K different IDs to create outlines between objects.
+ * SHIFT = (32 - (16 - 2)) */
+#define SHIFT 18u
 
 void main()
 {

It is strange that GPU_ARB_texture_gather is no longer working. I've seen something similar happening due to a misaligned UBO. It's good to investigate.

  • GPencil: Fix wrong stencil ID clearing
  • GPU: Fix shader crashing on some intel gpu
  • DRW: Fix crash with old intel GPU driver

A test with the spaceship test file shows that the slowdown is not acceptable:
Workbench time 9.7ms -> 15.5ms

It is clearly caused by the matrices being in UBOs. This patch alone brings the performance back to normal.

diff --git a/source/blender/draw/modes/shaders/common_view_lib.glsl b/source/blender/draw/modes/shaders/common_view_lib.glsl
index 4c057d2ba88..c1d14a6c37c 100644
--- a/source/blender/draw/modes/shaders/common_view_lib.glsl
+++ b/source/blender/draw/modes/shaders/common_view_lib.glsl
@@ -80,8 +80,10 @@ layout(std140) uniform modelBlock
   ObjectMatrices drw_matrices[DRW_RESOURCE_CHUNK_LEN];
 };
 
-#define ModelMatrix (drw_matrices[resource_id].drw_modelMatrix)
-#define ModelMatrixInverse (drw_matrices[resource_id].drw_modelMatrixInverse)
+uniform mat4 unitMat = mat4(1);
+
+#define ModelMatrix (unitMat)
+#define ModelMatrixInverse (unitMat)
 
 #define resource_handle (resourceChunk * DRW_RESOURCE_CHUNK_LEN + resource_id)

So I'm trying to find a workaround, like passing a lot of matrices in an array of uniforms, but it will hold far fewer matrices than the UBO can contain.

Ok, just adding back ModelMatrix triggers the legacy path and draws without instancing, bringing back the old performance for Intel.

It's an easy fix.

  • Fix intel performance regression

Putting back ModelMatrix and ModelMatrixInverse AND ifdefing the UBO declaration
effectively triggers the old legacy path.

We now don't see any performance regression on Intel GPUs with this patch.

That said, this is only based on one GPU + driver test. We might restrict this
fix to specific driver string matches.

I didn't notice performance changes on Windows + Intel(R) HD Graphics 4000

I think this is ready to go in then?

This revision is now accepted and ready to land.Wed, Sep 11, 11:16 AM
This revision is now accepted and ready to land.Tue, Sep 17, 2:57 PM
  • Fix Gpencil matrix
  • Fix crash due to drawing with buffers not big enough

Verified to work correctly with various test files on Linux + Windows and Quadro RTX 5000. Also confirmed that the previous commit was crashing on Windows and that this solves it.

Tested on Windows 10 64-bit with an RTX 2080 Ti: no more crashes, and grease pencil strokes are visible while drawing.