Mesh Batch Cache: Refactor + Multithread
ClosedPublic

Authored by Clément Foucault (fclem) on Aug 6 2019, 3:21 PM.

Details

Summary

For clarity's sake, the batch cache now uses per-loop attributes exclusively.
While this wastes a bit of VRAM (in the few cases where per-vert
attribs are enough), it reduces complexity and the overall number of VBOs
to update in general situations.

This patch also makes VertexBuffer filling multithreaded. This makes the
update of dense meshes a bit faster. The main bottleneck is the
IndexBuffer update, which cannot be multithreaded efficiently (it has to
increment a counter and/or do a final sorting pass).

We introduce the concept of "extract" functions/steps.
Each extract function is executed in its own thread and, if possible,
uses multiple threads to loop over all elements.
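
As a rough illustration of the extract concept described above, here is a minimal sketch in C. All names (ExtractSketch, extract_run, sketch_double_iter) are hypothetical and not taken from the patch, and the chunks are run serially here to keep the example self-contained; in a threaded version each chunk would be handed to a separate worker.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one extract step: optional init/finish
 * callbacks plus a per-element callback. Not the patch's actual struct. */
typedef struct ExtractSketch {
  void (*init)(void *buf, int elem_len);
  void (*iter)(void *buf, int elem_index);
  void (*finish)(void *buf);
  bool use_threading; /* True when iter only writes to its own element. */
} ExtractSketch;

/* Run the per-element callback over the half-open range [start, end). */
static void extract_run_range(const ExtractSketch *extract, void *buf, int start, int end)
{
  for (int i = start; i < end; i++) {
    extract->iter(buf, i);
  }
}

/* chunk_size must be greater than zero. */
static void extract_run(const ExtractSketch *extract, void *buf, int elem_len, int chunk_size)
{
  if (extract->init) {
    extract->init(buf, elem_len);
  }
  if (extract->use_threading) {
    /* Each chunk covers a disjoint range, so chunks could run on separate
     * workers. They are executed one after another in this sketch. */
    for (int start = 0; start < elem_len; start += chunk_size) {
      int end = (start + chunk_size < elem_len) ? start + chunk_size : elem_len;
      extract_run_range(extract, buf, start, end);
    }
  }
  else {
    extract_run_range(extract, buf, 0, elem_len);
  }
  if (extract->finish) {
    extract->finish(buf);
  }
}

/* Example per-element callback: writes only to its own slot, which is what
 * makes it safe to split across chunks. */
static void sketch_double_iter(void *buf, int elem_index)
{
  ((int *)buf)[elem_index] = elem_index * 2;
}
```

A callback that appends to a shared index buffer would not fit this pattern, because every element would have to bump a shared counter; that is the bottleneck mentioned above.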

My results (a heavily subdivided sphere + lvl4 subsurf):

            |   Fps | Frame | Iter | Rdata
2.80master  |   4.8 | 206ms | 82ms |  11ms
Base        |   4.5 | 240ms | 98ms |  22ms
Opti        |   6.4 | 144ms | 22ms |  20ms

The 9ms speed loss (in Rdata) comes from requiring loop normals to be precomputed
before iteration. We can still recover this (this was a TODO; now done).

To reviewers: The multi-thread part starts in mesh_buffer_cache_create_requested.

GPU

  • Add GPUIndexBuf subrange: This allows rendering only a subset of an index buffer. This is nice as we can render each material's surfaces individually and the whole mesh with the same index buffer.
  • Add vertex format deinterleaving: This makes it possible to have each attrib use a contiguous portion of the vertex buffer, making attribute filling much easier and faster, as this is how they are stored in Blender Custom Data layers.
  • Batch: Reverse order of VBO binding: This ensures that vbo[0] always has precedence over other VBOs. This is important for overriding attributes by switching VBO binding order.
  • Make small float normal compression functions inlined: Removes some overhead in VBO creation.
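
To make the deinterleaving item concrete, the sketch below compares byte offsets in the two layouts. It is an illustration only: the function names are hypothetical, and it assumes tightly packed attributes with no alignment padding, which may not match what the GPU module actually does.

```c
#include <stddef.h>

/* Interleaved layout: all attributes of vertex 0, then all attributes of
 * vertex 1, and so on. Returns the byte offset of `attr` for `vert`. */
static size_t offset_interleaved(const size_t *attr_size, int attr_len, int attr, size_t vert)
{
  size_t stride = 0;
  size_t attr_ofs = 0;
  for (int i = 0; i < attr_len; i++) {
    if (i < attr) {
      attr_ofs += attr_size[i];
    }
    stride += attr_size[i];
  }
  return vert * stride + attr_ofs;
}

/* Deinterleaved layout: attribute 0 for every vertex, then attribute 1 for
 * every vertex, and so on. Each attribute occupies one contiguous block. */
static size_t offset_deinterleaved(const size_t *attr_size, int attr, size_t vert, size_t vert_len)
{
  size_t base = 0;
  for (int i = 0; i < attr; i++) {
    base += attr_size[i] * vert_len;
  }
  return base + vert * attr_size[attr];
}
```

With a contiguous block per attribute, filling one attribute from a source array of the same element type can be a single linear copy instead of a strided write.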

Mesh Batch Cache: Refactor

  • Restructure the buffer cache: one cache for the final mesh and one for the edit mesh cage.
  • Add debug timer.
  • Use Extract naming convention to name extract functions that fill vbo/ibo.
  • Separate extract functions into a separate file (for clarity).
  • Separate loose elements looping functions to avoid iteration complexity. (Unfortunately, this makes the code more verbose.)
  • Some iter functions are threadable and tagged as such.
  • Add multithreaded iteration for extract functions that support it.

Diff Detail

Repository
rB Blender
Branch
tmp-batch-cache-cleanup
Build Status
Buildable 4430
Build 4430: arc lint + arc unit

Event Timeline

Maybe it is not the place to put it but this is what a frame in the profiler looks like:

Note that we could get rid of the looptri calculation and index buffers (computation + data upload) if we cache the topology.

I also don't know if OSD is fully optimized given this visualization.

Great work, the timing results speak for themselves.

I'm getting a crash with tangents in the Eevee render tests, for example tests/render/mesh/tangent_no_uv.blend.

#0  0x00007ffff6561e97 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff6563801 in __GI_abort () at abort.c:79
#2  0x00000000027e542a in extract_orco_init (mr=0x7fff7acb9308, buf=0x7fff3278a208) at /home/brecht/dev/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:1852
#3  0x00000000027e9627 in extract_task_create (task_pool=0x7fff33dee188, mr=0x7fff7acb9308, extract=0x7d2ade8 <extract_orco>, buf=0x7fff3278a208, task_counter=0x7fff5c085d94) at /home/brecht/dev/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:3564
#4  0x00000000027e9014 in mesh_buffer_cache_create_requested (cache=<optimized out>, mbc=..., me=<optimized out>, do_final=<optimized out>, do_uvedit=<optimized out>, use_subsurf_fdots=false, cd_layer_used=<optimized out>, ts=<optimized out>, use_hide=<optimized out>)
    at /home/brecht/dev/blender/source/blender/draw/intern/draw_cache_extract_mesh.c:3699
#5  0x00000000027cdfae in DRW_mesh_batch_cache_create_requested (ob=0x7fff7acc7608, me=<optimized out>, scene=<optimized out>, is_paint_mode=<optimized out>, use_hide=false) at /home/brecht/dev/blender/source/blender/draw/intern/draw_cache_impl_mesh.c:1274
#6  0x0000000002773705 in DRW_render_object_iter (vedata=0x7fff8c3d3a88, engine=0x7fff7aca2008, depsgraph=0x7fff7abd9288, callback=0x279f120 <EEVEE_render_cache>) at /home/brecht/dev/blender/source/blender/draw/intern/draw_manager.c:2074
#7  0x000000000278fc02 in eevee_render_to_image (vedata=0x7fff8c3d3a88, engine=0x7fff7aca2008, render_layer=0x7fff7abd3f88, rect=0x7fff7b5cdc98) at /home/brecht/dev/blender/source/blender/draw/engines/eevee/eevee_engine.c:449
#8  0x0000000002773373 in DRW_render_to_image (engine=0x7fff7aca2008, depsgraph=0x7fff7abd9288) at /home/brecht/dev/blender/source/blender/draw/intern/draw_manager.c:2013
#9  0x00000000010bc359 in RE_engine_render (re=<optimized out>, do_all=<optimized out>) at /home/brecht/dev/blender/source/blender/render/intern/source/external_engine.c:778
#10 0x00000000010bf3e9 in do_render_3d (re=<optimized out>) at /home/brecht/dev/blender/source/blender/render/intern/source/pipeline.c:1147
#11 0x00000000010bf3e9 in do_render (re=<optimized out>) at /home/brecht/dev/blender/source/blender/render/intern/source/pipeline.c:1224
#12 0x00000000010bf3e9 in do_render_composite (re=<optimized out>) at /home/brecht/dev/blender/source/blender/render/intern/source/pipeline.c:1445
#13 0x00000000010bf3e9 in do_render_all_options (re=0x7fff99c04c08) at /home/brecht/dev/blender/source/blender/render/intern/source/pipeline.c:1719
#14 0x00000000010bef03 in RE_RenderFrame (re=0x7fff99c04c08, bmain=<optimized out>, scene=0x7fffd2923808, single_layer=<optimized out>, camera_override=<optimized out>, frame=<optimized out>, write_still=false) at /home/brecht/dev/blender/source/blender/render/intern/source/pipeline.c:2121
#15 0x000000000348a676 in render_startjob (rjv=0x7fff99dd31c8, stop=<optimized out>, do_update=<optimized out>, progress=<optimized out>) at /home/brecht/dev/blender/source/blender/editors/render/render_internal.c:670
#16 0x00000000012315f2 in do_job_thread (job_v=0x7fffd2841408) at /home/brecht/dev/blender/source/blender/windowmanager/intern/wm_jobs.c:383
#17 0x00007ffff7bbd6db in start_thread (arg=0x7fff7b5d1700) at pthread_create.c:463
#18 0x00007ffff664488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
source/blender/gpu/GPU_vertex_format.h
157

Since they are in a header file, maybe rename these utility functions to add a prefix like gpu_normal_quantize. They are rather generic now and could conflict.

Opening the mr_elephant.blend demo file crashes with the same backtrace. I didn't find other crashes in testing, tried editing in multiple modes with various data layers.

source/blender/draw/intern/draw_cache_extract_mesh.c
1562

To generate a safe unique GLSL name, you can encode the char array to base16 which would have double the length. Something like this (untested):

for (int i = 0; str[i]; i++) {
  str_base16[i*2 + 0] = 'a' + ((str[i] >> 4) & 16);
  str_base16[i*2 + 1] = 'a' + (str[i] & 16);
}

Have you tested performance with many low poly meshes? I don't know exactly what the overhead is of threading and task pools for many meshes.

It might make sense to disable threading below a certain number of vertices.

source/blender/draw/intern/draw_cache_extract_mesh.c
1598

MAX_NAME -> MAX_CUSTOMDATA_LAYER_NAME

They are the same for now, but regardless.

Brecht Van Lommel (brecht) requested changes to this revision. Aug 13 2019, 5:39 PM

I did not review every line in detail, but didn't find anything else to comment on, seems generally fine.

This revision now requires changes to proceed. Aug 13 2019, 5:39 PM
  • Mesh Batch Cache: Use threading only when mesh is large
  • Mesh Batch Cache: Remove mandatory creation of loop normals array
  • Mesh Batch Cache: Port BMesh mesh analysis to vbo extract
  • Mesh Batch Cache: Fix crash when using orco tangents without orco layer
Clément Foucault (fclem) marked 3 inline comments as done.
  • GPU: VertexFormat: Use prefix on inline functions
source/blender/draw/intern/draw_cache_extract_mesh.c
1562

This is a good idea. I played with this and found we can pack using base63 (grr, not base64, because var names can only contain [a-zA-Z0-9_]). This reduces the number of chars used. Unfortunately we cannot have very long attribute names because they are hashed upon request and stored in a fixed-length buffer. So the smaller the better.

I will add this in another patch.

(BTW: it should be ((str[i] >> 4) & 15) for the above code).
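
With that mask corrected, the base16 suggestion can be sketched as follows. This is only an illustration: the helper name name_to_base16 is hypothetical, a null terminator is added that the original snippet left out, and it is not the base63 packing planned for the follow-up patch.

```c
#include <stddef.h>

/* Hypothetical helper: encode `str` as base16 using the letters 'a'..'p',
 * two output chars per input byte. `out` must hold at least
 * 2 * strlen(str) + 1 bytes. */
static void name_to_base16(const char *str, char *out)
{
  size_t i;
  for (i = 0; str[i]; i++) {
    /* Go through unsigned char so bytes >= 0x80 are handled predictably. */
    unsigned char c = (unsigned char)str[i];
    out[i * 2 + 0] = (char)('a' + ((c >> 4) & 15)); /* High nibble. */
    out[i * 2 + 1] = (char)('a' + (c & 15));        /* Low nibble. */
  }
  out[i * 2] = '\0';
}
```

The output uses only lowercase letters, at the cost of doubling the name length, which is the drawback the base63 idea tries to reduce.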

Have you tested performance with many low poly meshes? I don't know exactly what the overhead is of threading and task pools for many meshes.

I did not test, but for now I disabled threading when the mesh is small (fewer than 8K loops, including loose verts). The next step would be to have a pool common to all meshes and only sync before drawing.
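
The size check described here might look like the sketch below. The constant and function names are hypothetical, and "8K" is read as 8192, which is an assumption; the patch may use a different value or count.

```c
#include <stdbool.h>

/* Assumed threshold: "8K loops" interpreted as 8192. Hypothetical name. */
#define SKETCH_MIN_LOOPS_FOR_THREADING 8192

/* Use threading only when the mesh is large enough for the task overhead
 * to pay off. Loose verts are counted together with loops, as stated in
 * the comment above. */
static bool sketch_use_threading(int loop_len, int loose_vert_len)
{
  return (loop_len + loose_vert_len) >= SKETCH_MIN_LOOPS_FOR_THREADING;
}
```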

Brecht Van Lommel (brecht) requested changes to this revision. Aug 14 2019, 3:21 PM

tangent_no_uv.blend is fixed, but still crashes on tangent_render_uv.blend and tangent_specific_uv_*.blend

Otherwise looks good to me.

This revision now requires changes to proceed. Aug 14 2019, 3:21 PM
  • Mesh Batch Cache: Use MAX_CUSTOMDATA_LAYER_NAME instead of MAX_NAME
  • Mesh Batch Cache: Fix crash in regression test concerning tangents

Looks good now.

This still causes two render test failures, but the new render is the correct one, so those reference images need to be updated.

This revision is now accepted and ready to land. Aug 14 2019, 6:13 PM