Highpoly mesh sculpting performance #68873

Open
opened 2019-08-20 17:03:21 +02:00 by Pablo Dobarro · 31 comments
Member

Updating smooth normals

BKE_pbvh_update_normals takes up 50% of the time in some cases. The use of atomics in pbvh_update_normals_accum_task_cb to add normals from faces to vertices is problematic. Ideally those should be avoided entirely, but that is not simple to do. Possibilities:

  • Figure out which vertices are shared with other nodes, and only use atomics for those.
  • Store adjacency info for the entire mesh, and gather normals per vertex.
  • Store adjacent faces of all border vertices of a node, and avoid a global vertex normal array and atomics.
  • Use a vertex normal buffer per node with duplicated vertices, and somehow merge the border vertices in a second step.
  • ...?

Smoothing tools may also be able to benefit from this, to work without the overhead of storing adjacency info of the entire mesh.

Depending on whether the brush tool needs normals, we could delay updating normals outside of the viewport.

Coherent memory access

Each PBVH node contains a subset of the mesh vertices and faces. These are not contiguous, so iterating over all vertices in a node leads to incoherent memory access.

Two things we can do here:

  • Ensure vertex and face indices in a node are at least sorted.
  • Reorder vertex and face indices in the original mesh, so that all vertices and faces unique to a node are stored next to each other in the global mesh array. This could also be used to reduce PBVH memory usage, since not all indices would need to be stored then, only a range of unique vertices plus the indices of border vertices owned by other nodes.

For multires this is less of an issue since all vertices within a grid are in one block, though it might help a little to not allocate every grid individually and instead have one allocation per node.
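A minimal sketch of the first idea, assuming the node stores its vertex indices as a plain int array (a stand-in, not the real PBVHNode):

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending integer order, written without
 * subtraction to avoid overflow for extreme values. */
static int cmp_int(const void *a, const void *b)
{
  const int ia = *(const int *)a;
  const int ib = *(const int *)b;
  return (ia > ib) - (ia < ib);
}

/* Sort the vertex indices stored in a node, so iterating the node touches
 * the global vertex arrays in ascending, more cache-friendly order. */
static void node_sort_vert_indices(int *vert_indices, int totvert)
{
  qsort(vert_indices, (size_t)totvert, sizeof(int), cmp_int);
}
```

Sorting would only need to happen when the node's index list is (re)built, not on every stroke step.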

Partial redraw

  • #70295 (Sculpt partial redraw not working)
  • For symmetry, the region covered by partial redraw can become arbitrarily big. We could add a test to see if one side is offscreen entirely. Other than that, we'd need to define multiple regions, which could then be used for culling the PBVH nodes. Drawing the viewport one time for each region is likely slow though, so it could be rendered at full viewport resolution but blit multiple separate regions.

Draw buffers

  • [x] D5926: Sculpt: multithread GPU draw buffer filling
  • [x] D5922: Sculpt: only update GPU buffers of PBVH nodes inside the viewport
  • [ ] Use GPU_vertbuf_raw_step to reduce the overhead of creating buffers in gpu_buffers.c (only used in a few places now)
  • [ ] Submitting vertex buffers to the GPU has some overhead. It may be possible to do those copies asynchronously in the driver.

Masks

Tagging PBVH nodes as fully masked would let us skip iterating over their vertices for sculpt tools. Drawing code could also avoid storing a buffer in this case, though the overhead of allocating/freeing that often may not be worth it.

Tagging PBVH nodes as fully unmasked would let us quickly skip drawing them as part of the overlay.

Masks are currently drawn in a separate pass as part of the overlays. It would be more efficient to draw them along with the original faces, so we can draw faces just once.
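To illustrate the tagging idea, a node could compute both flags in a single pass over its mask values. The flag names and function below are hypothetical, not the actual Blender API:

```c
#include <assert.h>

/* Hypothetical per-node flags: fully masked nodes can be skipped by
 * sculpt tools, fully unmasked nodes can be skipped by the mask overlay. */
enum {
  NODE_FULLY_MASKED = 1 << 0,
  NODE_FULLY_UNMASKED = 1 << 1,
};

/* One pass over the node's mask values, recomputed when the mask in the
 * node changes. */
static int node_mask_flags(const float *mask, int totvert)
{
  int masked = 0;
  int unmasked = 0;
  for (int i = 0; i < totvert; i++) {
    if (mask[i] >= 1.0f) {
      masked++;
    }
    else if (mask[i] <= 0.0f) {
      unmasked++;
    }
  }
  int flag = 0;
  if (masked == totvert) {
    flag |= NODE_FULLY_MASKED;
  }
  if (unmasked == totvert) {
    flag |= NODE_FULLY_UNMASKED;
  }
  return flag;
}
```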

Consolidate vertex loops

There are various operations that loop over all vertices or faces. The sculpt brush operation, merging results for symmetry, bounding box updates, normal updates, draw buffer updates, etc.

Some of these may be possible to merge together, to reduce the overhead of threading and the cost of memory access and cache misses.

Bounding box frustum tests

Sculpt tools that take the frustum into account only use 4 clipping planes; we should add another plane to clip nodes behind the camera. But unlike drawing, don't use the clip end, and always have the clip start equal to 0.

Frustum - AABB intersection tests do not appear to be a bottleneck currently, but there are some possible optimizations here:

  • For inner nodes detected to be fully contained in the frustum, skip tests for all child nodes
  • Multithreaded tree traversal, these tests are single threaded in most cases now
  • Cache visibility of nodes per viewport
  • Use center + half size instead of min + max for storing bounding boxes
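The last point also simplifies the test itself. A sketch of a planes vs. AABB test using the center + half size representation; the plane convention n·p + d >= 0 meaning "inside" is an assumption of this illustration:

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Returns false when the box is fully outside any plane, so the node can
 * be culled. With center + half size, the box's projected radius onto a
 * plane normal is just a dot product with the absolute normal components. */
static bool aabb_intersects_planes(float (*planes)[4], int totplane,
                                   const float center[3], const float half[3])
{
  for (int i = 0; i < totplane; i++) {
    const float *p = planes[i];
    const float dist = p[0] * center[0] + p[1] * center[1] + p[2] * center[2] + p[3];
    const float radius = fabsf(p[0]) * half[0] + fabsf(p[1]) * half[1] +
                         fabsf(p[2]) * half[2];
    if (dist < -radius) {
      return false; /* Box fully on the outside of this plane. */
    }
  }
  return true;
}
```

A fully-inside result (dist > radius for every plane) is what would let an inner node skip the tests for all its children, as suggested above.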

Threading

  • It may be worth testing if the current settings for BLI_parallel_range_settings_defaults are still optimal. Maybe the node limit can be removed, the chunk size could be reduced or increased, or scheduling could be dynamic instead of static.

This was now changed to remove the node limit and use dynamic scheduling with chunk size 1, which gave about a 10% performance improvement. For a high number of nodes it may be worth increasing the chunk size.

Symmetry

For X symmetry we currently do 2 loops over all vertices, and then another loop to merge them. These 3 could perhaps be merged into one loop, though the code might become significantly more complicated, as every brush tool may need code to handle symmetry.

Low level optimizations

Overall, this kind of optimization requires carefully analyzing code that runs per mesh element, and trying to make it faster.

Sculpt tools support many settings, and the number of function calls, conditionals and pointer indirections adds up. It can be worth testing what happens when most of the code is removed, to see what kind of overhead there is.

It can help to copy some commonly used variables onto the stack in functions, ensuring that they can stay in registers and avoiding pointer aliasing. Tests that check multiple variables could be precomputed and the result stored in a bitflag.

More functions can be inlined in some cases. For example, the bmesh iterators used for dyntopo go through function pointers and function calls, while they could really be a simple double loop over chunks and the elements within the chunks.

PBVH building

Building the PBVH is not the most performance-critical part, since it only happens when entering sculpt mode, but there is room for optimization anyway. The most obvious one is multithreading.

Brush radius bounds

Culling of nodes outside the brush radius is disabled for 2D Falloff:

bool sculpt_search_circle_cb(PBVHNode *node, void *data_v)
{
  ...
  /* The "|| 1" makes this always return true, so no node is culled. */
  return dist_sq < data->radius_squared || 1;
}

Elastic Deform has no bounds, but it may be possible to compute some even if they are bigger than the brush radius.

Memory allocations for all vertices

Some sculpt tools allocate arrays the size of all vertices for temporary data. For operations that are local, it would be better to allocate arrays per PBVH node when possible.

In some cases this might make little difference: virtual memory pages may only be mapped on demand, once there are actual reads/writes (though this is not obviously guaranteed for all allocators and operating systems).

Regarding coherent memory access, this could also improve performance if vertices are grouped per node as described above.

Undo

Undo pushes all nodes whose bounding boxes are within the brush radius. However, that doesn't mean any vertices in those nodes are actually affected by the brush. In a simple test painting on a sphere, it pushed e.g. 18 nodes but only actually modified 7.

We can reduce undo memory by delaying the undo push until we know vertices within the node are about to be modified, though this may have a small performance impact. Ideally this would take into account both the brush radius test and masking/textures.

Similarly, we also sometimes call BKE_pbvh_node_mark_redraw or BKE_pbvh_node_mark_normals_update for nodes without checking if any vertices within have actually been modified.
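The deferred push can be sketched as below, with a hypothetical stand-in struct rather than the real sculpt undo types:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PBVH node's undo state. */
typedef struct NodeUndoState {
  bool pushed;
  int push_count; /* Only for illustrating how many pushes happened. */
} NodeUndoState;

static void push_undo_data(NodeUndoState *state)
{
  /* A real implementation would copy the node's vertex data here. */
  state->push_count++;
}

/* Called right before the first actual write to a vertex in the node,
 * i.e. after the brush radius and mask/texture tests have passed, so
 * nodes that are never really modified are never pushed. */
static void node_ensure_undo_push(NodeUndoState *state)
{
  if (!state->pushed) {
    state->pushed = true;
    push_undo_data(state);
  }
}
```

The same first-write check could drive BKE_pbvh_node_mark_redraw and BKE_pbvh_node_mark_normals_update, so only truly modified nodes get tagged.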


The point of the PBVH is to be able to do partial updates quickly. If doing many partial updates is somehow significantly slower than updating the mesh as a whole, that is something to be fixed. There is no good reason for it to be slower.

The solution should not be to take some separate code path that updates the mesh as a whole, but rather fixing the bottleneck in the partial updates.


This issue was referenced by c931a0057ffea26175a2dc111718e5f3590b00f8


Some profiles from a 3 million poly mesh after the latest optimizations.

Running single threaded with -t 1. The multithreaded one is not as readable as a screenshot, but the hotspots are similar.

Large draw brush. Bottleneck is mainly the sculpting itself, with symmetry here.

sculpt_perf_large_brush.png

Mesh filter. Clearly the normal update is the problem here. Not using atomics there makes it 2x faster overall, but can also give wrong results.
sculpt_perf_filter.png

The impact of incoherent memory access is not visible in profiles like this, but it's probably worth hacking together some code for that to evaluate how much it helps, and then seeing if it's worth implementing properly.


@PabloDobarro - another performance suggestion for sculpt/paint is to maintain a lower resolution version of what you are working on that is updated and rendered immediately as the stroke occurs; then the stroke is applied to the higher resolution version of the mesh/image in separate threads, and they are rendered and replace the low res rendering as they are completed. This can reduce the amount of mesh and image data kept in memory, or allow meshes/images that would greatly exceed memory, and allow compression of the parts of the mesh/image not in use.

The lower res object and image data can use about 1/4 to 1/8 the memory of the full object and images (or even drastically less for large images that are zoomed out); and then only the chunks of mesh data and image data that are actively being changed need to be kept in memory. Which chunks are needed is fairly predictable based on stroke direction, so loading and unloading them shouldn't introduce lag.

Julien Kaspar added this to the Sculpt, Paint & Texture project 2023-02-08 10:48:53 +01:00
Philipp Oeser removed the Interest: Sculpt, Paint & Texture label 2023-02-10 09:12:49 +01:00
Reference: blender/blender#68873