## Updating smooth normals
`BKE_pbvh_update_normals` takes up 50% of the time in some cases. The use of atomics in `pbvh_update_normals_accum_task_cb` to add normals from faces to vertices is problematic. Ideally those should be avoided entirely, but it's not simple to do so. Possibilities:
* Figure out which vertices are shared with other nodes, and only use atomics for those.
* Store adjacency info for the entire mesh, and gather normals per vertex.
* Store adjacent faces of all border vertices of a node, and avoid a global vertex normal array and atomics.
* Use a vertex normal buffer per node with duplicated vertices, and somehow merge the border vertices in a second step.
Smoothing tools may also be able to benefit from this, to work without the overhead of storing adjacency info of the entire mesh.
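The last possibility in the list above (a normal buffer per node with duplicated vertices, merged in a second step) could look roughly like the sketch below. `NormalNode` and both functions are hypothetical names for illustration, not the actual PBVH structs or API, and the final normalization is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-node data, not the actual PBVH structs. */
typedef struct NormalNode {
  const int *vert_indices; /* Local vertex index -> global vertex index. */
  int totvert;
  const int (*tris)[3];    /* Triangles as local vertex indices. */
  int tottri;
  float (*local_no)[3];    /* Scratch buffer with totvert entries. */
} NormalNode;

/* Step 1: only writes to the node's own buffer, so nodes can be
 * processed in parallel without atomics. */
static void node_accumulate_normals(NormalNode *node, const float (*positions)[3])
{
  memset(node->local_no, 0, sizeof(float[3]) * (size_t)node->totvert);
  for (int t = 0; t < node->tottri; t++) {
    const int *tri = node->tris[t];
    const float *a = positions[node->vert_indices[tri[0]]];
    const float *b = positions[node->vert_indices[tri[1]]];
    const float *c = positions[node->vert_indices[tri[2]]];
    const float e1[3] = {b[0] - a[0], b[1] - a[1], b[2] - a[2]};
    const float e2[3] = {c[0] - a[0], c[1] - a[1], c[2] - a[2]};
    /* Unnormalized cross product: an area-weighted face normal. */
    const float n[3] = {e1[1] * e2[2] - e1[2] * e2[1],
                        e1[2] * e2[0] - e1[0] * e2[2],
                        e1[0] * e2[1] - e1[1] * e2[0]};
    for (int k = 0; k < 3; k++) {
      float *dst = node->local_no[tri[k]];
      dst[0] += n[0];
      dst[1] += n[1];
      dst[2] += n[2];
    }
  }
}

/* Step 2: sum the duplicated border vertices into the global array.
 * The result is unnormalized; the caller normalizes afterwards. */
static void merge_node_normals(const NormalNode *nodes, int totnode,
                               float (*global_no)[3], int totvert)
{
  memset(global_no, 0, sizeof(float[3]) * (size_t)totvert);
  for (int n = 0; n < totnode; n++) {
    for (int i = 0; i < nodes[n].totvert; i++) {
      float *dst = global_no[nodes[n].vert_indices[i]];
      const float *src = nodes[n].local_no[i];
      dst[0] += src[0];
      dst[1] += src[1];
      dst[2] += src[2];
    }
  }
}
```

The merge step here is serial and touches every vertex. A real implementation would likely restrict it to border vertices, which is the "somehow" that still needs to be figured out.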
Depending on whether the brush tool needs normals, we could delay updating normals outside of the viewport.
## Coherent memory access
Each PBVH node contains a subset of the mesh vertices and faces. These are not contiguous, so iterating over all vertices in a node leads to incoherent memory access.
Two things we can do here:
* Ensure vertex and face indices in a node are at least sorted.
* Reorder vertex and face indices in the original mesh, so that all vertices and faces unique to a node are stored next to each other in the global mesh array. This could also be used to reduce PBVH memory usage, since not all indices need to be stored then, only a range of unique vertices + indices of border vertices owned by other nodes.
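A minimal sketch of the reordering idea, assuming we already know which node owns each vertex. The `vert_owner` array and the function name are hypothetical. It is a stable counting sort, so vertices within a node also stay sorted by their original index.

```c
#include <stdbool.h>
#include <stdlib.h>

/* vert_owner[v]: index of the node that owns vertex v (hypothetical input).
 * r_new_index[v]: position of vertex v in the reordered global arrays.
 * r_node_start: totnode + 1 entries, node n owns the range
 *               [r_node_start[n], r_node_start[n + 1]). */
static bool reorder_verts_by_node(const int *vert_owner, int totvert, int totnode,
                                  int *r_new_index, int *r_node_start)
{
  int *cursor = malloc(sizeof(int) * (size_t)(totnode > 0 ? totnode : 1));
  if (cursor == NULL) {
    return false;
  }
  /* Count the vertices owned by each node. */
  for (int n = 0; n <= totnode; n++) {
    r_node_start[n] = 0;
  }
  for (int v = 0; v < totvert; v++) {
    r_node_start[vert_owner[v] + 1]++;
  }
  /* Prefix sum turns the counts into start offsets. */
  for (int n = 0; n < totnode; n++) {
    r_node_start[n + 1] += r_node_start[n];
    cursor[n] = r_node_start[n];
  }
  /* Assign new positions, keeping the original order within each node. */
  for (int v = 0; v < totvert; v++) {
    r_new_index[v] = cursor[vert_owner[v]]++;
  }
  free(cursor);
  return true;
}
```

With this layout a node only needs to store its start and count for unique vertices, plus an explicit index list for border vertices owned by other nodes.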
## Partial redraw
* For symmetry, the region covered by partial redraw can become arbitrarily big. We could add a test to see if one side is offscreen entirely. Other than that, we'd need to define multiple regions, which could then be used for culling the PBVH nodes. Drawing the viewport one time for each region is likely slow though, so it could be rendered at full viewport resolution but blit multiple separate regions.
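The offscreen test could be as simple as clipping each symmetry side's rectangle against the viewport before taking the union. This is only a sketch with a hypothetical `RedrawRect` type, it does not reflect the actual partial redraw code.

```c
#include <stdbool.h>

typedef struct RedrawRect {
  int xmin, xmax, ymin, ymax;
} RedrawRect;

/* Clip rect to the viewport, return false if nothing remains visible. */
static bool redraw_rect_clip(RedrawRect *rect, const RedrawRect *viewport)
{
  if (rect->xmin < viewport->xmin) rect->xmin = viewport->xmin;
  if (rect->ymin < viewport->ymin) rect->ymin = viewport->ymin;
  if (rect->xmax > viewport->xmax) rect->xmax = viewport->xmax;
  if (rect->ymax > viewport->ymax) rect->ymax = viewport->ymax;
  return rect->xmin < rect->xmax && rect->ymin < rect->ymax;
}

/* Union of the rectangles of all symmetry sides, skipping sides that are
 * entirely offscreen. Returns false if no side is visible. */
static bool redraw_rect_union_visible(const RedrawRect *rects, int num,
                                      const RedrawRect *viewport, RedrawRect *r_result)
{
  bool any = false;
  for (int i = 0; i < num; i++) {
    RedrawRect rect = rects[i];
    if (!redraw_rect_clip(&rect, viewport)) {
      continue;
    }
    if (!any) {
      *r_result = rect;
      any = true;
      continue;
    }
    if (rect.xmin < r_result->xmin) r_result->xmin = rect.xmin;
    if (rect.ymin < r_result->ymin) r_result->ymin = rect.ymin;
    if (rect.xmax > r_result->xmax) r_result->xmax = rect.xmax;
    if (rect.ymax > r_result->ymax) r_result->ymax = rect.ymax;
  }
  return any;
}
```

This only helps when one side is fully offscreen. When both sides are visible the union can still cover most of the viewport, which is where multiple regions would be needed.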
## Draw buffers
[ ] Use `GPU_vertbuf_raw_step` to reduce overhead of creating buffers in `gpu_buffers.c` (only used in a few places now)
[ ] Submitting vertex buffers to the GPU has some overhead. It may be possible to do those copies asynchronously in the driver.
Tagging PBVH nodes as fully masked would let us skip iterating over their vertices for sculpt tools. Drawing code could also avoid storing a buffer in this case, though the overhead of allocating/freeing that often may not be worth it.
Tagging PBVH nodes as fully unmasked would let us quickly skip drawing them as part of the overlay.
Masks are currently drawn in a separate pass as part of the overlays. It would be more efficient to draw them along with the original faces, so we can draw faces just once.
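Computing the fully masked / fully unmasked tags is a single pass over the node's mask values. A rough sketch, where the flag names and function are made up for illustration:

```c
/* Hypothetical node flags, not existing PBVH flags. */
enum {
  MASK_NODE_FULLY_MASKED = (1 << 0),
  MASK_NODE_FULLY_UNMASKED = (1 << 1),
};

/* mask: per-vertex mask in the global vertex array, 0 = unmasked, 1 = masked.
 * Note that a node without vertices gets both flags. */
static int node_mask_flags(const float *mask, const int *vert_indices, int totvert)
{
  int flags = MASK_NODE_FULLY_MASKED | MASK_NODE_FULLY_UNMASKED;
  /* Stop early once neither flag can hold anymore. */
  for (int i = 0; i < totvert && flags != 0; i++) {
    const float m = mask[vert_indices[i]];
    if (m < 1.0f) {
      flags &= ~MASK_NODE_FULLY_MASKED;
    }
    if (m > 0.0f) {
      flags &= ~MASK_NODE_FULLY_UNMASKED;
    }
  }
  return flags;
}
```

The flags would need to be recomputed whenever a mask brush or mask operator touches the node.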
## Consolidate vertex loops
There are various operations that loop over all vertices or faces. The sculpt brush operation, merging results for symmetry, bounding box updates, normal updates, draw buffer updates, etc.
Some of these may be possible to merge together, to reduce the overhead of threading and the cost of memory access and cache misses.
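As an illustration of merging loops, the sketch below applies a simple offset to a node's vertices and updates the node bounding box in the same pass, so each coordinate is only loaded once. The function and its parameters are hypothetical, and a real brush is of course more involved than a constant offset.

```c
#include <float.h>

/* strength: per local vertex brush strength, 0 means unaffected.
 * Returns the number of modified vertices. */
static int node_offset_and_update_bounds(float (*positions)[3],
                                         const int *vert_indices, int totvert,
                                         const float *strength, const float offset[3],
                                         float r_bb_min[3], float r_bb_max[3])
{
  int num_modified = 0;
  for (int k = 0; k < 3; k++) {
    r_bb_min[k] = FLT_MAX;
    r_bb_max[k] = -FLT_MAX;
  }
  for (int i = 0; i < totvert; i++) {
    float *co = positions[vert_indices[i]];
    /* Brush operation. */
    if (strength[i] != 0.0f) {
      for (int k = 0; k < 3; k++) {
        co[k] += offset[k] * strength[i];
      }
      num_modified++;
    }
    /* Bounding box update, while the coordinate is still in cache. */
    for (int k = 0; k < 3; k++) {
      if (co[k] < r_bb_min[k]) r_bb_min[k] = co[k];
      if (co[k] > r_bb_max[k]) r_bb_max[k] = co[k];
    }
  }
  return num_modified;
}
```

The returned count could also be used to skip follow-up work (normals, draw buffers) for nodes where nothing changed.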
## Bounding box frustum tests
Sculpt tools that take into account the frustum only use 4 clipping planes. We should add another plane to clip nodes behind the camera. But unlike drawing, don't use clip end, and always have clip start equal to 0.
Frustum - AABB intersection tests do not appear to be a bottleneck currently. But some possible optimizations here:
* For inner nodes detected to be fully contained in the frustum, skip tests for all child nodes
* Multithreaded tree traversal, these tests are single threaded in most cases now
* Cache visibility of nodes per viewport
* Use center + half size instead of min + max for storing bounding boxes
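A sketch of a frustum test using center + half size, which also reports when a box is fully inside so that child nodes can skip their tests. It assumes planes stored as `(nx, ny, nz, d)` with normals pointing into the frustum. The enum and function names are made up for illustration.

```c
typedef enum FrustumResult {
  FRUSTUM_OUTSIDE,
  FRUSTUM_INTERSECT,
  FRUSTUM_INSIDE,
} FrustumResult;

static float frustum_abs(float f)
{
  return (f < 0.0f) ? -f : f;
}

/* planes: a point p is on the inner side of a plane when
 * dot(plane.xyz, p) + plane.w >= 0. */
static FrustumResult frustum_aabb_test(const float (*planes)[4], int num_planes,
                                       const float center[3], const float half_size[3])
{
  FrustumResult result = FRUSTUM_INSIDE;
  for (int i = 0; i < num_planes; i++) {
    const float *p = planes[i];
    /* Signed distance of the box center to the plane. */
    const float dist = p[0] * center[0] + p[1] * center[1] + p[2] * center[2] + p[3];
    /* Extent of the box projected onto the plane normal. */
    const float radius = frustum_abs(p[0]) * half_size[0] +
                         frustum_abs(p[1]) * half_size[1] +
                         frustum_abs(p[2]) * half_size[2];
    if (dist < -radius) {
      return FRUSTUM_OUTSIDE; /* Entirely behind this plane. */
    }
    if (dist < radius) {
      result = FRUSTUM_INTERSECT; /* Straddles this plane. */
    }
  }
  return result;
}
```

When an inner node returns `FRUSTUM_INSIDE`, the traversal can accept all its children without testing them. With min + max storage the same test needs to pick a box corner per plane, center + half size avoids that branch.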
[X] It may be worth testing if the current settings for `BLI_parallel_range_settings_defaults` are still optimal. Maybe the node limit can be removed, chunk size could be reduced or increased, or scheduling could be dynamic instead of static.
Changed now to remove node limit and use dynamic scheduling with chunk size 1, gave about a 10% performance improvement. For a high number of nodes it may be worth increasing the chunk size.
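For reference, dynamic scheduling with a chunk size boils down to worker threads grabbing the next chunk from a shared atomic counter. The sketch below is a generic C11 illustration with hypothetical names, not how the BLI task code is actually implemented.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct ChunkQueue {
  atomic_int next; /* First index that has not been handed out yet. */
  int total;       /* Number of nodes to process. */
  int chunk_size;  /* Nodes handed out per request. */
} ChunkQueue;

static void chunk_queue_init(ChunkQueue *queue, int total, int chunk_size)
{
  atomic_init(&queue->next, 0);
  queue->total = total;
  queue->chunk_size = chunk_size;
}

/* Called by each worker thread in a loop until it returns false.
 * On success the worker processes the nodes in [*r_start, *r_end). */
static bool chunk_queue_pop(ChunkQueue *queue, int *r_start, int *r_end)
{
  const int start = atomic_fetch_add(&queue->next, queue->chunk_size);
  if (start >= queue->total) {
    return false;
  }
  const int end = start + queue->chunk_size;
  *r_start = start;
  *r_end = (end < queue->total) ? end : queue->total;
  return true;
}
```

A chunk size of 1 gives the best load balancing but one atomic operation per node. With a high number of nodes, a larger chunk size trades some balance for less contention on the counter.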
For X symmetry we currently do 2 loops over all vertices, and then do another loop to merge them. These 3 could perhaps be merged into one loop, though code might become significantly more complicated as every brush tool may need code to handle symmetry.
## Low level optimizations
Overall, this kind of optimization requires carefully analyzing code that runs per mesh element, and trying to make it faster.
Sculpt tools support many settings, and the number of function calls, conditionals and following of pointers adds up. It can be worth testing what happens when most of the code is removed, to see what kind of overhead there is.
It can help to copy some commonly used variables onto the stack in functions, ensuring that they can stay in registers and avoiding pointer aliasing. Tests that check multiple variables could be precomputed and the result stored in a bitflag.
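A sketch of the bitflag idea: the settings are inspected once per stroke step, and the per-vertex code only tests a local integer. The settings struct, flag names and the tests themselves are hypothetical and much simpler than what real brushes do.

```c
#include <stdbool.h>

/* Hypothetical subset of brush settings. */
typedef struct BrushSettingsSketch {
  bool front_faces_only;
  bool use_mask;
  float texture_strength;
} BrushSettingsSketch;

enum {
  VERT_TEST_FRONT_FACE = (1 << 0),
  VERT_TEST_MASK = (1 << 1),
  VERT_TEST_TEXTURE = (1 << 2),
};

/* Done once, outside of the per-vertex loop. */
static int brush_tests_precompute(const BrushSettingsSketch *brush)
{
  int tests = 0;
  if (brush->front_faces_only) tests |= VERT_TEST_FRONT_FACE;
  if (brush->use_mask) tests |= VERT_TEST_MASK;
  if (brush->texture_strength > 0.0f) tests |= VERT_TEST_TEXTURE;
  return tests;
}

/* Per-vertex factor. `tests` is passed by value so it can stay in a register. */
static float vert_factor(int tests, float view_dot, float mask, float texture)
{
  if (tests == 0) {
    return 1.0f; /* Fast path when no optional feature is enabled. */
  }
  if ((tests & VERT_TEST_FRONT_FACE) && view_dot <= 0.0f) {
    return 0.0f;
  }
  float factor = 1.0f;
  if (tests & VERT_TEST_MASK) factor *= 1.0f - mask;
  if (tests & VERT_TEST_TEXTURE) factor *= texture;
  return factor;
}
```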
More functions can be inlined in some cases. For example bmesh iterators used for dyntopo go through function pointers and function calls, while they really can be a simple double loop over chunks and the elements within the chunks.
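To illustrate what such a double loop looks like, here is a sketch over a simplified chunked pool. The chunk layout, the `is_free` marker and all names are assumptions for illustration and do not match the actual mempool or BMesh structs.

```c
#include <stddef.h>

#define SKETCH_ELEMS_PER_CHUNK 4

typedef struct VertSketch {
  float co[3];
  int is_free; /* Nonzero for unused slots in the pool. */
} VertSketch;

typedef struct VertChunk {
  struct VertChunk *next;
  VertSketch elems[SKETCH_ELEMS_PER_CHUNK];
} VertChunk;

/* Outer loop over chunks, inner loop over the elements in a chunk.
 * There are no function pointers or calls per element, so the compiler
 * can inline and unroll the inner loop. Returns the number of live vertices. */
static int translate_all_verts(VertChunk *first, const float offset[3])
{
  int num_live = 0;
  for (VertChunk *chunk = first; chunk != NULL; chunk = chunk->next) {
    for (int i = 0; i < SKETCH_ELEMS_PER_CHUNK; i++) {
      VertSketch *v = &chunk->elems[i];
      if (v->is_free) {
        continue;
      }
      v->co[0] += offset[0];
      v->co[1] += offset[1];
      v->co[2] += offset[2];
      num_live++;
    }
  }
  return num_live;
}
```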
## PBVH building
Building the PBVH is not the most performance critical since it only happens when entering sculpt mode, but there is room for optimization anyway. The most obvious one is multithreading.
## Brush radius bounds
Culling of nodes outside the brush radius is disabled for 2D Falloff:
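```c
bool sculpt_search_circle_cb(PBVHNode *node, void *data_v)
{
  /* ... */
  /* The trailing `|| 1` makes this test always pass, so no node is culled. */
  return dist_sq < data->radius_squared || 1;
}
```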
Elastic Deform has no bounds, but it may be possible to compute some even if they are bigger than the brush radius.
## Memory allocations for all vertices
Some sculpt tools allocate arrays the size of all vertices for temporary data. For operations that are local, it would be better to allocate arrays per PBVH node when possible.
In some cases this might make little difference, virtual memory pages may be mapped on demand until there are actual reads/writes (though this is not obviously guaranteed for all allocators and operating systems?).
Also regarding coherent memory access, this could improve performance, if vertices are grouped per node as described above.
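A sketch of per-node temporary storage that is only allocated for nodes a tool actually touches. `NodeScratch` and its functions are hypothetical names, and the indexing assumes the tool can address vertices by their local index within a node.

```c
#include <stdlib.h>

/* Hypothetical temporary data for one PBVH node. */
typedef struct NodeScratch {
  float *values; /* NULL until the node is first touched. */
  int totvert;   /* Number of vertices in the node. */
} NodeScratch;

/* Returns a zero-initialized array with one entry per node vertex,
 * allocating it on first use. Returns NULL if allocation fails. */
static float *node_scratch_ensure(NodeScratch *scratch)
{
  if (scratch->values == NULL) {
    scratch->values = calloc((size_t)scratch->totvert, sizeof(float));
  }
  return scratch->values;
}

static void node_scratch_free_all(NodeScratch *scratch, int totnode)
{
  for (int n = 0; n < totnode; n++) {
    free(scratch[n].values);
    scratch[n].values = NULL;
  }
}
```

For a local operation only the few nodes under the brush end up with an allocation, instead of one array the size of the whole mesh.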
Undo pushes all nodes whose bounding boxes are within the brush radius. However that doesn't mean any vertices in that node are actually affected by the brush. In a simple test painting on a sphere, it pushed e.g. 18 nodes but only actually modified 7.
We can reduce undo memory by delaying the undo push until we know any vertices within the node are about to be modified, though this may have a small performance impact. Ideally this would take into account both the brush radius test and masking/textures.
Similarly, we also sometimes call `BKE_pbvh_node_mark_redraw` or `BKE_pbvh_node_mark_normals_update` for nodes without checking if any vertices within have actually been modified.
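A sketch of how the undo push and update tagging could be delayed until the first vertex in a node is actually modified. `StrokeNode`, the stub undo push and the simplified brush (a masked constant offset) are all assumptions for illustration, not existing sculpt code.

```c
#include <stdbool.h>

/* Hypothetical per-node state during a stroke step. */
typedef struct StrokeNode {
  const int *vert_indices;
  int totvert;
  bool undo_pushed;
  bool needs_update; /* Stands in for the redraw / normals update tags. */
} StrokeNode;

/* Placeholder for the real undo push, which would copy the node's
 * original coordinates. */
static void stroke_node_undo_push_stub(StrokeNode *node)
{
  node->undo_pushed = true;
}

static void brush_apply_to_node(StrokeNode *node, float (*positions)[3],
                                const float *mask, const float center[3],
                                float radius, const float offset[3])
{
  const float radius_sq = radius * radius;
  for (int i = 0; i < node->totvert; i++) {
    const int v = node->vert_indices[i];
    float *co = positions[v];
    const float d[3] = {co[0] - center[0], co[1] - center[1], co[2] - center[2]};
    if (d[0] * d[0] + d[1] * d[1] + d[2] * d[2] >= radius_sq) {
      continue; /* Outside the brush radius. */
    }
    const float factor = 1.0f - mask[v];
    if (factor <= 0.0f) {
      continue; /* Fully masked. */
    }
    /* Push undo right before the first write, so it still stores the
     * original coordinates. */
    if (!node->undo_pushed) {
      stroke_node_undo_push_stub(node);
    }
    co[0] += offset[0] * factor;
    co[1] += offset[1] * factor;
    co[2] += offset[2] * factor;
    node->needs_update = true;
  }
}
```

The extra branch per modified vertex is the small performance cost mentioned above. Nodes that end the loop with `needs_update` still false need neither an undo entry nor a redraw or normals update.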