## Updating smooth normals
`BKE_pbvh_update_normals` takes up 50% of the time in some cases. The use of atomics in `pbvh_update_normals_accum_task_cb` to add normals from faces to vertices is problematic. Ideally those should be avoided entirely, but it's not simple to do so. Possibilities:
* Figure out which vertices are shared with other nodes, and only use atomics for those.
* Store adjacency info for the entire mesh, and gather normals per vertex.
* Store adjacent faces of all border vertices of a node, and avoid a global vertex normal array and atomics.
* Use a vertex normal buffer per node with duplicated vertices, and somehow merge the border vertices in a second step.
Smoothing tools may also be able to benefit from this, as it would let them work without the overhead of storing adjacency info for the entire mesh.
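As a rough illustration of the gather approach (the second option above), here is a minimal sketch in C. The `VertFaceMap` layout and the `gather_vertex_normals` function are hypothetical names for illustration, not existing Blender code. The sum is unweighted and normalization is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR-style vertex-to-face adjacency (not a Blender struct).
 * The faces around vertex v are
 * face_indices[offsets[v]] .. face_indices[offsets[v + 1] - 1]. */
typedef struct VertFaceMap {
  const int *offsets;      /* totvert + 1 entries */
  const int *face_indices; /* offsets[totvert] entries */
} VertFaceMap;

/* Gather instead of scatter: every vertex sums the normals of its adjacent
 * faces and writes only to its own slot. Iterations of the outer loop are
 * independent, so they could be run in parallel without atomics.
 * Normals are stored as flat arrays with 3 floats per element. */
void gather_vertex_normals(const VertFaceMap *map,
                           const float *face_normals,
                           int totvert,
                           float *r_vert_normals)
{
  assert(map != NULL && face_normals != NULL && r_vert_normals != NULL);

  for (int v = 0; v < totvert; v++) {
    float sum[3] = {0.0f, 0.0f, 0.0f};
    for (int i = map->offsets[v]; i < map->offsets[v + 1]; i++) {
      const float *fn = &face_normals[3 * map->face_indices[i]];
      sum[0] += fn[0];
      sum[1] += fn[1];
      sum[2] += fn[2];
    }
    r_vert_normals[3 * v + 0] = sum[0];
    r_vert_normals[3 * v + 1] = sum[1];
    r_vert_normals[3 * v + 2] = sum[2];
  }
}
```

The trade-off is the one noted in the list: the adjacency arrays have to be stored and kept up to date for the whole mesh.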
## Coherent memory access
Each PBVH node contains a subset of the mesh vertices and faces. These are not contiguous, so iterating over all vertices in a node leads to incoherent memory access.
Two things we can do here:
* Ensure vertex and face indices in a node are at least sorted.
* Reorder vertex and face indices in the original mesh, so that all vertices and faces unique to a node are stored next to each other in the global mesh array. This could also be used to reduce PBVH memory usage, since not all indices would need to be stored then, only a range of unique vertices plus the indices of border vertices owned by other nodes.
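To make the reordering idea more concrete, here is a minimal sketch in C of what a node could store once its unique vertices are contiguous in the mesh arrays. `NodeVerts` and `node_attribute_sum` are hypothetical names, and the layout is an assumption rather than the current PBVH node structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout, assuming the mesh has been reordered so that
 * vertices unique to a node are contiguous in the global arrays. */
typedef struct NodeVerts {
  int unique_start;          /* first unique vertex in the global array */
  int unique_count;          /* number of unique vertices */
  const int *border_indices; /* vertices owned by other nodes */
  int border_count;
} NodeVerts;

/* Sum one float attribute (for example the mask) over all vertices that
 * a node references. */
float node_attribute_sum(const NodeVerts *node, const float *attribute)
{
  assert(node != NULL && attribute != NULL);
  float sum = 0.0f;

  /* Unique vertices: a single linear sweep over contiguous memory. */
  for (int i = 0; i < node->unique_count; i++) {
    sum += attribute[node->unique_start + i];
  }
  /* Border vertices: still indirect, but usually a small fraction. */
  for (int i = 0; i < node->border_count; i++) {
    sum += attribute[node->border_indices[i]];
  }
  return sum;
}
```

Only `border_indices` needs per-vertex storage here; the unique vertices are described by two integers.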
## Clipping and partial redraw
## Draw buffers
* Use `GPU_vertbuf_raw_step` to reduce the overhead of creating buffers in `gpu_buffers.c`.
* Submitting vertex buffers to the GPU has some overhead. It may be possible to do those copies asynchronously in the driver.
Tagging PBVH nodes as fully masked would let us skip iterating over their vertices for sculpt tools. Drawing code could also avoid storing a buffer in this case, though the overhead of allocating/freeing that often may not be worth it.
Tagging PBVH nodes as fully unmasked would let us quickly skip drawing them as part of the overlay.
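A minimal sketch in C of how such tags could be computed for one node. The `NodeMaskState` enum and `node_mask_state` function are hypothetical, and the sketch assumes mask values are floats in the range 0 to 1, where 1 means fully masked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node tag, not an existing PBVH flag. */
typedef enum NodeMaskState {
  NODE_MASK_MIXED,
  NODE_MASK_FULLY_MASKED,
  NODE_MASK_FULLY_UNMASKED,
} NodeMaskState;

/* Classify a node from the mask values of its vertices.
 * An empty node is reported as fully unmasked. */
NodeMaskState node_mask_state(const float *mask,
                              const int *vert_indices,
                              int totvert)
{
  assert(mask != NULL && vert_indices != NULL);
  int all_masked = 1;
  int all_unmasked = 1;

  for (int i = 0; i < totvert; i++) {
    const float m = mask[vert_indices[i]];
    if (m != 1.0f) {
      all_masked = 0;
    }
    if (m != 0.0f) {
      all_unmasked = 0;
    }
    if (!all_masked && !all_unmasked) {
      return NODE_MASK_MIXED; /* Early out, the result cannot change. */
    }
  }
  if (all_unmasked) {
    return NODE_MASK_FULLY_UNMASKED;
  }
  return NODE_MASK_FULLY_MASKED;
}
```

The tag would need to be recomputed whenever a mask tool touches the node, which could happen in the same pass that already updates the node after a stroke.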
Masks are currently drawn in a separate pass as part of the overlays. It would be more efficient to draw them along with the original faces, so we can draw faces just once.
## Consolidate vertex loops
There are various operations that loop over all vertices or faces: the sculpt brush operation, merging results for symmetry, bounding box updates, normal updates, draw buffer updates, etc.
Some of these may be possible to merge together, to reduce the overhead of threading and the cost of memory access and cache misses.
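As one possible example of merging, the sketch below fuses a bounding box update with copying positions into a draw buffer, so each vertex is read from memory once instead of twice. This is only an illustration in C: `Bounds` and `node_update_bounds_and_buffer` are hypothetical names, and the real draw buffer fill is more involved than a plain copy.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

typedef struct Bounds {
  float min[3];
  float max[3];
} Bounds;

/* Single pass over a node's vertices that both recomputes the bounding box
 * and copies positions into a (hypothetical) draw buffer.
 * Positions are flat arrays with 3 floats per vertex. */
void node_update_bounds_and_buffer(const float *positions,
                                   const int *vert_indices,
                                   int totvert,
                                   Bounds *r_bounds,
                                   float *r_draw_positions)
{
  assert(positions != NULL && vert_indices != NULL);
  assert(r_bounds != NULL && r_draw_positions != NULL);

  for (int k = 0; k < 3; k++) {
    r_bounds->min[k] = FLT_MAX;
    r_bounds->max[k] = -FLT_MAX;
  }
  for (int i = 0; i < totvert; i++) {
    const float *co = &positions[3 * vert_indices[i]];
    for (int k = 0; k < 3; k++) {
      if (co[k] < r_bounds->min[k]) {
        r_bounds->min[k] = co[k];
      }
      if (co[k] > r_bounds->max[k]) {
        r_bounds->max[k] = co[k];
      }
      r_draw_positions[3 * i + k] = co[k];
    }
  }
}
```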
## Bounding box frustum tests
This does not appear to be a bottleneck currently, but there are some possible optimizations here:
* For inner nodes detected to be fully contained in the frustum, skip tests for all child nodes.
* Multithreaded tree traversal. These tests are single threaded in most cases now.
* Cache visibility of nodes per viewport
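The first item can be sketched as follows in C. This is a minimal illustration with a hypothetical `BVHNode` layout and function names, not the existing PBVH traversal code. It assumes frustum planes are stored as 4 floats each (normal and offset) with the inside on the non-negative side.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FRUSTUM_OUTSIDE, FRUSTUM_INTERSECT, FRUSTUM_INSIDE } FrustumResult;

/* Hypothetical node layout: children are array indices, -1 for none. */
typedef struct BVHNode {
  float bb_min[3], bb_max[3];
  int left, right;
  int visible;
} BVHNode;

/* Classify a box against planes where a point p is inside a plane when
 * dot(plane.xyz, p) + plane.w >= 0. */
FrustumResult aabb_frustum_test(const float bb_min[3], const float bb_max[3],
                                const float *planes, int totplane)
{
  FrustumResult result = FRUSTUM_INSIDE;
  for (int p = 0; p < totplane; p++) {
    const float *pl = &planes[4 * p];
    float near_d = pl[3], far_d = pl[3];
    for (int k = 0; k < 3; k++) {
      const float a = pl[k] * bb_min[k], b = pl[k] * bb_max[k];
      near_d += (a < b) ? a : b; /* corner furthest behind the plane */
      far_d += (a < b) ? b : a;  /* corner furthest in front of it */
    }
    if (far_d < 0.0f) {
      return FRUSTUM_OUTSIDE;
    }
    if (near_d < 0.0f) {
      result = FRUSTUM_INTERSECT;
    }
  }
  return result;
}

/* Tag a whole subtree without running any frustum tests. */
void mark_subtree(BVHNode *nodes, int index, int visible)
{
  if (index < 0) {
    return;
  }
  nodes[index].visible = visible;
  mark_subtree(nodes, nodes[index].left, visible);
  mark_subtree(nodes, nodes[index].right, visible);
}

/* Only descend with tests when a node straddles the frustum boundary.
 * r_tests counts how many box tests were actually performed. */
void update_visibility(BVHNode *nodes, int index,
                       const float *planes, int totplane, int *r_tests)
{
  assert(nodes != NULL && planes != NULL && r_tests != NULL);
  if (index < 0) {
    return;
  }
  BVHNode *node = &nodes[index];
  (*r_tests)++;
  const FrustumResult r = aabb_frustum_test(node->bb_min, node->bb_max, planes, totplane);
  if (r != FRUSTUM_INTERSECT) {
    mark_subtree(nodes, index, r == FRUSTUM_INSIDE);
    return;
  }
  node->visible = 1;
  update_visibility(nodes, node->left, planes, totplane, r_tests);
  update_visibility(nodes, node->right, planes, totplane, r_tests);
}
```

The same early out applies to nodes that are fully outside, which is why `mark_subtree` takes the visibility as a parameter.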
It may be worth testing if the current settings for `BLI_parallel_range_settings_defaults` are still optimal. Maybe the node limit can be removed, the chunk size could be reduced or increased, or scheduling could be dynamic instead of static.