I am working on a presentation about writing code to boost execution performance. I have chosen lattice_deform as a test-ground for this.
This patch is the result of several experiments to increase the execution performance of lattice deformation.
1. Adds test cases to compare performance against the old implementation. The tests differ in the number of verts to transform and the batch size.
2. The old implementation calculated one vert at a time and then released static data that could be shared with other verts.
3. Uses branchless code tricks to minimize branches.
4. Uses a phased approach to reduce inner loop complexity.
5. Uses batching to reduce the memory cache demand.