
OpenSubdiv GPU acceleration
Confirmed, Normal | Public | To Do

Assigned To
None
Authored By
Dalai Felinto (dfelinto)
Aug 21 2019, 4:22 PM

Description

Status: Needs to be formatted as a project once there is someone to tackle this, including use cases, milestones, task breakdown, etc.


We want to have a per-object subdivision setting that operates on top of the entire modifier stack. The options would be the same as (or very similar to) the existing modifier when using Catmull-Clark. If the last modifier in the stack is a Subdivision, a heuristic can take care of reconciling both results.

Said subdivision in the viewport is to be performed on the GPU. For rendering it would also use OpenSubdiv but on the CPU.

  • Note that this needs a more in-depth design and discussion between the GPU/Viewport team and @Sergey Sharybin (sergey)

Event Timeline

Seby (1seby) added a subscriber: Seby (1seby).

May we please get a status update from a developer on this task? It is my understanding that the two main reasons for implementing OpenSubdiv in Blender were:

  • to enable rendering of SubD models at much higher speed and subdivision levels without an increased memory footprint
  • to add support for proper creasing on SubD models

Of the two points only the second one is implemented, and only partially. Modelling of SubD assets using OpenSubdiv creases is well supported, but the algorithm needs higher subdivision levels to work properly, which makes the actual rendering of scenes that contain more than a few such assets virtually impossible due to the increased memory consumption (also, creases do not work properly with adaptive subdivision, and vertex creasing is not implemented at all).

I appreciate the work you developers are doing and understand you cannot fix every problem at once. However, this task has been open since last August, while the actual problems with the current implementation have been known since before the 2.8 release. Yet it seems like nothing is happening on that front. Could we please get some feedback from a developer on the current status of fixing this issue?

Thanks again for all your ongoing work on this great piece of open source software.

The current SubD modifier slows everything down.
In addition to animation playback (shapes and bones), modeling, cloth simulation and shape creation are slowed down; you have to do everything without viewing the SubD in the viewport. With a heavy mesh + SubD everything gets jerky. We hope that you can solve this in 2.84 or 2.85 at the latest, thanks!

lucas veber (lucky3) added a comment. Edited Feb 19 2020, 4:49 PM

Why is it lower priority than T68908?
This would allow realtime playback and realtime posing (a mesh cacher can't do that) of rigged characters with subdivision, which is quite essential when animating facial expressions for example.
Typically, realtime subsurf calculation would avoid mesh caching in many cases, which would be a significant benefit.

Fast SubDiv is a standard that is difficult to give up on.

For 2.81 and 2.82, many users acknowledged that you had a lot of loose ends and technical debt to clean up, so we were patient.

Now, and especially with the tracker curfew completing phase 1, users want action. On BA, you have users threatening to abandon Blender or to write Blender off as an app with no future. I propose that these regressions get tackled for 2.83, and that 2.83 itself be delayed as long as needed to make sure it at least has subsurf at 2.79 performance or better. Based on what I've read on this site, the core team knows where the bottlenecks are and what could be causing them, so any inaction here will simply be the result of bad priorities and poor management decisions.

There are no mid-range apps under active development, and the only equivalent apps cost over 1K with pricey subscriptions. A lot of people have their very ability to work with CGI tied to Blender, so please don't let them down.

To be clear, there are multiple performance projects for 2020:

  • Faster high-poly mesh editing
  • Faster animation playback
  • Faster object mode performance

These all have equal priority and will mostly be worked on by different developers in parallel. High-poly mesh editing and animation playback are both affected by subdivision surfaces, and performance will be looked at in the context of both.

I'm removing the last line from the description since it only adds confusion and is not accurate in general; it depends on the specific use case. For some heavy rigs subdivision surfaces might not be the first concern, while for other rigs it may be what is holding back performance.

This comment was removed by lucas veber (lucky3).

There is no update. The team is overloaded with a lot of other projects. It is still a planned-to-be-worked-on project, but there is no time allocated for it as far as I know. As soon as there is anything, you'll see it reflected in this task.

P.S. A bump comment an hour after asking for a status update. C'mon ;)

This comment was removed by lucas veber (lucky3).

For simple subdivide I have a method that could accelerate things significantly.

this only works for quads / triangles - so 'triangulate' would need to be applied to remove ngons.

basically we take a quad (square) and use point-on-line to compare a subdivided patch of quads vs. the plane one time,
then use this data to 'emit' a patch that is skewed onto the quad

for triangle I use barycentric transform to compare the points

I use this method currently to emit meshes and join them - (expensive!)

but if one simply created a new mesh doing the same stuff in C it should be really really fast.

side note - it can emit shapes in 3d as well - so it could be the basis of a new modifier
https://www.youtube.com/watch?v=E3KGYRnTLak
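
A minimal sketch of the mapping step described above (my reading of it, with made-up names; not the actual script): the points of a pre-subdivided unit patch are computed once, then each parametric point is mapped onto the target face, bilinearly for a quad and barycentrically for a triangle.

  // Sketch: map points of a pre-subdivided unit patch onto an arbitrary face.
  struct Vec3 { float x, y, z; };

  static Vec3 lerp(const Vec3 &a, const Vec3 &b, float t)
  {
    return {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t};
  }

  // Quad case: bilinear interpolation of the four corners at parametric (u, v).
  Vec3 map_to_quad(const Vec3 q[4], float u, float v)
  {
    return lerp(lerp(q[0], q[1], u), lerp(q[3], q[2], u), v);
  }

  // Triangle case: barycentric combination of the three corners (w0 + w1 + w2 = 1).
  Vec3 map_to_triangle(const Vec3 t[3], float w0, float w1, float w2)
  {
    return {t[0].x * w0 + t[1].x * w1 + t[2].x * w2,
            t[0].y * w0 + t[1].y * w1 + t[2].y * w2,
            t[0].z * w0 + t[1].z * w1 + t[2].z * w2};
  }

Doing this per face in C, with the unit-patch points precomputed, is the "emit a skewed patch" step; the expensive emit-and-join-meshes part would then disappear.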

The "bump" comment witnesses the popularity of this request

That is not how we prefer to work, though. For popularity there are the tokens.
Such comments introduce unnecessary noise to the communication, making it harder to follow and (re)read the conversation.

For simple subdivide I have a method that could accelerate things significantly.

Simple subdivision should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier.

There are many possible optimizations, but please stay on topic. In this case that is the GPU-side integration.

Hello,

Sorry for the long post; I hope you guys find it as interesting as I did when actually learning all this...

I wrote a prototype that uses the CUDA implementation of Opensubdiv and I wanted to share my results and experience.
I am sure some of the things I am going to share are obvious to the official Devs, but I feel they could be a good source of discussion.
I also hope they could help the people in this thread to understand why it is not so easy to address this issue and why it will take some time.

A quick disclaimer about me, since this is my first post here.
I am not affiliated in any form with Blender and its development.
I am just a hobbyist who wants to understand a bit more about the internals of this wonderful project.
I consider myself fairly experienced with C/C++, GPU programming and high performance computing in general.

Back to Opensubdiv on GPU:
The bottom line is that there is much more to it than just calling the OpenSubdiv CUDA back-end (or any other GPU back-end) to benefit from GPU acceleration.
Ultimately, after several optimizations I have got something in the range of ~1.5x speed-up on my laptop for the entire end-to-end subdivision process with respect to the CPU based version, but I still feel it is not enough and certainly not a game changer as I had hoped.

These are some of the problems I have encountered:

A. Limited use of Opensubdiv:
OpenSubdiv is only a fraction of what happens during the entire "BKE_subdiv_to_mesh" (which is the end-to-end subdivision process).
For the test case I was evaluating (more about the test case below) I estimate that less than 40-50% of the entire time is spent in Opensubdiv.
This means that even if we made the entire OpenSubdiv code infinitely fast, we would get only a 2x speedup end-to-end (Amdahl's law sucks!).
I am not sure if more code can be ported to use Opensubdiv, but I see a lot of interpolations, copies, etc. which are not strictly related to Opensubdiv.
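
For context, a rough Amdahl's law sketch of that bound (assuming the ~50% figure above): end-to-end speedup = 1 / ((1 - p) + p / s), where p is the fraction of time spent in OpenSubdiv and s is how much that fraction is accelerated. With p = 0.5, even s → ∞ gives only 1 / (1 - 0.5) = 2x, and the ~1.5x overall result mentioned above is roughly consistent with p ≈ 0.5 and s ≈ 3 for the OpenSubdiv portion.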

B. Batching:
All the calls to Opensubdiv are currently performed at the granularity of a single vertex, both for the "evaluation" part as well as for the "updates" parts.
The "evaluations" translate to CUDA kernel launches, and "updates" translate to cudaMemcpys in the CUDA back-end (similarly in the other Opensubdiv GPU back-ends).

So just naively calling the OpenSubdiv CUDA back-end with the rest of the code as it is makes the whole thing cripplingly slow, with thousands of minuscule kernel calls and device cudaMemcpys (a 100x slow-down).
In order to work around this there is some relatively large refactoring involved, which consists of transforming all calls in the subdiv_foreach.c and subdiv_mesh.c files to operate in batches before calling the "updates" and "evaluations" of the OpenSubdiv library. Forgetting to batch even a few of the calls makes the entire approach worthless.

This refactoring involves changing the routines to iterate on the original vertex/edge/loop structures to populate large temporary buffers (as well as creating some output buffers for the result of the OpenSubdiv calculations).
After that, the OpenSubdiv routines can be called in a few shots.
However, this still requires copying the input data to the GPU, calling the evaluation kernel, copying the output data from the GPU, and then iterating on the output buffers to copy the results back into the original structures.

With these changes, I managed to get only 4 OpenSubdiv calls for the single-threaded part in "subdiv_foreach_single_thread_tasks" and 2 extra OpenSubdiv calls for each parallel CPU thread operating on "subdiv_foreach_task".
In my system with 4 cores (8 CPU threads are started) it results in 4 + 8 threads * 2 = 20 Opensubdiv calls (for an end-to-end subdivision).
I can force only one CPU thread to work on "subdiv_foreach_task", which results in only 4 + 1 thread * 2 = 6 Opensubdiv calls.
However, while this makes better use of the OpenSubdiv library, it makes the end-to-end solution slower because, as explained in point A above, there is a lot of work that needs to be done which is not strictly in OpenSubdiv (and more CPU threads are better for that).
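
To make the batching pattern concrete, here is a minimal sketch with placeholder names (the real OpenSubdiv evaluator and Blender data structures are different): gather the coarse data into one large buffer, do one copy to the GPU, one kernel launch, and one copy back, instead of per-vertex launches and memcpys.

  #include <cuda_runtime.h>
  #include <vector>

  struct Vec3f { float x, y, z; };

  // Placeholder kernel standing in for OpenSubdiv's batched evaluation;
  // it only copies positions so that the sketch stays self-contained.
  __global__ void eval_subdiv_batch(const Vec3f *in, Vec3f *out, int count)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
      out[i] = in[i];
    }
  }

  // One host-to-device copy, one launch, one device-to-host copy per batch.
  void subdivide_batched(const std::vector<Vec3f> &coarse, std::vector<Vec3f> &refined)
  {
    refined.resize(coarse.size());  // Real code sizes this from the refined topology.
    const size_t bytes = coarse.size() * sizeof(Vec3f);

    Vec3f *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);

    cudaMemcpy(d_in, coarse.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (int)((coarse.size() + threads - 1) / threads);
    eval_subdiv_batch<<<blocks, threads>>>(d_in, d_out, (int)coarse.size());

    cudaMemcpy(refined.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
  }

The point is only the call structure: subdiv_foreach would fill the "coarse" buffer while iterating, and scatter "refined" back into the Mesh afterwards.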

Anyway, with the changes described I got some hope but still I was not able to beat the default CPU implementation.

C. GPU allocator
I then realized there was an issue with the cudaMalloc allocator.
For each call to the OpenSubdiv GPU side, temporary GPU buffers need to be created (essentially the mirror of the buffers described above), and a cudaMalloc needs to be performed (and a cudaFree after the evaluation is done).
This is ultimately very slow, so I had to design a custom allocator in which CUDA buffers are reused for the entire life of the CUDA evaluator.
There is also a possible optimization (which I have not done) that involves creating the temporary CPU buffers using pinned memory (cudaMallocHost()), which should speed up some of the CPU<->GPU transfers.
I will try this later if I have some time.
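
A rough sketch of the buffer-reuse idea (not the actual patch): keep the device buffer alive for the lifetime of the evaluator and only reallocate when a larger size is requested, so the per-call cudaMalloc/cudaFree pair leaves the hot path. A pinned-memory variant would use cudaMallocHost()/cudaFreeHost() the same way for the host-side staging buffers.

  #include <cuda_runtime.h>
  #include <cstddef>

  // Grow-only device buffer reused across evaluator calls.
  class ReusableDeviceBuffer {
    void *ptr_ = nullptr;
    size_t capacity_ = 0;

   public:
    // Returns a device pointer of at least `size` bytes; reallocates only on growth.
    void *acquire(size_t size)
    {
      if (size > capacity_) {
        if (ptr_) {
          cudaFree(ptr_);
        }
        cudaMalloc(&ptr_, size);
        capacity_ = size;
      }
      return ptr_;  // Contents are undefined; the caller fills it via cudaMemcpy.
    }

    ~ReusableDeviceBuffer()
    {
      if (ptr_) {
        cudaFree(ptr_);  // Freed once, when the CUDA evaluator is destroyed.
      }
    }
  };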

Conclusions:
Ultimately, on my laptop (4-core i7-6820HQ with an M1000M GPU), a viewport subdivision at level 3 of a torus with 48 major segments and 12 minor segments goes from ~90 ms with CPU-based OpenSubdiv to ~60 ms with CUDA-based OpenSubdiv.
Again, I don't call this a game changer, and given the added complexity of batching it is questionable whether it is the right approach (unless more work can be pushed to OpenSubdiv).

I also wanted to share a profile (collected with NVIDIA Nsight Systems) of what is happening in the 60 ms of end-to-end subdivision on my laptop.
At the bottom of the plot you can see the GPU activity.
"Memory" are the copy-in (green bars) and copy-out (red bars) to/from the GPU.
"Kernels" are the actual invocation to the main CUDA kernel of the Opensubdiv library (20 blue blocks corresponding to the the 20 calls I described above).
You can see that there is a lot of CPU activity "black color" which is not part of Opensubdiv and even if we made Opensubdiv infinitely fast we would still get 40-50ms to execute the end-to-end subdivision.

I hope you found this interesting, let me know if you have some questions.
I can also push a branch somewhere if someone is interested; I would need to do some cleanup first.
Also, I did not implement all the possible cases (especially when it comes to ngons), so it may not work as expected on all scenes.

Regards,

JQuiver

@Jack Quiver (jackquiver) Thanks a lot for sharing your research on this; I'm pretty sure many users would be interested in testing it, even if it's a work in progress. Patches can be sent here:
https://developer.blender.org/differential/diff/create/
According to the description, the initial plan of this task was to make surface subdivision a mesh setting for GPU evaluation, instead of a modifier, probably to overcome the bottlenecks that you mention. Maybe Blender developers could be interested in collaborating with you on it.

The plan here indeed is to push the data to the GPU and keep it there for drawing. As you found, the CPU-GPU transfer has high latency, which makes transferring data back and forth between CPU and GPU not great for realtime.

I'm not sure we would add an approach that relies on such transfers.

Thanks for the feedback.

Keeping the data entirely on the GPU for subdivision and subsequent rendering would be ideal.
However, I am not sure it can all be achieved with OpenSubdiv alone; it may require some extra custom CUDA kernels to work on the mesh before display.

Also, something to keep in mind is that the Opensubdiv CUDA kernel (or OpenCL) is very poorly optimized (see here https://github.com/PixarAnimationStudios/OpenSubdiv/blob/9be5a298270a42d89d7925c8cdd9e335869ec6ce/opensubdiv/osd/cudaKernel.cu#L259). I would not use that kernel as it is in any performance critical project. Also, the last update on that code was 2 years ago.

I wonder what it means for Blender to rely on the OpenSubdiv library as it is.
I would like to hear some opinions.

Blender 2.79 had GPU subdivision implemented in a way that data would stay on the GPU.

I think that most likely there is no point in using CUDA or OpenCL backends, we might as well use an OpenGL or future Vulkan backend.

So, armature skinning on the CPU should be done in the stack if you need feedback from the shader.

For playblasts, a 'GPU skinning algorithm' should happen 'after the stack' with no feedback.

https://github.com/tcoppex/aer-engine/blob/master/demos/aura/data/shaders/Skinning.glsl
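
For reference, linear blend skinning on the GPU looks roughly like this (a generic CUDA-style sketch of the technique, not the shader from the linked repo; the buffer layout is an assumption, with four bone influences per vertex and 4x4 row-major matrices):

  #include <cuda_runtime.h>

  __global__ void skin_vertices(const float4 *rest_positions,
                                const int4 *bone_indices,
                                const float4 *bone_weights,
                                const float *bone_matrices,  // 16 floats per bone
                                float4 *out_positions,
                                int vertex_count)
  {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= vertex_count) return;

    float4 p = rest_positions[v];
    int idx[4] = {bone_indices[v].x, bone_indices[v].y, bone_indices[v].z, bone_indices[v].w};
    float w[4] = {bone_weights[v].x, bone_weights[v].y, bone_weights[v].z, bone_weights[v].w};

    float4 result = {0.f, 0.f, 0.f, 1.f};
    for (int k = 0; k < 4; k++) {
      const float *m = bone_matrices + idx[k] * 16;
      // Transform the rest position by this bone's matrix, weighted by its influence.
      result.x += w[k] * (m[0] * p.x + m[1] * p.y + m[2] * p.z + m[3]);
      result.y += w[k] * (m[4] * p.x + m[5] * p.y + m[6] * p.z + m[7]);
      result.z += w[k] * (m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11]);
    }
    out_positions[v] = result;
  }

Running something like this after the modifier stack, with no feedback to the CPU, is what 'GPU skinning for playblasts' would amount to.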

Maybe we can set up a way for users to do this in Python if it's not going to be supported in master?

I was thinking that each material could save its shader sources:

material.vert_shader
material.frag_shader

and we could have comments in there to find / replace:

at the top, near the vertex uniforms:
//Begin_Vert_uniforms

just before the final output, where we place our offsets (skinning etc.):
//Begin_Vert_offsets

Then the user can find / replace these comments with working code, compile the shader, and replace the material with it.