Continue rendering after CUDA-Crash
Needs Revision (Public)

Authored by Benjamin Meyer (Anvilarion) on Jul 22 2019, 9:04 PM.

Details

Summary

After getting a lot of errors of the cuCtxSynchronize() kind (the likes of T67350 etc.), I decided to look into the code where these errors come from. Even though I could not fix them, it struck me as odd that Blender would crash even though the final image could still be rendered without the GPU (albeit more slowly). So I implemented a way for the GPU to report a fatal error and thus be excluded from further rendering, instead of crashing Blender and discarding all progress made.
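To illustrate the idea of excluding a failed device, here is a minimal sketch. It is not the code from this patch: `RenderDevice`, `render_tile` and `render_with_fallback` are hypothetical names, and the real Cycles device and tile scheduling code is structured differently.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a render device; this is not the Cycles Device class.
struct RenderDevice {
  std::string name;
  bool failed;                                // set once a fatal error was reported
  std::function<bool(int tile)> render_tile;  // returns false on a fatal error
};

// Renders tiles [0, num_tiles) round-robin and returns how many were completed.
// A device that reports a fatal error is excluded from all further work and
// the tile it failed on is retried on the next usable device.
inline int render_with_fallback(std::vector<RenderDevice> &devices, int num_tiles)
{
  int done = 0;
  std::size_t next = 0;  // round-robin cursor into devices
  for (int tile = 0; tile < num_tiles; ++tile) {
    bool rendered = false;
    for (std::size_t tries = 0; tries < devices.size() && !rendered; ++tries) {
      RenderDevice &dev = devices[next];
      next = (next + 1) % devices.size();
      if (dev.failed) {
        continue;  // already excluded
      }
      if (dev.render_tile(tile)) {
        rendered = true;
      }
      else {
        dev.failed = true;  // exclude instead of crashing
      }
    }
    if (!rendered) {
      break;  // no usable device left
    }
    ++done;
  }
  return done;
}
```

With one GPU entry whose `render_tile` starts failing and one CPU entry that always succeeds, all tiles still complete and only the GPU entry ends up marked as failed.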

I would of course be willing to make adjustments based on your feedback if you are interested in this kind of error handling.

In addition to that I copied over the comments from my cuda.h-File to better describe (and complete) the CUDA-Errors.

Event Timeline

That is indeed not cool that Blender crashes and fixing those is absolutely a good thing.

I don't think it is enough to report failures to the terminal. The terminal window is not easily accessible on Linux and Mac unless Blender was started from a terminal (which isn't the case for most users), and on Windows users don't tend to check console messages either.
Such things should be reported to the interface, where users can see them.

I am also split on whether this is the right direction to go. On the one hand, it is indeed better to simply kick the failed device out of rendering instead of stopping the render and making the user disable the GPU, switch to CPU and restart rendering (which will do all those synchronization steps again).
On the other hand, there are render farms which provide an option to render on GPU. I don't think it's a great idea to make renders much slower in such cases; it is better to abort immediately and inform the user that the scene cannot be rendered on GPU.
Maybe as an option, which is default for regular users but disabled by the farms?

In addition to that I copied over the comments from my cuda.h-File to better describe (and complete) the CUDA-Errors.

Unfortunately, you cannot do it this way. This is an automatically generated wrangler, so all manual edits will be lost.
This would require making changes to the generator script, but I am not sure how to actually do that, because the script tokenizes pre-processed headers and all comments are lost by the preprocessor. Rewriting the generator script to use clang's tokenizer would probably be a solution here.

Anyway, the main repository for the wrangler is here https://github.com/CudaWrangler/cuew

I don't think it is enough to report failures to the terminal. The terminal window is not easily accessible on Linux and Mac unless Blender was started from a terminal (which isn't the case for most users), and on Windows users don't tend to check console messages either.
Such things should be reported to the interface, where users can see them.

You mean like the overlay boxes which appear for example when Blender encounters an error in a python script? Or do you have something else in mind?

Maybe as an option, which is default for regular users but disabled by the farms?

I wrote a small program which (among other things) starts Blender via cli and restarts the process when CUDA crashes and the render process is below a certain threshold. Maybe such a threshold might also be an option.
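The option and the threshold discussed here could be combined into one decision, sketched below. Everything in it is an assumption for illustration: the names `FailurePolicy`, `FailureAction` and `decide_on_gpu_failure` do not exist in Blender, and the threshold is read as "fraction of the render already completed".

```cpp
// Hypothetical outcome of a GPU failure during rendering.
enum class FailureAction {
  RestartFromScratch,          // little progress lost, start over
  ContinueOnRemainingDevices,  // keep the progress, finish more slowly
  AbortRender                  // stop and report the error
};

// Hypothetical settings; none of these are existing Blender options.
struct FailurePolicy {
  bool allow_fallback;       // a render farm would set this to false
  double restart_threshold;  // progress fraction in [0, 1]
};

// Decides what to do after a device failed with a fatal error.
inline FailureAction decide_on_gpu_failure(const FailurePolicy &policy,
                                           double progress,
                                           int remaining_devices)
{
  if (!policy.allow_fallback || remaining_devices <= 0) {
    return FailureAction::AbortRender;
  }
  if (progress < policy.restart_threshold) {
    return FailureAction::RestartFromScratch;
  }
  return FailureAction::ContinueOnRemainingDevices;
}
```

A farm would disable `allow_fallback` so that any GPU failure aborts the job, while a desktop default could allow the fallback and only restart when very little has been rendered so far.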

In addition to that I copied over the comments from my cuda.h-File to better describe (and complete) the CUDA-Errors.

Unfortunately, you cannot do it this way. This is an automatically generated wrangler, so all manual edits will be lost.
This would require making changes to the generator script, but I am not sure how to actually do that, because the script tokenizes pre-processed headers and all comments are lost by the preprocessor. Rewriting the generator script to use clang's tokenizer would probably be a solution here.
Anyway, the main repository for the wrangler is here https://github.com/CudaWrangler/cuew

The last commit says "Update wrangler to latest CUDA Toolkit 9.2". This at least explains why my cuda.h has more error codes than the cuew.h in Blender. Wouldn't it be possible to simply make a verbatim copy of the enum instead of reconstructing it from an AST? As far as I can see, errors are handled in a special way anyway.

To avoid a deadlock: What is the process here now? The way I see it is that I have to wait for the verdict of the other reviewers you added, right?

You mean like the overlay boxes which appear for example when Blender encounters an error in a python script? Or do you have something else in mind?

Using the reports system is what I had in mind, similar to the message shown when hitting F12 in a scene without a camera.

On closer look, it seems that it is enough to simply keep the code around error_msg in cuda_assert.
That way the error message will be reported via Device::error_message, which through some indirection gets reported to the interface (Session does progress.set_error(), which then ends up in b_engine.error_set()).
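The reporting chain described above can be pictured with the following simplified sketch. Only the names `error_msg`, `error_message`, `set_error` and `error_set` come from the comment; the three structs are hypothetical stand-ins and do not reflect the real Cycles or Blender classes.

```cpp
#include <string>

// Stand-in for the render engine side that shows messages in the interface.
struct EngineSketch {
  std::string shown;
  void error_set(const std::string &message) { shown = message; }
};

// Stand-in for the session progress object.
struct ProgressSketch {
  std::string error;
  bool cancelled = false;
  void set_error(const std::string &message)
  {
    error = message;
    cancelled = true;  // assumed here: an error also stops the render
  }
};

// Stand-in for a device that records the first error it encounters.
struct DeviceSketch {
  std::string error_msg;
  void report(const std::string &message)
  {
    if (error_msg.empty()) {
      error_msg = message;  // keep the first error, later ones are usually follow-ups
    }
  }
  const std::string &error_message() const { return error_msg; }
};

// Forwards a pending device error to progress and engine; returns true if one was forwarded.
inline bool forward_device_error(const DeviceSketch &device,
                                 ProgressSketch &progress,
                                 EngineSketch &engine)
{
  if (device.error_message().empty()) {
    return false;
  }
  progress.set_error(device.error_message());
  engine.error_set(progress.error);
  return true;
}
```

The point of the sketch is only that the device stores the message rather than printing it, so the layers above can decide how to show it to the user.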

I wrote a small program which (among other things) starts Blender via cli and restarts the process when CUDA crashes and the render process is below a certain threshold. Maybe such a threshold might also be an option.

What I was trying to say: render farms should be able to make Blender stop rendering as soon as the GPU fails with any error, instead of becoming slower and wasting a lot of the users' time credits.

Wouldn't it be possible to simply make a verbatim copy of the enum instead of reconstructing it from an AST? As far as I can see, errors are handled in a special way anyway.

It is certainly possible, but I am not sure why this enum deserves more attention than any other enum or any function. It is a wrangler after all; for development you really need to consult the development documentation.

I would say either find a way to preserve all comments, or simply don't bother with this. It is not as if the OpenGL wrangler contains comments about OpenGL functions, and this has never been a problem.

To avoid a deadlock: What is the process here now? The way I see it is that I have to wait for the verdict of the other reviewers you added, right?

I wouldn't mind hearing others' opinions here. Personally, I think that if the reporting to the interface is restored/ensured and there is a way for render farms to stop rendering, it's fine to have this.

Brecht Van Lommel (brecht) requested changes to this revision.Jul 23 2019, 4:33 PM

I don't think this is the right direction to go in. If CUDA reports an error, then rendering should stop immediately. Ignoring these kinds of errors is going to make performance really unpredictable for users and will make it harder for us to get good bug reports to fix the underlying issues.

We should figure out why Blender requires a restart when certain CUDA errors happen and try to fix that. And fix whatever issues are causing these CUDA errors.

This revision now requires changes to proceed.Jul 23 2019, 4:33 PM

We should figure out why Blender requires a restart when certain CUDA errors happen and try to fix that.

To answer this question: It says in the above comments (which is the source for the list in isFatal):

/*
While executing a kernel, the device encountered a
load or store instruction on an invalid memory address.
This leaves the process in an inconsistent state and any further CUDA work
will return the same error. To continue using CUDA, the process must be terminated
and relaunched.
*/
CUDA_ERROR_ILLEGAL_ADDRESS = 700,

So fixing that would require running CUDA in its own process which can be terminated without relaunching Blender.
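For reference, a check in the spirit of the isFatal list could look like the sketch below. It is not the function from the patch. The enum is a local mirror so that the example compiles without cuda.h; only the value 700 is quoted in this thread, and the other values are given as I remember them from the public cuda.h and should be verified against the installed header.

```cpp
namespace sketch {

// Local mirror of a few CUresult values (assumption: verify against cuda.h).
enum Result {
  CUDA_SUCCESS = 0,
  CUDA_ERROR_OUT_OF_MEMORY = 2,
  CUDA_ERROR_ILLEGAL_ADDRESS = 700,
  CUDA_ERROR_HARDWARE_STACK_ERROR = 714,
  CUDA_ERROR_ILLEGAL_INSTRUCTION = 715,
  CUDA_ERROR_MISALIGNED_ADDRESS = 716,
  CUDA_ERROR_INVALID_ADDRESS_SPACE = 717,
  CUDA_ERROR_INVALID_PC = 718,
  CUDA_ERROR_LAUNCH_FAILED = 719
};

// Returns true for errors whose header comment says that the process is left
// in an inconsistent state and must be terminated and relaunched.
inline bool is_sticky_error(Result result)
{
  switch (result) {
    case CUDA_ERROR_ILLEGAL_ADDRESS:
    case CUDA_ERROR_HARDWARE_STACK_ERROR:
    case CUDA_ERROR_ILLEGAL_INSTRUCTION:
    case CUDA_ERROR_MISALIGNED_ADDRESS:
    case CUDA_ERROR_INVALID_ADDRESS_SPACE:
    case CUDA_ERROR_INVALID_PC:
    case CUDA_ERROR_LAUNCH_FAILED:
      return true;
    default:
      return false;
  }
}

}  // namespace sketch
```

An out-of-memory error, by contrast, is not in this list: the context stays usable and a retry with less memory pressure may succeed.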

Crazy idea: introduce cuewExit(), which will unload the CUDA library, and then do cuewInit() again. Maybe that will make CUDA usable without restarting Blender?

Crazy idea: introduce cuewExit(), which will unload the CUDA library, and then do cuewInit() again. Maybe that will make CUDA usable without restarting Blender?

I'll try this right away. If this is the solution to all the CUDA-Errors it sounds far too simple. :-)