NVIDIA issue prevents full indirect draw call batching performance #70011

Closed
opened 2019-09-18 12:01:29 +02:00 by Sergey Sharybin · 57 comments

System Information
Operating system: Linux-5.2.0-2-amd64-x86_64-with-debian-bullseye-sid 64 Bits
Graphics card: Quadro RTX 6000/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 430.40

Blender Version
Broken: version: 2.81 (sub 11), branch: master, commit date: 2019-09-18 08:49, hash: e2cbf8b117
Worked: (optional)

Short description of error
Going to edit mode makes the links in the node editor disappear.

Exact steps for others to reproduce the error
Open attached file, go to object's edit mode.

This seems to be related to the /* Add link to batch. */ code path. Forcing the /* Draw single link. */ code path makes all links display properly.

untitled.blend: https://archive.blender.org/developer/F7755205/untitled.blend

Added subscriber: @Sergey

#70311 was marked as duplicate of this issue

#70275 was marked as duplicate of this issue

#70271 was marked as duplicate of this issue

#70212 was marked as duplicate of this issue

#70097 was marked as duplicate of this issue

#70008 was marked as duplicate of this issue

#70024 was marked as duplicate of this issue

Added subscribers: @WackyAman, @RobTuytel

Added subscriber: @brecht

This is only a display issue, but highly confusing. It should be fixed asap.

Clément Foucault was assigned by Brecht Van Lommel 2019-09-18 17:31:38 +02:00

This was caused by 3a08153d7a.

I cannot reproduce on any of my setups (even linux + nv430.40). I'm trying to figure out what's happening.

Added subscribers: @alonabrany, @fclem, @JacquesLucke

OK, both Linux + GTX 960M (driver 430.40) and Win 10 x64 + GTX 960 with the latest drivers work correctly. So I'm guessing it's an issue with the RTX card driver.

The issue is caused by glMultiDrawElementsIndirect and glMultiDrawArraysIndirect, which lead to apparent issues in rendering instancing draw calls (glDrawElementsInstanced, glDrawArraysInstanced and the like).

Cleaning up the mapping (unmapping before changing context or issuing another draw) does not fix it.
Removing the MultiDrawIndirect calls fixes it.

I'll report the issue to NVIDIA.

EDIT: Something I forgot to mention: no error is generated by the GL, either through the debug facility (--debug-gpu) or when manually querying glGetError().
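
For readers less familiar with indirect drawing, here is a rough Python sketch of what "removing the MultiDrawIndirect calls" amounts to: decoding each command record on the CPU so that one regular instanced draw can be issued per record. It does not call OpenGL and is not Blender's code. The field order follows the DrawElementsIndirectCommand structure from the OpenGL specification; the names `DrawElementsCmd` and `expand_multi_draw` and the sample values are made up for illustration.

```python
import struct
from collections import namedtuple

# Field order of DrawElementsIndirectCommand as described in the OpenGL spec.
DrawElementsCmd = namedtuple(
    "DrawElementsCmd",
    "count instance_count first_index base_vertex base_instance",
)

# Five tightly packed 32-bit values; little-endian is an assumption here,
# and base_vertex is treated as signed.
CMD_FORMAT = "<IIIiI"
CMD_SIZE = struct.calcsize(CMD_FORMAT)


def expand_multi_draw(buffer, draw_count, stride=0):
    """Decode `draw_count` command records from `buffer`.

    This mirrors the per-command loop that a multi-draw-indirect call is
    specified to be equivalent to: each record would become one instanced draw.
    """
    step = stride or CMD_SIZE  # a stride of 0 means tightly packed records
    return [
        DrawElementsCmd(*struct.unpack_from(CMD_FORMAT, buffer, i * step))
        for i in range(draw_count)
    ]


# Two hypothetical records: 36 indices x 2 instances, then 6 indices x 10 instances.
buf = struct.pack(CMD_FORMAT, 36, 2, 0, 0, 0) + struct.pack(CMD_FORMAT, 6, 10, 36, 0, 2)
draws = expand_multi_draw(buf, 2)
```

Each decoded record maps onto the arguments of a base-vertex/base-instance variant of glDrawElementsInstanced, which is the kind of call a non-indirect fallback would issue instead.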

> So I'm guessing it's an issue with the RTX card driver.

Jacques is using a GTX card as far as I know, though.

Added subscribers: @isvart86, @crantisz

Added subscribers: @smramsay, @lichtwerk

Added subscriber: @ChrisGraz

Added subscriber: @TomaszMuszynski

Added subscriber: @StephenSwaney

Happens here on Linux GTX 1660 with Nvidia driver

Added subscriber: @albo

Happens here on Win10 RTX2070 driver 431.53

Added subscriber: @mugpet

Added subscribers: @Insertzs, @Russ1642

@fclem, please commit whatever workaround is needed to fix this issue, even if it means disabling some of the batching optimizations for NVIDIA. We can't leave an issue with this priority broken for so long in master.

This comment was removed by @ChrisGraz

> In #70011#785398, @ChrisGraz wrote:
> Select NVIDIA-Color Settings and 10 bit Color instead of the Standard Color Settings in the NVIDIA Control Panel to fix this in Windows 10.
>
> Graphics card: GeForce RTX 2070/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 431.86
> Blender Version
> Broken: version: 2.81 (sub 12), branch: master, commit date: 2019-09-27 22:44, hash: 1c1a3198af
>
> Anmerkung 2019-09-28 183507.jpg: https://archive.blender.org/developer/F7778792/Anmerkung_2019-09-28_183507.jpg

I'm using 10 bit colour with my RTX 2080 Ti and the bug is still there.

The committed workaround breaks meshes/objects. Trying to move the default cube in object mode just moves the origin and not the mesh.
Opening a file saved pre-commit moves all meshes (but not object origins) and empties to the world center.
Sorry if this isn't the place to report it and I'll open a new bug report if it's just on my end.
(Linux / 2070)

Added a more restricted workaround in 24be998 due to problems in #70345 (viewport display not showing the correct object transform (location & scale) with the latest nightly build).

Brecht Van Lommel changed title from Shader node link disappears when going to edit mode to Various display issues on NVIDIA after draw call batching 2019-09-29 21:14:39 +02:00

So we went for an aggressive workaround, since the bug could also affect bones. The workaround was to disable indirect drawing for all NVIDIA hardware. This costs some performance, but some of the batching performance gains remain.

About the bug itself, note that only the next instanced draw call (the one issued after a buggy indirect draw call) fails to render.
The indirect draw call responsible for the buggy behavior seems to be any draw call inside the edit mesh pass (bypassing edit_mesh_draw_components fixes it). But for the other problems (other missing instanced draw calls), the culprit is a different indirect draw call.

We tried to replicate the bug in a small test program but couldn't. The issue must come from a special state setup or a special sequence of events.

Added subscriber: @Psy-Fi

Is it possible to capture the render state with something like Renderdoc? Might give you a few hints.
If you haven't tried it already, it might be an issue with alignment of the data in the indirect buffer (wild guess though).

Removed subscriber: @ChrisGraz

This issue is caused by the latest workaround:

  • #70466 (Assigning different materials to different parts of a Mesh cause them to be transparent.)

These issues were solved by it:

  • #70449 (commit 3a08153d7a introduces X11 error)
  • #70413 (2.81 crash on startup - does not get as far as the splash screen)
  • #70104 (Display as Wire makes other objects disappear)
  • #70031 (Crash in edit mode changing from x-ray)

@Psy-Fi We already tried RenderDoc but nothing stood out.

The indirect buffer is aligned correctly, as all indirect calls are rendered correctly on every platform that supports them. We also tried uploading the indirect buffer using glBufferData or glBufferSubData, but that changed nothing. So it's not buffer management.
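
As an illustration of the alignment question raised above, the following Python sketch checks the layout rules I am aware of for multi-draw-indirect ranges: the byte offset must be a multiple of 4, the stride must be 0 or a multiple of 4, and all records must fit inside the buffer. The function name `indirect_layout_ok` and the example numbers are hypothetical. Passing this check says nothing about the driver bug discussed here, which by the account above occurred with a correctly laid out buffer.

```python
WORD = 4  # size of a GL uint in bytes

# Record sizes implied by the command structures in the OpenGL spec.
DRAW_ARRAYS_CMD_SIZE = 4 * WORD    # count, instanceCount, first, baseInstance
DRAW_ELEMENTS_CMD_SIZE = 5 * WORD  # count, instanceCount, firstIndex, baseVertex, baseInstance


def indirect_layout_ok(offset, stride, draw_count, buffer_size, cmd_size):
    """Return True if a multi-draw-indirect range looks well formed."""
    if offset < 0 or stride < 0 or draw_count < 0:
        return False
    if offset % WORD != 0:
        return False  # the offset must be 4-byte aligned
    if stride != 0 and stride % WORD != 0:
        return False  # the stride must be 0 or a multiple of 4
    if draw_count == 0:
        return True
    step = stride or cmd_size  # a stride of 0 means tightly packed records
    last_record_end = offset + (draw_count - 1) * step + cmd_size
    return last_record_end <= buffer_size
```

For example, three tightly packed element commands need at least 60 bytes, so `indirect_layout_ok(0, 0, 3, 60, DRAW_ELEMENTS_CMD_SIZE)` holds while a fourth record in the same buffer would not fit.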

Added subscriber: @Phigon

Lowering priority since there is no known breakage here, just a potential performance improvement by getting indirect calls to work on NVIDIA.

Brecht Van Lommel changed title from Various display issues on NVIDIA after draw call batching to NVIDIA issue prevents full indirect draw call batching performance 2019-10-03 14:26:12 +02:00

For 2.81 we can just keep this disabled and live with somewhat reduced performance, but still better than 2.80.

Removed subscriber: @Phigon

Added subscriber: @dgsantana

So NVIDIA replied and said it was indeed a known driver bug (affecting Turing GPUs) that will be fixed in the next iteration of their drivers (version 440), to be released early next month.

So we should preferably whitelist this driver version before release.

On Windows they have already released 440 and 441. On Linux there is a 440 beta driver.

Probably this is not important enough to risk for the 2.81 release, seems simpler to just postpone it to 2.82.

Added subscriber: @ValZapod

@brecht @fclem Hey, that is not a good idea. Could you at least test it? You know what driver version fixed that. It may also be worth checking which of the connected issues were really caused by those indirect calls; if some of them used the calls incorrectly, that is a separate problem. Also, are you aware of the new Studio driver optimized for V-Ray and Blender [1]? Did you already check it?

It also has some OptiX fixes [2], in particular:

  • [OptiX]: OptiX unable to select GPUs for use when all GPUs in the system are in TCC mode [2722686]
  • [OptiX]: Applications built against OptiX SDK 7.0.0 throw error message when using OptixBuildInputTriangleArray::preTransform. [2742029]
  • [OptiX denoiser]: Errors occur when using the denoiser and specifying Color + Albedo + Normals.
  • [OptiX denoiser]: The Alpha-denoising path does not support blend mode. [2727732]
  • [OptiX denoiser]: OptiX errors occur when specifying Normals + Albedo for input. [2712572]
  • [OptiX]: noinline function calls do not use custom ABI. [2631716]
  • [OptiX]: denoiser device handling preempts user control. [2647842]
  • [OptiX denoiser]: OptiX denoiser produces small color values for black pixels and zero alpha. [2545368]

I would be very happy if you polished Blender for the NVIDIA driver as NVIDIA polished the driver for you ;)

[1] https://www.nvidia.com/en-us/geforce/forums/studio-drivers/39/329442/nvidia-studio-driver-44128-feedback-thread-release/
[2] https://us.download.nvidia.com/Windows/441.28/441.28-win10-nsd-release-notes.pdf

Clément Foucault was unassigned by Dalai Felinto 2019-12-23 13:50:17 +01:00
Clément Foucault self-assigned this 2020-01-23 20:13:25 +01:00

OK, revisiting this one. I think we can enable it for all NVIDIA GPUs now and display a warning about unsupported drivers.

@brecht any issue with that?

It depends on whether that driver update is available for all NVIDIA graphics cards we support. I'm guessing it's not.

The safest option would be to check the driver version instead. Does that require complicated code?
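
To give a feel for how small such a check could be, here is a minimal Python sketch. It is not Blender's implementation. It assumes the version string has the shape shown in the system information of this report ("4.5.0 NVIDIA 430.40"), takes the 440 threshold from the statement earlier in this thread rather than from any verification, and uses hypothetical function names. Treating every non-NVIDIA vendor as unaffected is likewise an assumption made only for this example.

```python
import re

# Driver series that, per the thread, contains the fix. Not verified here.
FIXED_DRIVER_MAJOR = 440


def nvidia_driver_version(gl_version):
    """Extract (major, minor) from a string like '4.5.0 NVIDIA 430.40'."""
    match = re.search(r"NVIDIA (\d+)\.(\d+)", gl_version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def can_use_indirect_draw(gl_vendor, gl_version):
    """Decide whether to enable indirect drawing (illustrative policy only)."""
    if "NVIDIA" not in gl_vendor:
        return True  # assumption: the workaround only concerns NVIDIA drivers
    version = nvidia_driver_version(gl_version)
    if version is None:
        return False  # unrecognized string: stay on the safe, slower path
    return version[0] >= FIXED_DRIVER_MAJOR
```

With the strings reported in this thread, `can_use_indirect_draw("NVIDIA Corporation", "4.5.0 NVIDIA 430.40")` is false, while a 441.28 driver would pass. Falling back to the slower path on an unparseable string keeps the failure mode to reduced performance only.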

Added subscriber: @EAW

@fclem Do you mean enabling the original version of 3a08153d7a without the workarounds?

@brecht If that is what @fclem is suggesting, then wouldn't the driver only have to be available for Turing aka RTX cards (which it is), since the original batching patch only produced errors on cards with this architecture? It is the workarounds that affected other architectures. With the workarounds removed, there shouldn't be an issue for cards with older drivers, right?

You're right, if it's only RTX cards this should be fine.

So the final decision is to just remove the workaround, right? @brecht

Yes.

But do it in master (which I guess you already planned to do), since this is rather risky close to the 2.82 release.

This issue was referenced by fd130a711e99723924e99a60a9e11162ca278fd8

Changed status from 'Confirmed' to: 'Resolved'

Reference: blender/blender#70011