Cycles: NVidia GTX 980 Ti rendering at 1/3rd the speed of NVidia GTX 980 on the same machine?
Closed, Resolved (Public)

Description

System Information
Win 8.1 x64, i7 5820K, 16GB ram
3X GTX 980 Ti

Blender Version
Broken: 2.74, 2.75RC1
Worked: (not yet)

Short description of error
I recently upgraded my primary workstation from 3x GTX 980 graphics cards to 3x GTX 980 Ti cards, but the performance of the new cards is actually about 1/3rd of the performance of the original 3x 980s. What's strange is that these cards score very well on every CUDA and OpenCL benchmark I can find, yet in Blender these 3 cards working together sometimes underperform my single GTX 680. Strangest of all, the cards do show a tiny performance boost on the Blender 2.7 BMW benchmark scene over the previous hardware configuration, but in every scene I design they perform much, much slower. There does seem to be some correlation with which shaders are used in the scene, as some shaders seem to be affected worse than others. Ambient Occlusion, for instance, is hardly affected at all relative to the 980s.

I have tested the new cards with Blender 2.74, 2.75 RC1, and a few nightly builds since then, and this issue seems to affect all current versions.

Exact steps for others to reproduce the error
Using an NVidia GTX 980 Ti, render the following scene to see dramatically hindered performance relative to the hardware potential of the card.



Details

Type: To Do

Hi
I am a beginner in Blender and do not know what's going on. I bought new equipment and thought that everything would be OK...

3x Gigabyte GeForce GTX980TI
Intel Core i7-6700K, 4.0GHz, 8MB
RAM 64GB
Windows 10 Professional 64-bit

SLI 3x GTX 980 Ti GPU
BMW test: 39 sec

Small scenes render fast on the GPU, but a large scene does not move…
Still the same error:
CUDA error: Out of Memory in cuMemAlloc(&device_pointer, mem.memory_size)
The CPU works, but it is so slow.. :(
I need to make a big billboard (2.5 x 5 m at full size) but I would have to wait 1217 hours (CPU)
I have to render it in large format:


I don't know what I can do.. Did I set something wrong?

Thank you for your time!

@Alicja

There are opinions that SLI is not good for rendering, and that makes sense to me, but I have no direct proof. I would try disabling SLI.
I have 2 GPUs and the BMW scene renders in 42 s without SLI.

First you need to reduce all light path tracing parameters to a minimum, or just use direct lighting in the render settings. For such a big scale you need less detail. Is this picture meant to be looked at from a distance?

Also make the tiles small enough for the GPUs to render. What size are you trying to render? A tile must be small enough to fit in the GPU's RAM.

@Andy
Hi
Thank you for the feedback ;)

Yes, the picture will be viewed from a distance of about 5 m.
I use "Pro Lighting Skies" lighting from Andrew Price (Blender Guru).
I learned that I need 14000x8000 at 72 dpi...
I disabled SLI, but there is still the "out of memory" problem; the scene does not move even at a small resolution :(
This problem appears in every one of my projects, even in a simple interior bathroom (2880x1800) like this:


For example, I rendered this bathroom on a new MacBook Pro (20 hours, CPU). When I bought the new computer I thought it would be faster (because it has 3x good cards with 6GB VRAM each), but on the GPU it doesn't move.. really, this is a very simple scene..
I don't know what can I do.

Alicja (Alicja)
Textures take up VRAM; even though you're running 3x 980 Ti, you only have 6 GB total (less what Windows is using), since the memory is not pooled across cards.
A combination of texture size, particles and poly count adds up quickly and can easily saturate 6 GB.
If the scene is too big to fit within your 4-6 GB limit, the render will fail with the CUDA memory alloc error.
Split your scene up into layers, render with a transparent background, and composite them in the node compositor or Photoshop/GIMP if possible.
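As a rough way to sanity-check whether textures alone could be saturating the card, something like this back-of-the-envelope estimate helps (a sketch; the texture sizes below are hypothetical, and real scenes also need VRAM for geometry, the BVH and render buffers, so treat the result as a lower bound):

```python
# Rough lower-bound estimate of texture VRAM for a scene.
# Assumes textures are loaded as 4-byte floats per channel;
# 8-bit textures would take a quarter of that.
textures = [
    (4096, 4096, 4),  # hypothetical wood texture, RGBA
    (8192, 8192, 4),  # hypothetical foliage texture, RGBA
]
BYTES_PER_CHANNEL = 4

total = sum(w * h * c * BYTES_PER_CHANNEL for w, h, c in textures)
print(f"Textures alone: {total / 2**30:.2f} GiB of the card's 6 GiB")
```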

@Adam Majewski (adam) (-pXd-)
Hi Adam
Thank you for the feedback ;)
I know that I have 6 GB total :) I thought that would be enough compared to the previous computer.
In this bathroom there are only two textures (wood and foliage), not too big.
I can't resize anything.. When I used fewer samples it was too noisy.

@Alicja
GPU rendering is definitely the future. Just have a look at what OctaneRender is doing. Otoy.com.

Blender seems to have an issue with the 980 Ti. Three of them, triple the problem??? :-)

I have a 980 and a 980 Ti.
I tried the BMW scene at the resolution you need, 14000 x 8000 at 100%:
200 samples
limited global illumination
tile size is very important: at 160x120 there would be 5896 tiles during the render.
The system promises to be ready in 3 h 45 min. I am not going to wait.

At the resolution scaled down to 50% (14000 x 8000 at 50%), the system only wants 25 min to render 1496 tiles.
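For reference, those tile counts follow from simple ceiling division (a quick sketch; partial tiles at the right and bottom edges count as full tiles):

```python
import math

def tile_count(res_x, res_y, tile_x, tile_y):
    # Partial tiles at the edges are scheduled as full tiles.
    return math.ceil(res_x / tile_x) * math.ceil(res_y / tile_y)

print(tile_count(14000, 8000, 160, 120))  # -> 5896 tiles at 100%
print(tile_count(7000, 4000, 160, 120))   # -> 1496 tiles at 50%
```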

You need to play with the render settings to get the render moving, but keep the tiles at 160x120. When I tried other tile sizes the system locked up.

I also tried the BMW scene at 14000 x 8000 at 100% with:
400 samples
limited global illumination
the same all-important 160x120 tile size (5896 tiles during the render).
The system promises to be ready in 5 h 5 min. I am not going to wait.

If the BMW scene renders, your pictures should move too. You have far fewer speculars.

This is really, really, really off-topic. You might get help at a regular forum, not the bug report tracker.

Everyone: stop posting here unless you think you have information that helps the developers fix the GTX 980 Ti issue.

Ok. I'm sorry... :) I have a problem with slow 3x GTX 980 Ti too, but.. thanks everyone for the help ;) I'll stop posting about the earlier problem now ;) thanks thanks thanks!

@Ton Roosendaal (ton)
But the developer is silent. I, for one, have the 2 GPUs and some free time. I could do some testing if there were something I could test at user level to help...

James Green (Jammer) added a comment. Edited Apr 24 2016, 1:10 PM

I don't know if this will be helpful or not.

I used to have an EVGA GTX580 Classified installed in my main machine. Running any version of Blender on this machine was great. Windows always remained responsive, even when doing a fairly demanding final render.

This exact same machine has been upgraded from Windows 7 to Windows 10 and also upgraded from the GTX580 to a Palit GTX980ti Super JetStream.

Doing a final render on this updated configuration (all other hardware unchanged) and trying to do anything else on the machine isn't anywhere near as smooth. It's incredibly lumpy in fact.

This leads me to think that the issue might not solely lie with Blender but could be a combination of things, in addition to any issues Blender may have with GM200-powered graphics cards.

I'm going to be watching this thread with interest since I've been incredibly disappointed with the Blender performance ever since the upgrade to the GTX980ti.

Thanks for all the hard work on this; I look forward to any results.

Not sure if this is of any help - but I have been testing tile sizes using 2.77 on a 980Ti and Windows 10. I rendered a 600x400 image at 500 samples using the normal path tracer.

I plotted the results, X vs Y, and there are some interesting patterns. I have posted the plot in this thread on the Blender Artists forum (although I am in the process of expanding the data set).

http://blenderartists.org/forum/showthread.php?397423-Old-war-Windows-VS-Linux-(see-my-little-speed-comparison)-)&p=3040912&viewfull=1#post3040912

But to summarise:

  1. I get the best performance when my tile size equates to around 19,000 - 22,500 pixels per tile. The best performing tile sizes were 160x140, 110x200 and 200x110. There is a valley of good render times running through all tile sizes that equate to around 19,000 - 22,500 pixels per tile.
  2. Tiles that are integer divisible into the overall image resolution (e.g. 100x200) don't necessarily give me the best performance, which is somewhat unexpected since partial tiles are supposed to carry a performance overhead. Some integer divisible tiles (e.g. 200x200) give very poor performance; however, it's not a straightforward case of "more pixels = worse performance", since tile sizes that have more pixels and are also integer divisible give better render times (e.g. 300x200).
  3. There are islands of poor performance regardless of whether they are full or partial tiles (e.g. the areas around 200x200, 290x130). Tiles all around these 'islands' render faster.

Some tile size render times of note (in seconds; a small sorting sketch follows the list):

160x140 = 11.34

100x100 = 16.03
150x100 = 12.65
200x100 = 11.87
300x100 = 14.49

100x200 = 12.23
150x200 = 15.64
200x200 = 17.30
300x200 = 15.20

600x400 = 14.36
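To make the pixels-per-tile pattern explicit, the timings above can be sorted by render time (a small sketch using only the data in this comment):

```python
# Sort the timings above by render time and flag the ~19,000-22,500
# pixels-per-tile band; the fastest sizes cluster inside it.
times = {
    (160, 140): 11.34,
    (100, 100): 16.03, (150, 100): 12.65, (200, 100): 11.87, (300, 100): 14.49,
    (100, 200): 12.23, (150, 200): 15.64, (200, 200): 17.30, (300, 200): 15.20,
    (600, 400): 14.36,
}
for (w, h), t in sorted(times.items(), key=lambda kv: kv[1]):
    band = "<- in band" if 19_000 <= w * h <= 22_500 else ""
    print(f"{w}x{h}: {w * h:6d} px/tile, {t:5.2f} s {band}")
```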

No activity here for a while. Any progress on this?

The tile size graphs you created hold true for my Titan X on Windows 10 also; it prefers relatively small tiles. I also have an old Titan 6GB in the same machine which renders much faster than the Titan X. Particles, smoke, SSS and real-time viewport rendering (not tiled) take the biggest hit on the Titan X, where it grinds to a painfully slow speed compared with the old Titan.

hi

There is something really wrong on Win10: the BMW test runs in 1:45 on a Titan X. I tried 5 different NVidia drivers with no effect.
Then I overclocked my card to 1380 MHz core and 7400 MHz memory, and my time was 1:32, so hardly any effect.

I installed Linux and ran the same test in Blender: 1:05.

Roman (kOsMos) added a subscriber: Roman (kOsMos). Edited May 16 2016, 2:00 AM

Obviously there is a problem with Cycles. Horrible GPU render performance on a GTX Titan X with v2.77a. Basically the Titan X is 127% slower than the 780 Ti when rendering Mike Pan's BMW scene. With v2.76 it is only 32% slower, but it should not be slower at all! I am no expert, but the first thing that comes to mind is most likely bad code in the kernel file kernel_sm_52.cubin OR the larger VRAM on the 980 Ti and Titan X.

To prove that it is in fact Cycles causing the performance degradation on the 980 Ti and Titan X: in Octane, the same card renders 20-30% faster than the 780 Ti, the way it should!

*160x120 tiles seem to be the magical size! In version 2.77a the Titan X finished this scene in 28 s, but in v2.76 with the same tile setting it finished in 30 s. Still slower than the 780 Ti.

Windows 10 Pro
Scene: BMW1M-MikePan.blend
GPU1: Titan X vs GPU2: 780Ti
Blender Version: v2.77a vs v2.76
Tiles: 512x512

Results:

v2.77a
Titan X: 50s
780 Ti: 22s

v2.76
Titan X: 29s
780 Ti: 22s
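For clarity, the percentages quoted above follow directly from these times (a trivial check):

```python
# Relative slowdown of the Titan X vs the 780 Ti from the times above.
for version, titan_x, gtx_780ti in [("v2.77a", 50, 22), ("v2.76", 29, 22)]:
    print(f"{version}: {(titan_x - gtx_780ti) / gtx_780ti:.0%} slower")
# -> v2.77a: 127% slower, v2.76: 32% slower
```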

@Andy (AndyZ), developers are silent because they are kind of in the middle of something. This is NOT a forum; flooding the report will not make developers more active and will not lead to a faster bug fix.

Please don't use the bug tracker for user-to-user communication; use the BA forum instead and only put really helpful information here.

@Alicja (Alicja), this is out of the scope of this report.

@Lee Jones (moony), you can always gain some percent of speedup by fine-tuning parameters for particular hardware. That's NOT what we're troubleshooting. While you say there is a more optimal tile size, it will be more optimal on all OSes (since it depends on the hardware, basically the number of threads on the GPU). Tweaking the tile size will NOT change the fact that Windows 10 renders 3 times slower than Linux or Windows 7.

@Roman (kOsMos), we don't have a Titan X here in the studio, but I cannot confirm such a slowdown with the 980 Ti (which is the same Compute Capability, 5.2) we do have here; both 2.76 and 2.77a behave equally slowly.

To conclude

So what we did a while ago was create a special build which avoids ANY CPU interaction during GPU rendering, removing any possible latency (a tile was fully sampled on the GPU; only the final tile result was reported back to the CPU). This way we are sure we are loading the GPU as much as possible.

This build was tested by @Adam (-pXd-) (only by him, btw; nobody else even dared to do the tests needed for further investigation). It did not give any measurable difference in render time (if I read the timings correctly and compare them to the proper baseline). This means the root of the render time difference between platforms is not the way we launch CUDA kernels; the root of the issue is inside the driver or the OS itself.

We also went a couple of versions back and compared render times between Windows 10 and Linux, and Linux was consistently faster (around 3 times) on the same hardware. (And as we tested before, Windows 7 was at about the same level as Linux.) So we cannot confirm any claims that sm_52-based cards were faster in previous releases.

We also tested a regular 980 card (NOT a Ti) on Windows 10 and Linux in the same machine. And sure enough, it was much slower on Windows 10 again.

All this currently leads us to the conclusion that something is fundamentally broken in either Windows 10 itself or NVidia's driver for this platform. This isn't something we can look into ourselves (well, we could, but MS is not really happy about reverse-engineering their products ;). Until some major update happens from either Microsoft or NVidia, I don't see what else we can do here currently.

P.S. Comparing Cycles to Iray is not really legitimate. Iray is specifically designed and optimized for the CUDA architecture. Additionally, as far as I can see nobody has compared Iray on Win10 and Linux, so you can't say Iray's performance is at its maximum either; it might be the same 3x faster on Linux.
P.P.S. Again, even a suboptimal design of the Cycles kernel does not change the fact that it is only slow on Windows 10 and much, much faster on Win7 and Linux.

I would agree with your conclusion only if Octane were also slower, but it's 20%+ faster than the 780 Ti, which is expected. So pointing at Windows 10 or NVidia drivers is possible, but I doubt it. I am going to test your conclusion on a fresh install of Win7 Pro and report back soon. Thanks.

@Sergey Sharybin (sergey) this is very interesting news. What should we Win10 980Ti/TitanX users do next? How should we highlight this to the companies that can make a difference?

Did anyone test the TDR-related settings and their impact on this?

https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918%28v=vs.85%29.aspx

I would increase all timeouts by a factor of 10 at least, and maybe just try to disable TDR altogether by setting the level to 0.
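For anyone who wants to try this, a minimal sketch of the registry tweak in Python (assuming the documented TdrDelay/TdrLevel values from the MSDN page above; requires administrator rights and a reboot, and note that disabling the watchdog means a truly hung GPU will freeze the machine):

```python
# Sketch: raise the GPU watchdog timeout on Windows (run as administrator,
# reboot afterwards). TdrDelay is the timeout in seconds (default 2);
# TdrLevel 0 disables detection entirely, 3 is the default recovery mode.
import winreg

KEY = r"System\CurrentControlSet\Control\GraphicsDrivers"

with winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, KEY) as key:
    winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 20)
    winreg.SetValueEx(key, "TdrLevel", 0, winreg.REG_DWORD, 3)
```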

How can it be an NVidia driver issue when previous versions of Blender worked fine with the 980 Ti and Titan on the same OS and NVidia driver? The only thing that changed was the Blender version, which tells me it's something in the new Blender version that is the issue. Furthermore, this is a cross-platform issue. I have OSX 10.10.4 and have the same problem.

@Mike (LMProductions), we can only comment on bugs we can reproduce, and as I already mentioned above: we can NOT confirm that previous versions of Blender are measurably faster so far.

You might be experiencing another issue (which you actually reported in T47808, and where a crucial part of the information was given by @Joel Gerlach (joelgerlach) -- you can't compare the 2.76 and 2.77 releases with a smoke render because 2.76 did not have smoke on the GPU). I'll check that report tomorrow and, if it's reproducible here, re-open it as a separate issue.

Roman: Octane is a CUDA-only render engine; it probably uses much smaller kernels - it's a completely different architecture.

Further: I have witnessed Sergey doing tests, and he did an incredibly thorough job, spending days on it.
This is not in our hands anymore.

The error is in the Nvidia drivers for Windows 10. We cannot fix this. Tell Nvidia. Tell Microsoft. Or use Linux or Windows7.

I'd like to also add that Windows 8.1 works too, Just as good as Windows 7.

What works? Are you telling me that you can render Mike Pan's BMW scene in under 20 s with a Titan X or 980 Ti under Windows 8.1 or 7??!

Did anyone test the TDR-related settings and their impact on this?

https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918%28v=vs.85%29.aspx

I would increase all timeouts by a factor of 10 at least, and maybe just try to disable TDR altogether by setting the level to 0.

I changed TdrDelay and TdrLevel (22 for each) and whilst there is no improvement in speed, as was to be expected, there is a great improvement in stability. On the default Windows 10 values I would get TDR errors and Blender would crash when rendering complex scenes progressively. Increasing the TDR delay lets Windows give the graphics card more time to render, so it no longer produces TDR timeout errors. I'm now happy with this setting, and although the Titan X is not quite as fast as the old Titan, it is now stable and renders OK using relatively small tiles. Let's hope NVidia and/or Microsoft improve the speed someday!

Lee Jones (moony) added a comment. Edited May 16 2016, 11:44 PM

There must be something going on internal to Blender though (in addition to any Windows 10 issues).

Over the past year or so that I have had my Windows 10 / 980 Ti machine, I have run the BMW benchmark on a few versions of Blender as they were released. Using a tile size of 960x540 (which is what I determined gave me the fastest render under 2.75a) I got the following render times:

2.75a (Sep 2015) = 1:16
2.77 (April 2016) = 2:43

Ok - it could be argued that patches/drivers for Windows and NVidia etc. may have changed between versions and could account for the massive performance hit - so I just downloaded a few versions of Blender from the repository and re-ran the test back to back (a sketch of this kind of timing loop follows the numbers below). I simply loaded the BMW scene and hit render - so all the scene settings are as loaded from the file.

2.73 = 1:04
2.75a = 1:19
2.76 = 1:19
2.77 = 2:02
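Something like the following is all this back-to-back test needs (a sketch; the install paths and the .blend location are hypothetical, and -b / -f are Blender's standard background-render flags):

```python
# Render the BMW scene headless with each Blender build and time it.
import subprocess
import time

builds = {
    "2.73":  r"C:\blender-2.73\blender.exe",    # hypothetical paths
    "2.75a": r"C:\blender-2.75a\blender.exe",
    "2.76":  r"C:\blender-2.76\blender.exe",
    "2.77":  r"C:\blender-2.77\blender.exe",
}

for version, exe in builds.items():
    start = time.perf_counter()
    # -b = run in background (no UI), -f 1 = render frame 1
    subprocess.run([exe, "-b", "BMW1M-MikePan.blend", "-f", "1"], check=True)
    print(f"{version}: {time.perf_counter() - start:.0f} s")
```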

Whilst rendering these scenes I monitored system performance using a utility called GPU-Z. Most of the figures looked the same regardless of which Blender version I was using (GPU load, memory used, power consumption etc.) - however one figure was very different: "Memory Controller Load".

For 2.73, 2.75a and 2.76 this value sat around the 30-40% mark throughout the render - bouncing around a little, but consistently high. For version 2.77 however, "Memory Controller Load" peaked at around half that figure - never getting any higher than around 15% at any point in the render.

I don't know whether higher or lower figures are better when it comes to this parameter - but it did strike me as odd that it should undergo a step change between Blender versions in much the same way as the render times have.
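For anyone without GPU-Z, roughly the same counters can be logged from a second terminal with nvidia-smi (a sketch; as far as I know, utilization.memory is the same memory-controller busy percentage that GPU-Z labels "Memory Controller Load"):

```python
# Sample GPU load, memory-controller load and power draw once per second
# while a render runs in another window; stop with Ctrl+C.
import subprocess

subprocess.run([
    "nvidia-smi",
    "--query-gpu=utilization.gpu,utilization.memory,power.draw",
    "--format=csv,noheader",
    "-l", "1",
])
```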

@Lee Jones (moony) I am beginning to believe there is a memory component to this as well. In all our tests here (Moony, we have the same types of results as you, showing a marked slowdown in 2.77 that is not present in 2.75 or 2.76 while using the same drivers) I've noticed that the Memory Controller Load on the Titan X is significantly reduced in 2.77 in a way it is not in other renderers. So I can independently verify your results. Is it possible that GPUs with more VRAM (6GB and over) are experiencing some kind of memory leak or underutilization due to some kind of RAM limit? I'm a bit out of my depth on this, clearly.

@Sergey Sharybin (sergey) Thanks for bringing some order back to this report. :) I do want to note that I believe the smoke-on-GPU issue is related to this problem, not independent from it, as in that environment the 780 Ti was exponentially faster than the Titan X when rendering smoke. I believe volumetric calculations exacerbate the problem we're reporting here, though the subject of the rendering is slightly different. We're still seeing the 780 Ti outperform the Titan X every single time. That is, whenever we can get it to render, as the 780 Ti's 3GB VRAM limit is rather limiting these days. :)

I did not see the earlier build you had, Sergey, for testing out the GPU calculations. If you'd like, I can run some tests here on the Titan X to see if the results are similar to what Adam got. Where can I find that build?

Perhaps we just need to pool together to buy the foundation a Titan X. :)

Higher vRAM was what I also suggested could be the issue.

Does it make a difference which OS a program gets compiled on?

Lee Jones (moony) added a comment. Edited May 17 2016, 12:37 AM

I have just logged the results of my testing to a file so I could analyse them in Excel.

I re-ran the test again on 2.73, 2.76 and 2.77

Render time:
1:05, 1:19, 2:02

Average Memory Controller Load (%)
30.1%, 28.2%, 17.6%

Average Power Consumption (% TDP, 'Thermal Design Power' apparently)
59.4%, 57.2%, 53.8%

I plotted render time vs power consumption and memory controller load - and there is a good correlation with a linear least squares fit (especially for render time vs Memory Controller Load, which has an r-squared value greater than 0.99).

It seems that the memory controller loading is much lower in 2.77 compared to 2.76 and 2.73. The power consumption of the card is also much lower (meaning it is being underutilised?)

I guess what this data does not answer is whether the lower memory controller load and power consumption is the cause of the longer render time - or a consequence of it.
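The quoted fit can be checked directly from the three data points (a small sketch, with the times converted to seconds):

```python
# Render time (s) against average memory controller load (%) for the
# 2.73, 2.76 and 2.77 runs above; 1:05, 1:19, 2:02 -> 65, 79, 122 s.
import numpy as np

load = np.array([30.1, 28.2, 17.6])
time_s = np.array([65.0, 79.0, 122.0])

slope, intercept = np.polyfit(load, time_s, 1)
r = np.corrcoef(load, time_s)[0, 1]
print(f"time = {slope:.1f} * load + {intercept:.1f}  (r^2 = {r * r:.3f})")
```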

I would have to agree. My power consumption is low on the 980 Ti - low enough that the fans stay near idle when rendering, which isn't the case on my 760 and isn't the case when rendering in Iray on the 980 Ti (though as Sergey mentioned, you can't really compare with Iray). Also, I'm not sure if this is what you're referring to with the memory controller, but my scene takes a lot longer to load onto the 980 Ti.

Roman (kOsMos) added a comment. Edited May 17 2016, 3:48 AM

Roman: Octane is a CUDA-only render engine; it probably uses much smaller kernels - it's a completely different architecture.

Further: I have witnessed Sergey doing tests, and he did an incredibly thorough job, spending days on it.
This is not in our hands anymore.

The error is in the Nvidia drivers for Windows 10. We cannot fix this. Tell Nvidia. Tell Microsoft. Or use Linux or Windows7.

Got under 20 s with the Titan X in Windows 8.1. Tiles of 512x512 seem to give the best result.
How come it has not yet been documented in the manual that there is an issue with Windows 10 and that people should use Win 7 or 8.1 for optimal performance? This should be in bold letters on every page! :) This is not a small thing to ignore; going from 56 s in Windows 10 to 19.55 s (at stock GPU clocks) in Windows 8.1 is a huge time saving. So Linux should be even faster? Same results in v2.77 and v2.76 under Windows 8.1!

Win: 8.1 Pro
GPU: Titan X
Blender: v2.77a and v2.76
Scene: BMWM1 Mike Pan
Tiles: 512x512
Memory Controller Load: 56% (max)
Power Consumption: 75%(Max)
GPU Load: 99%
*Make sure to set GLSL under Image Draw Method for even faster renders; this sped up my test render by 2 seconds, basically 10%!*

So what does this prove? It seems to point to a bug in the GeForce drivers for Windows 10 affecting Maxwell GPUs, as @Ton Roosendaal (ton) pointed out. I have sent a ticket to NVidia support and will report back once I hear from them!

Here is more good stuff:

*When rendering in Windows 10, the Titan X is processing 7 samples per second, but in Windows 8.1 it is processing at least 21 to 25+!!!*

*GPU Memory Controller Load in Windows 10 is max 22%, whereas in Windows 8.1 it is 56%.
Something is throttling the GPU in Cycles.. I also tested Octane under Win 8.1 and it performs exactly the same as in Win 10.

  • In Octane, GPU Memory Controller Load is at 70%!!! So why does it "cap" at 22% in Cycles? This points more to a bug in Cycles being incompatible with Windows 10 than to the NVidia driver. We will find out soon.

I think we are getting very close to where this bug is! :)

@Lee Jones (moony) thanks for pointing out the Memory Controller Load, I think this is a huge lead!

Issues with speed regression on the same OS, same GPU and otherwise identical setup belong in T47808. I'll be checking it today.

It is important to separate two completely orthogonal cases:

  1. Render time differences caused by ongoing Blender development.

These can be detected by running different versions of Blender on the same exact machine. This seems to be in the scope of T47808, not this report. We'll look into this issue; the first guess is that it's caused by higher register pressure from enabling SSS by default.

  2. Render time differences of the same Blender version on different OS/hardware.

This is what this report is about, and something we concluded is not under our control.

Replying to the power consumption theory: making decisions based on memory controller power consumption is totally misleading. The kernel can be using local registers and not accessing global memory. Even though that causes lower load on the memory controller, it is actually the ideal scenario -- all the data is local and can be accessed without the penalties of accessing global memory.

Additionally, Blender can NOT control any power settings. Period. That's all under the driver settings and such.

Another possibility here is that it's something to do with the switch to a newer CUDA toolkit.

But please, don't make a mess of the bug reports. Again, this is not a forum; developers need time to deal with all the piling-up bugs and (most importantly) need to be able to reproduce the bug. We will be looking into the reported speed regression on the same hardware configuration. Please just be patient.

@Sergey Sharybin (sergey)

Let me know if you want me to do any specific Linux tests with my same Win 10 standalone 980 Ti rig; I'll just boot into a live/persistent USB.

So that we don't lose momentum on this I've created a Blender Artists thread

And a thread on the GeForce forums

I cannot find a suitable place to report this to Microsoft. Anyone know?

I reported this bug to NVidia and they're asking whether the problem still exists with the latest driver version (365.19). Can anyone confirm?

The problem is the same with the latest driver.

Klaus (Klaus) added a comment. Edited May 17 2016, 9:50 PM

Got under 20s with Titan X in Windows 8.1. Tiles 512x512 seem to give best result.

Are you talking about the new BMW scene with two cars (BMW27.blend)?

There are lots of reports of this scene taking more than one minute with Blender 2.77, Linux and a single Titan X. On my system I get 1:06 (which is not significantly better than the results with a 760ti). 20 s would be really satisfying.

Update: I just did a test on Win 8.1 Enterprise on my Titan X (driver version 365.19): 1:08 min. There is no big difference between Linux and Win 8.1 on my computer.

Lee Jones (moony) added a comment. Edited May 17 2016, 10:55 PM

The problem is the same with the latest driver.

Yep, I concur - I updated my drivers to 365.19 last night. Results today are the same (I have posted some figures on the Blender Artists thread).

Roman (kOsMos) added a comment. Edited May 17 2016, 11:46 PM

Got under 20 s with the Titan X in Windows 8.1. Tiles of 512x512 seem to give the best result.

Are you talking about the new BMW scene with two cars (BMW27.blend)?

There are lots of reports of this scene taking more than one minute with Blender 2.77, Linux and a single Titan X. On my system I get 1:06 (which is not significantly better than the results with a 760ti). 20 s would be really satisfying.

Update: I just did a test on Win 8.1 Enterprise on my Titan X (driver version 365.19): 1:08 min. There is no big difference between Linux and Win 8.1 on my computer.

I use the old BMW1M-MikePan.blend scene.

There is no difference with the latest GeForce driver 365.19.

Here are the results with the BMW27.blend file:

Win 10
240x136 1:56 (Default)

Win 8.1
240x136 1:01 (Default)
512x512 0:51
480x540 0:49

No change with GeForce driver 365.19. All results are the same as I reported before.

Thanks! I've reported it back to NVidia.

Just for completeness - I tried 365.19 as well, no change in performance.

BMW27.blend.
Hard slowdown here too.

Blender 2.73a
Driver 364.72
RAM 64GB

CPU: Intel Core(TM) i7-5960X @ 3.00 GHz (running at 4.17 GHz), 8 cores / 16 logical
GPU: GeForce GTX TitanX
OS: Windows 10 64bit

Time: 0 min 23 seconds (4x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto
Time: 1 min 02 seconds (1x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto

Time: 1 min 51 sec (CPU) 32x32 Tiles Auto

Blender 2.76
Driver 364.72
RAM 64GB

CPU: Intel Core(TM) i7-5960X @ 3.00 GHz (running at 4.17 GHz), 8 cores / 16 logical
GPU: GeForce GTX TitanX
OS: Windows 10 64bit

Time: 0 min 29 seconds (4x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto
Time: 1 min 32 seconds (1x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto

Time: 1 min 57 sec (CPU) 32x32 Tiles Auto

Blender 2.77a
Driver 364.72
RAM 64GB

CPU: Intel Core(TM) i7-5960X @ 3.00 GHz (running at 4.17 GHz), 8 cores / 16 logical
GPU: GeForce GTX TitanX
OS: Windows 10 64bit

Time: 0 min 46 seconds (4x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto
Time: 2 min 24 seconds (1x GeForce GTX TitanX - CUDA) 240x180 Tiles Auto

Time: 1 min 56 sec (CPU) 32x32 Tiles Auto

NVidia has now been able to confirm the issue and has assigned it to the appropriate developer team. They had to put in quite some effort before they succeeded in reproducing it, testing with various GPUs, so I think they're taking this seriously.

That's good news, thanks very much for referring the problem to them.

@Brecht Van Lommel (brecht) Thanks for the update, looking forward to getting this sorted!

That's very good news.
After months of testing, it's good to hear that there actually is a problem and that it was identified by the card supplier.

I really hope they'll find a solution very soon and we'll be able to enjoy our investment :)

Thanks for having pushed NVidia.

Roman (kOsMos) added a comment. Edited May 23 2016, 4:40 PM

I've been communicating with NVidia Support for several days now; I also made a video for their dev team showing exactly how to reproduce the issue in Win 8.1 and Win 10. This is the last worthwhile response I received from them. Supposedly the error has something to do with WDDM 2.0, since that's what Windows 10 uses; Windows 7/8/8.1 use WDDM 1.1/1.2/1.3. yw :D

Subject
Rendering time higher in windows 10

Discussion Thread
Response Via Email (Farzana) 05/23/2016 06:07 AM
Hello,

Your case is being escalated to our Level 2 Technical Support group for further attention. The Level 2 agents will review the case notes to troubleshoot the issue and find a solution or workaround. As this process may take some time and require a good deal of testing and research, we ask that you be patient. A Level 2 tech will contact you as soon they can to assist or point you in the right direction.

Best Regards,
-Farzana
NVIDIA Customer Care

Remember, this is a Mac issue too, not solely Windows. Keeping that perspective in mind might help NVidia narrow down the issue.

Response Via Email (Troy) 05/24/2016 04:34 PM

Hello,

NVIDIA QA has reproduced the problem and Engineering is investigating it.

Best Regards,

Troy
NVIDIA Customer Care

There is no reason to paste every occasion when NVidia support accepts your issue; we already know that they accepted it. Respect others' time and the time of developers who'll need to scroll over all the comments here trying to extract meaningful information.

Anyway, did anyone try the 368.22 driver? From some feedback it seems it brings performance improvements.

2X Titan X
Windows 10 Home
Blender 2.77 2016-05-24 (nightly build from buildbot)
Nvidia Driver: 368.22

I upgraded to 368.22 yesterday and noticed this morning that the viewport sampled much faster, so I ran the bmw27 benchmark scene. I went from ~1m15s with my dual Titan X setup (which, FYI, was slower than my dual GTX 580 at 1m11s) down to 36 s, and to 33 s with larger tiles. (The issue before was that large tiles worked really well on Fermi cards, while on Maxwell small tiles gave slightly better times.)

On a side note, before this I got UI lag when using all cards, leaving none for the OS. Now it's buttery smooth, and they peak at 75C. Whatever they did, it's a move in the right direction; for me, with the GM200 / Titan X, it's a huge boost.

Roman (kOsMos) added a comment. Edited May 26 2016, 1:57 PM

There is no reason to paste every occasion when NVidia support accepts your issue; we already know that they accepted it. Respect others' time and the time of developers who'll need to scroll over all the comments here trying to extract meaningful information.

Anyway, did anyone try the 368.22 driver? From some feedback it seems it brings performance improvements.

You can't be serious?
A lot of people here are curious to know that NVidia Support was able to "reproduce" this problem, which is, I dunno man, pretty darn good information. Besides, I am not posting "each occasion", only what is important. If you really want, I can post the entire email thread here, which has over 10 replies. I also made a video outlining step by step how they can reproduce this problem. So I don't know what you are talking about regarding respecting others' time here. I am the one that put in the effort of making a how-to video for them, so don't talk to me about respecting others' time.

Also updated the drivers to 368.22.
Windows 10 Professional edition.

Here are the results with bmw27 - 400 samples:

Titan Black, tiles 256x256:
1 minute 09 sec

TITAN X, tiles 256x256:
2 minutes 39 sec

I've tested other scenes of my own.
The TITAN X performance is still really bad on my station.

Looking forward to new drivers and NVidia's updates.

@Roman Timashev (roman) @Sergey Sharybin (sergey), please guys, don't get upset. I believe everyone is weary of this problem, as we have all invested in these cards and can't solve it.
We all appreciate everything you're doing to solve this.

Pierrick

Adam (-pXd-) added a comment. Edited May 26 2016, 2:19 PM

@Sergey Sharybin (sergey)

Just tested new drivers (368.22) on the EVGA 980ti and EVGA 760.

Mike's BMW 2.7 scene:

My previous time was 1:54 on the 980ti.
My previous time was 2:07 on the 760.

New drivers time was 1:56 on the 980ti.
New drivers time 2:06 on the 760.

Though for the sake of accuracy: this time the 980 Ti is running the desktop.

Doesn't seem like much improvement for the 980 Ti.

Tile size was unaltered: 240 x 136

Interesting results, and quite weird. It might not be related to the drivers, but to the work @Brecht Van Lommel (brecht) did to reduce stack memory usage on the GPU.

So perhaps one needs both the latest driver AND the latest builds from builder.blender.org?

Testing with the latest Blender build 2.77.1 (F2ba 139) - WINDOWS 10

bmw27 - 400 samples - tiles 256x256

Titan Black
1 minute 09 sec.

TITAN X (HOLD YOUR BREATH!!!!)
1 minute 06 sec.

This is the first time I have experienced better performance with the TITAN X, and compared to the official release it's more than 2x faster (2 minutes 39 sec).

Using the TITAN X with 512x512 tiles:
56 sec.

Using the TITAN X + TITAN Black with 256x256 tiles:
35.88 sec.

Another great piece of news is that when using both cards for viewport rendering, it works very, very fast, whereas before, using both cards was slower than using only the TITAN Black (as if the X was slowing everything down).
I've tested other scenes with much more complex shaders and textures and it works very, very well.

So it seems this build works wayyyyyyy better than the current official release.
Really looking forward to your analysis!

Well, kudos to @Brecht Van Lommel (brecht) for that :) And big ones ;)

Now the important question: is it the same speed on Win10, Win7 and Linux? That still needs to be figured out, because who knows, maybe Linux is still 2x faster.

@Sergey Sharybin (sergey)

Just downloaded latest from builder.blender.org

Definitely faster on the 980ti:

Mike's 2.7 scene: 1:06!

50 seconds faster than the official build it seems for me.

Roman (kOsMos) added a comment. Edited May 26 2016, 3:28 PM

@Sergey Sharybin (sergey)

Just downloaded latest from builder.blender.org

Definitely faster on the 980ti:

Mike's 2.7 scene: 1:06!

50 seconds faster than the official build it seems for me.

As I suspected: a bug within Blender/Cycles.

New speed record! 18.96 s with the Titan X, 480x540 tiles, on Windows 10 Pro with the latest Blender build, rendering BMW1M-MikePan.blend.
This is 1 s faster than Windows 8.1 Pro.

So what has been changed? Why is the Titan X all of a sudden even faster than the 780 Ti?

Windows 10
NVidia 362.22
Blender 2.77.0 (Nightly - Thu May 26 04:46:46 2016)

160 x 160 - 1:24
180 x 180 - 1:14
240 x 180 - 1:09 (down from 2:10)
240 x 240 - 1:09
480 x 480 - 1:03
512 x 512 - 1:04

Using Auto Tile set to Custom with a Target Size of 480

480 x 270 - 1:01

Using Auto Tile set to Custom with a Target Size of 540

480 x 540 - 1:02

5/26 build of Blender 2.77

Nvidia 368.22 drivers

52.92 seconds at 512x512 on Windows 10 Pro 64-bit with an EVGA 980 Ti SC ACX 2.0

52.99 s on the 2nd run

Nothing was changed from my side,

Before and after installing driver 368.22: 1 min 29 sec
Card: nVidia GTX GeForce Titan X
Tiles: 128X128
Blender version: 2.77a

Nothing was changed from my side,

Before and after installing driver 368.22: 1 min 29 sec
Card: nVidia GTX GeForce Titan X
Tiles: 128X128
Blender version: 2.77a

you need to download the latest build https://builder.blender.org/download/

Thank you Roman,
I just downloaded the latest build, but now my rendering time is 8 sec longer:

1 min 37 sec

OS: Windows 10 64-bit

Greg Zaal (gregzaal) added a comment. Edited May 27 2016, 8:31 AM

Just tested on Linux & Windows 10 with the latest Blender and the latest drivers; I can confirm this issue now seems to be resolved \o/

Before (Build from March):
-Win10 (driver v361.91): 02:00
-Win10 (driver v368.22): 01:58
-Linux: (352.63): 01:04

Now (Build from yesterday):
-Win10 (368.22): 00:54
-Linux (352.63): 00:53

Also tested the other 2 blends on the spreadsheet and added my results: https://docs.google.com/spreadsheets/d/1KS4Ew6wfNmGHVQ_GPUmvdBpuvgKIzgwn_yMV6j_rzJ0/edit?usp=sharing

Only updating to the new driver did not help, but updating to the new Blender fixed it :)

I didn't test whether using the old driver with the new Blender also works.

@Greg Zaal (gregzaal), thanks, very good comparison :)

So it seems the changes from @Brecht Van Lommel (brecht) not only improved memory usage and led to a speedup on Linux, but also made the Win10 drivers happy.

However, it is important to understand that the Win10 drivers are still considered broken and the NVidia guys are looking into this. The thing is: while we managed to reduce the stress on the GPU and made the drivers happy, once we increase the complexity of the kernel again (upcoming mipmaps, denoising, microdisplacement...) we pretty much risk running into the same exact speed issue.

@Sergey Sharybin (sergey) Agreed. As far as I can tell from what I've read, NVidia had to incorporate some form of shim into the driver due to the async DX12 problem. I'm not completely up to speed on this, but I think there are some issues with how this shim is affecting things like performance.

It's also interesting that this is a memory-related thing, since I have a blend here using a large model which refused to render on the 980 Ti, but in these newer builds it flies through the GPU.

I think we can close this report now and consider it resolved. Certainly we'll have to keep an eye on this to ensure it doesn't break again, and hopefully NVidia can improve things on their side. But latest Blender builds should be working OK now with the 980 Ti.

Various other issues came up here and those can get their own reports if they are confirmed and worth investigating further; not all 52 subscribers of this report need to be involved.

Thanks all for the tests & patience.

Mike (LMProductions) added a comment. Edited Jun 1 2016, 12:55 AM

I think we can close this report now and consider it resolved. Certainly we'll have to keep an eye on this to ensure it doesn't break again, and hopefully NVidia can improve things on their side. But latest Blender builds should be working OK now with the 980 Ti.

Various other issues came up here and those can get their own reports if they are confirmed and worth investigating further; not all 52 subscribers of this report need to be involved.

Thanks all for the tests & patience.

I'm glad things are considered resolved, but I had reported a problem on March 15th (tasks T47807 and T47808) where I had slow renders on a Mac Pro using the GTX 980 Ti and Titan X -- especially with fire. I know fire rendering was just added in 2.77, but rendering fire on the GPU takes 10-20x longer than with 8-core 3 GHz CPUs (literally). A few days later it was merged into this task because the issues were thought to be similar. Now this task is resolved, but I still have $2400 of new GPUs I can't use. And with the 1080 coming out soon, the money is practically lost entirely. My intention is not to blame anyone here for the loss, but the point is, this is a serious problem for us. We have this issue on 4 different machines running 2.77a and OSX; I reported it on March 15th, followed up on it and did several render tests on my end, and to date nothing has been done to address it. Can someone please acknowledge this, and can we get the ball rolling on troubleshooting it please??

@Mike (LMProductions), have you tried the latest OS X builds from builder.blender.org and confirmed that your issue still exists?

If it does still exist, please comment on the other ticket with render time comparisons between the CPU and GPU for a .blend that we can test, and we can reopen it as a to-do item. Please do understand though that this Windows 10 issue got a lot of developer attention because it affects many users and every type of scene. If GPU smoke rendering on OS X is slow and only confirmed by one user, it's unlikely to be prioritized by developers over the hundreds of other tasks on their list.

Render times and a .blend file were uploaded to the task in March. Another user also tested the .blend and confirmed he had the issue as well -- and again, all this was posted in the thread in March. So if you can please re-open the task, that would be a good start.

Mike: this is an open source project and a lot of people here spend their free time on Blender development and debugging. The goal of reporting bugs is to help make Blender better for everyone.

I would also complain to Apple, and to Nvidia or AMD, if you have issues. After all, they took your money.

Thanks to everyone involved in this, especially Brecht, Ton, Sergey and of course DingTo:

Performance here (Win7) is noticeably better with the new 2.77 builds on the 980 Ti than in 2.76b!!!
Also there is a BIG performance increase in scene preparation time!

Thank you again!

best regards,
Karsten Bitter