
Cuda error illegal address when rendering more than 5-10 frames of an animation
Closed, Invalid (Public)

Description

This is my first bug report, but I will try my best to include all relevant information.

I'm getting a CUDA error: illegal address in cuCtxSynchronize() when I try to render an animation. I would like to let my computer run overnight, but I can't seem to get more than about 10 frames done before I get this error. It happens on versions 2.78 and 2.78a as well as the latest daily build, which I tried today (11.6.16). One thing to note: with the daily build I didn't actually get the CUDA error, Blender just crashed.

Version 2.77a renders the same animation with no problems.

My system info: OS X 10.10.5 and Windows 10. CPU: i7 4770k 3.5 GHz (3.9 GHz turbo boost). Motherboard: z87x ud5th. Graphics cards: 2x GTX 780 Ti. I have the latest NVIDIA drivers: NVIDIA Web Driver 346.02.03f08 (for Mac OS X 10.10.5) and 375.70 (Windows 10). I'm using my CPU's integrated graphics (hd4600 iris) for my OS so that my 780 Tis are dedicated to rendering.

I'm rendering on the GPU, which is how I get the CUDA error.

I've tried rendering a couple of different animations to see if it was something specific to my scene, but I got the same results. I also have Windows 10 on my machine, but I have not tried rendering on that yet. I will try that next and update.

I hope someone can fix this. The improved VRAM usage in 2.78 is amazing! I have a scene where I take advantage of that, but with this bug it's very difficult to render. And going back to 2.77 uses too much VRAM for my GPU :-(

Thank you
-Aaron

Update: I spent some time trying it in Windows 10 and I am having the same issues, but only in 2.78 and 2.78a. 2.77a works just fine in Windows 10 on my system.

Update: I noticed what looks like a crash report in my /tmp/ folder. It's a .txt file. Here are the contents:

Blender 2.78 (sub 0), Commit date: 2016-10-24 12:20, Hash e8299c8

Read library: '/Users/GeeztownProductionsHpro/Library/Application Support/Blender/2.78/scripts/addons/sceneterrain/lib.blend', '//../../../../../../../../../Users/GeeztownProductionsHpro/Library/Application Support/Blender/2.78/scripts/addons/sceneterrain/lib.blend', parent '<direct>' # Info

backtrace

0 blender 0x0000000100bf076a BLI_system_backtrace + 58
1 blender 0x000000010015731a sig_handle_crash + 362
2 libsystem_platform.dylib 0x00007fff9160df1a _sigtramp + 26
3 ??? 0x000000000003a2bb 0x0 + 238267
4 blender 0x00000001010a901b _ZN3ccl14BlenderSession20builtin_image_pixelsERKSsPvPh + 283
5 blender 0x00000001010ea85c _ZN3ccl12ImageManager20file_load_byte_imageIhEEbPNS0_5ImageENS0_13ImageDataTypeERNS_13device_vectorIT_EE + 476
6 blender 0x00000001010e61c8 _ZN3ccl12ImageManager17device_load_imageEPNS_6DeviceEPNS_11DeviceSceneENS0_13ImageDataTypeEiPNS_8ProgressE + 2232
7 blender 0x000000010236d12b _ZN3ccl13TaskScheduler10thread_runEi + 75
8 blender 0x000000010236ef6c _ZN3ccl6thread3runEPv + 28
9 libsystem_pthread.dylib 0x00007fff95bef05a _pthread_body + 131
10 libsystem_pthread.dylib 0x00007fff95beefd7 _pthread_body + 0
11 libsystem_pthread.dylib 0x00007fff95bec3ed thread_start + 13

Update: Here is a .blend file

Steps needed to reproduce: Open file. Make sure GPU is selected for rendering. You may need to change the location the rendered frames will be stored. Click "Animation" to start rendering. I have been getting the error within the first 10 frames or so rendered.
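The steps above go through the UI. A headless run of the same animation could be sketched roughly as follows, using Blender's standard command-line flags (-b, -E, -o, -s, -e, -a). The file name, output pattern, and frame range are hypothetical placeholders, and the sketch assumes GPU rendering is already selected in the saved file and user preferences; it does not set the compute device itself.

```python
def blender_render_command(blend_file, output_pattern, start, end, blender="blender"):
    """Build an argv list for a background animation render.

    Blender handles arguments in the order given, so the output path and
    frame range must appear before -a.
    """
    return [
        blender,
        "-b", blend_file,       # run without the UI and load the .blend file
        "-E", "CYCLES",         # render engine
        "-o", output_pattern,   # where rendered frames are written
        "-s", str(start),       # first frame
        "-e", str(end),         # last frame
        "-a",                   # render the animation
    ]


# Hypothetical file name, output location, and frame range.
cmd = blender_render_command("scene.blend", "//frames/frame_####", 1, 250)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would start the render and raise if Blender exits with a non-zero status.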

Details

Type
Bug

Event Timeline

Sergey Sharybin (sergey) triaged this task as Needs Information from User priority.

We are using GPU rendering quite a lot here in the studio, and do not experience this issue. It might be something specific to the settings you're using.

So please, always follow the bug report guidelines and attach everything requested in there (the smallest possible .blend file and exact steps reproducing the problem are the most crucial ones). This helps us eliminate variables affecting the issue and troubleshoot more efficiently.

What I'm also not sure about is your note about the latest driver being 346.02. It is not the latest one on Windows. So please check if the issue happens with driver version 375.70.

Sergey,

Thank you for your response. I apologize for not following the guidelines, for some reason I was not able to find them when I first created this bug report.

I added a .blend file with steps needed to reproduce the error. I also updated the system and driver information listed.

NVIDIA driver 346.02 is on Mac OS X 10.10.5. I am in fact using the 375.70 driver on Windows 10 though. I made sure it was up to date before I tested it.

Please let me know if there is anything else you need.

Thanks
-Aaron

Aaron (Geeztown) added a comment. Edited Nov 9 2016, 7:08 AM

One other thing I just noticed. I'm using the classified edition of both of my 780ti cards. I don't know if that makes a difference.

I've been rendering fine on Blender 2.77a, but recently I've been experimenting with overclocking my GPUs. The classified edition gives you a switch on the card to switch between two different BIOSes. So I flashed a different BIOS into the secondary BIOS on both cards to boost my clock speed a bit more. Now I'm getting the CUDA error: illegal address in 2.77a as well. And I can't even render 1 frame like that, at least not the more complicated scene I'm working on. The Blenchmark benchmark addon renders just fine, and in 33 seconds! But now I'm getting the CUDA error when I try to render the animation I've been working on.

I also got CUDA error: misaligned address in cuCtxSynchronize.

If I switch back to my primary stock bios, everything is fine again. And I'm still getting the same issue in 2.78a as well.

However, I am also able to render the Blenchmark benchmark test in 2.78a. Just not my more complex scenes it seems.

I wonder if this is related to GPU clock speeds or the stability of my overclock? But if it is, why would it be rendering fine in 2.77a but not 2.78a?

Let me know what you make of this.

Thanks

This comment was removed by Aaron (Geeztown).
Aaron (Geeztown) raised the priority of this task from Needs Information from User to Needs Triage by Developer.

Not reproduced in Win10, 2.78a, with a GeForce GTX 560 Ti.

cuCtxSynchronize just reports the error caused by the kernel invocation just above it.

Nvidia describes this error as:

While executing a kernel, the device encountered a load or store instruction on an invalid memory address. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
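To illustrate why the error is attributed to cuCtxSynchronize rather than to the kernel that actually faulted, here is a toy model in plain Python. It is not the CUDA API: ToyContext, launch, and synchronize are hypothetical names made up for this sketch. It only mimics the behaviour described above: launches are queued and return immediately, the fault shows up at the next synchronization point, and the context is unusable afterwards.

```python
from collections import deque


class IllegalAddressError(RuntimeError):
    """Stand-in for CUDA_ERROR_ILLEGAL_ADDRESS in this toy model."""


class ToyContext:
    """Toy model of an asynchronous device context (not the CUDA API)."""

    def __init__(self, memory_size):
        self.memory = [0] * memory_size
        self.pending = deque()   # kernels queued but not yet executed
        self.poisoned = False    # set once a kernel touches a bad address

    def launch(self, name, address, value):
        # Launching only enqueues work, so a bad address is not detected here.
        if self.poisoned:
            raise IllegalAddressError("context is unusable; create a new one")
        self.pending.append((name, address, value))

    def synchronize(self):
        # Faults from earlier launches surface here, at the sync point.
        if self.poisoned:
            raise IllegalAddressError("context is unusable; create a new one")
        while self.pending:
            name, address, value = self.pending.popleft()
            if not 0 <= address < len(self.memory):
                self.poisoned = True
                self.pending.clear()
                raise IllegalAddressError(
                    f"illegal address in synchronize(), caused by kernel {name!r}"
                )
            self.memory[address] = value


ctx = ToyContext(memory_size=4)
ctx.launch("shade_pixels", 2, 7)    # valid store
ctx.launch("load_texture", 99, 1)   # out-of-range store; launch() still returns
try:
    ctx.synchronize()
except IllegalAddressError as err:
    print(err)
    ctx = ToyContext(memory_size=4)  # the old context cannot be reused
```

In this model, as in the description above, everything allocated in the failed context has to be rebuilt in the new one before rendering can continue.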

Could this be caused by overclocking? Most certainly. I'll look for an sm_35 card to test this. That this works fine on non-sm_35 cards does help us, but it does not rule out a bug in the sm_35 kernel itself. Please do retest with the cards running at a more conservative speed.

Bastien Montagne (mont29) triaged this task as Normal priority.

@Martijn Berger (juicyfruit) any news? Otherwise I would consider closing the report…

Sorry, I've just been busy lately and had relatives visiting from out of town for Thanksgiving. I will test this more this weekend.

I have been letting my computer render an animation for days on end with no issue in 2.77a. The 780tis that I have are the EVGA "classified" version which have a higher stock clock speed than the standard 780ti. I will try flashing a bios with a more conservative speed to my secondary bios on the cards and see if that makes any difference.

Update: Problem Solved!!!

I flashed a BIOS for a standard GTX 780 Ti with a lower clock speed (875 MHz base clock, which is stock on the standard version of the card) and everything seems to work just fine in 2.78a. I let it render an animation for a full day and it never crashed.

It does seem odd that I was only having this problem in 2.78 though. Maybe 2.78 has a lower tolerance for errors? I don't know. For now, the benefits outweigh the costs of overclocking. I don't think I'll be wasting money on any overclocked versions of cards in the future.

Thanks
-Aaron

Bastien Montagne (mont29) claimed this task.

So, it was indeed a hardware issue in the end. :)