[regression] OpenCL performance becomes very random with big scenes.
Open, Normal, Public

Description

System Information
Win7 x64, Vega64, driver 17.10.2

Blender Version
Broken: e8daf2e
Worked: 2.79

Short description of error
When rendering the same frame of the same scene multiple times with OpenCL in the same instance of Blender, render times vary from 100% to >300%. Restarting Blender and rerendering gives the same time as the first render.

Exact steps for others to reproduce the error

I did it on a small border render at low resolution to get about 20 tiles of 64x64 with fur and grass. The first render after (re)starting Blender always took 29 seconds; the following ones were random, between 45 and 102 seconds.
The same test with 2.79 gave 145.6 seconds ±1% on 3 consecutive renders without restart, so the bug appeared after 2.79. Perhaps something was changed in allocation or in the global size calculation?

Details

Type
Bug

Note that in all the above cases system memory is used, as the scene doesn't fit in the dedicated 8 GB of memory. So the problem doesn't seem to come from how the driver splits allocations between dedicated and system memory.
Also, the power limit of the GPU was reduced to ensure no throttling happens; frequencies were stable during all the tests.

mathieu menuet (bliblubli) renamed this task from "OpenCL performance becomes very random with big scenes." to "[regression] OpenCL performance becomes very random with big scenes.". (Sat, Nov 4, 9:49 AM)
Brecht Van Lommel (brecht) triaged this task as Normal priority. (Sat, Nov 4, 6:27 PM)

I thought Victor only rendered on AMD after rBec8ae4d, not yet in the 2.79 release?

In any case, we never added explicit support for using system memory, and leave it totally up to the driver to decide which memory to move where. If it decides to e.g. put image textures in VRAM and keep the BVH or tile render buffers in system memory, that could cause big performance differences.

I guess the first step would be to git bisect to where the problem started. I don't have an AMD card with HBCC support though. Testing with the OpenCL context cache disabled could give some clues as to whether it's something in the context that leaks or has an unintended lasting effect, or whether it's something else.

rBec8ae4d5e9f7 only added support for more than 4GB of textures iirc.
You don't need HBCC support. On Win7, even on Vega, there is no HBCC, and my RX480 has also been rendering the full Victor scene for a year on Windows and for some months on Linux.
@Brecht Van Lommel (brecht) Is there a simple command to disable the context caching?
I could try to bisect, but @Mai Lavelle (maiself) should have better guesses as to what could have introduced this bug. The scene preparation of Victor takes more than 2 minutes on my computer. With compile time on Windows on top, shooting in the dark to bisect would take a lot of time.

Just to give an idea of the mess to bisect:

  • CUDA completely disables OpenCL in the majority of revisions, so you have to rebuild without CUDA
  • device selection changed, so the user preferences have to be modified depending on the revision you test, and bisecting requires going back and forth in time
  • kernel compilation takes 1 min 50 s for Victor
  • scene preparation takes 2 min 04 s

So each step takes about 5 minutes of VS compile, then manual tweaks to the user preferences, then a 2-minute kernel compile, then 2 renders at 2 (scene prep) + 2 (render) = 8 minutes of rendering. That's a quarter of an hour with 4 user interventions between which you can't do much.

So here is my contribution after an hour of work: the bug was already there on 29.09.2017.

The bug was already there on 24.08.2017, so my guess is that rBec8ae4d5e9f7 is the commit we're looking for.

This comment was removed by mathieu menuet (bliblubli).

I got some explanations on IRC; sorry, I didn't know the whole story.

commit b53e35c655d4 already has the bug, so it's not due to the buffer patch.

Actually, 2.79 already has the bug; the official build just had the device selection bug and used the 1080 Ti instead, which doesn't use system memory.
So it may be a driver bug, but then why is the first render always 30 seconds?
After some renders I got up to 114 seconds to render, nearly 3x slower... At that point the GPU was idling a lot, maybe waiting all the time for system memory access?
Here is a picture of the task manager with 2 consecutive renders in the same instance of Blender.


It may be a coincidence, but VS2013 builds had only ±10% variation between the first and consecutive renders (I made 5 of them), while VS2015 builds go crazy with up to 3x the render time.
It would be good if someone could test on Linux with an RX480 to see whether GCC or the Linux driver handles this differently; as said before, the RX480 can render this scene. On Linux the Nvidia drivers break part of the AMD driver installation and I couldn't find a way to have both drivers side by side yet.

It could be interesting to look at the output of GPU-Z to see what the card's memory is actually doing between those two runs.

Thanks for the tests! To be clear I'm not expecting anyone to work on this bug, and if no one else does I'll probably do it at some point, but the work is certainly helpful.

The graph from the task manager is interesting. It is only showing host memory so it doesn't give the whole picture, but it does look like there is no significant host memory leak after the first render. The profiles are similar for both renders, only at the start of the second render there seems to be an extra bump. Perhaps we can spot a corresponding allocation in the output of running with --debug-cycles.

If not I guess it's something internal in the OpenCL driver, for which I don't think there is any debug output we can look at? Maybe it's the driver deciding to migrate some device memory back to host memory, possibly memory that we leaked from the previous render? I couldn't spot any memory leaks in the OpenCL device code, and it's not clear to me what exactly could be hanging around in the context.

Here's a patch to disable the context cache: P555.

Eventually we should probably use clEnqueueMigrateMemObjects to explicitly tell the drivers which buffers should go on the device and which on the host. But I expect there's some other issue going on here.
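
For reference, here is a minimal sketch of what such an explicit migration could look like with the standard OpenCL 1.2 API. The helper name, the buffer handles and the command queue are placeholders, not actual Cycles identifiers; this is only an illustration of the call, not a proposed patch.

#include <CL/cl.h>

/* Sketch only: ask the driver to keep the render buffer on the device and to
 * place a large scene buffer in host memory. The cl_mem handles and the
 * command queue are assumed to exist already. */
static cl_int migrate_buffers_example(cl_command_queue queue,
                                      cl_mem render_buffer,
                                      cl_mem scene_buffer)
{
	/* flags == 0 migrates the object to the device associated with the queue. */
	cl_int err = clEnqueueMigrateMemObjects(queue, 1, &render_buffer,
	                                        0, 0, NULL, NULL);
	if(err != CL_SUCCESS)
		return err;

	/* CL_MIGRATE_MEM_OBJECT_HOST asks for the object to live in host memory. */
	err = clEnqueueMigrateMemObjects(queue, 1, &scene_buffer,
	                                 CL_MIGRATE_MEM_OBJECT_HOST,
	                                 0, NULL, NULL);
	return err;
}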

@LazyDodo (LazyDodo) the GPU-Z log is wrong somehow; it ignores half of the memory. But it gives the impression that no memory leak happens on the GPU.

@Brecht Van Lommel (brecht) here is a log of 2 consecutive renders:

Contrary to GPU-Z, Cycles here reports a different amount of free memory and calculates a very different global size, which is known to impact performance a lot. The strange thing is that the second render reports more free memory (about 4 GB against 1 GB). This results in a bigger global size, which should speed up the rendering, but as most of the data is then in system memory, the GPU waits most of the time.

So it looks like the second time the driver decides, for some reason, to put some buffers in system memory. However, if we compare with the task manager graph, the difference in memory usage between the first and second render is more in the +8 GB range, while Cycles reports only 3 GB more (from 1 to 4 GB) as free on the GPU...

If someone has a direct line to the AMD driver team, it would be great to tell them about this bug.

@Brecht Van Lommel (brecht) thanks for P555. I tried it but the bug is still there.

Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by max_buffer_size and num_elements, which from the logs don't appear to change. Yet the global size is reported as being different.

In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see D2056 for difficulties with that, for the split kernel it gets more complicated).
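
For illustration only, the kind of calculation being described looks roughly like this. This is a simplified sketch, not the actual Cycles code; the names and the rounding step are assumptions.

/* Simplified sketch: derive the split kernel global size from the largest
 * single allocation the device allows and the per-thread state size, rounded
 * down to a multiple of the work-group size. Free device memory does not
 * enter this calculation. */
static size_t split_kernel_global_size_sketch(size_t max_buffer_size,
                                              size_t state_size_per_element,
                                              size_t local_size)
{
	size_t num_elements = max_buffer_size / state_size_per_element;
	num_elements -= num_elements % local_size;
	return num_elements;
}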

Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by max_buffer_size and num_elements, which from the logs don't appear to change. Yet the global size is reported as being different.

Yes, I used another version to get the free memory reported and tried to see if limiting the global size to make everything fit in memory would solve the problem, but it didn't. I can redo the log with vanilla master if you want. Here is the code:

		VLOG(1) << "Maximum device allocation size: "
		        << string_human_readable_number(max_buffer_size) << " bytes. ("
		        << string_human_readable_size(max_buffer_size) << ").";

		/* Limit to 2gb, as we shouldn't need more than that and some devices may support much more. */
		// max_buffer_size = min(max_buffer_size / 2, (cl_ulong)2l*1024*1024*1024);

		// size_t num_elements = max_elements_for_max_buffer_size(kg, data, max_buffer_size);
		cl_ulong free_mem_amd = 0;
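		/* CL_DEVICE_GLOBAL_FREE_MEMORY_AMD reports free device memory in KBytes,
		 * hence the conversion to bytes below before clamping max_buffer_size. */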
		if(clGetDeviceInfo(device->cdDevice, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, sizeof(cl_ulong), &free_mem_amd, NULL) == CL_SUCCESS) {
			free_mem_amd *= 1024;
			VLOG(1) << "Free mem AMD: "
			        << string_human_readable_number(free_mem_amd) << " bytes. ("
			        << string_human_readable_size(free_mem_amd) << ").";
			if(max_buffer_size > free_mem_amd) {
				max_buffer_size = free_mem_amd;
			}
		}

The code is from @Hristo Gueorguiev (nirved).

> In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see D2056 for difficulties with that, for the split kernel it gets more complicated).

Each scene will certainly have its own optimal memory layout. Some scenes have very simple materials/textures but a heavy BVH, some the opposite, etc. Wouldn't it be possible to do something a bit like PGO: have an "Optimize" button that renders the scene in a special mode, placing the different buffers/textures in different locations and saving the timings of the different layouts. The fastest layout would then be written somewhere in the scene's custom data and used for all subsequent renders of that scene. Of course it may have to be updated later if big changes are made, but most of the time it would be run once before sending the scene to the render farm.

Ok, if the code was modified then indeed a new log would be useful.

PGO gives a poor user experience and is impractical; there are too many combinations to test. We can almost certainly find automatic algorithms that are good enough, just no one has tried yet. For example something like P556 or some variation of it could help keep working memory on the device. Still, I don't think we understand the actual issue here, so it's difficult to know what the fix is.

It is not clear how to interpret CL_DEVICE_GLOBAL_FREE_MEMORY_AMD exactly. For example the OS or OpenGL might be using some device memory which the driver can migrate to the host (or discard) to make room for running the OpenCL kernel. So if the driver does that kind of thing, then the second run of the OpenCL kernel may report more free memory, after all memory from the first run was freed. But it doesn't necessarily mean that more memory is actually available.
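
As a side note, here is a minimal sketch of querying that counter defensively. This is not Cycles code; it only queries when the cl_amd_device_attribute_query extension is actually exposed, and it mirrors the KBytes interpretation used in the snippet above, keeping in mind that the result is a driver-side estimate rather than a guarantee of what can be allocated.

#include <string>
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* CL_DEVICE_GLOBAL_FREE_MEMORY_AMD */

/* Sketch only: query the AMD free-memory counter when the vendor extension
 * is present, converting the reported KBytes to bytes. */
static bool query_free_memory_amd(cl_device_id device, cl_ulong *r_free_bytes)
{
	size_t size = 0;
	if(clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &size) != CL_SUCCESS)
		return false;

	std::string extensions(size, '\0');
	if(clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, &extensions[0], NULL) != CL_SUCCESS)
		return false;

	if(extensions.find("cl_amd_device_attribute_query") == std::string::npos)
		return false;

	cl_ulong free_kb = 0;
	if(clGetDeviceInfo(device, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD,
	                   sizeof(free_kb), &free_kb, NULL) != CL_SUCCESS)
		return false;

	*r_free_bytes = free_kb * 1024;
	return true;
}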

@Brecht Van Lommel (brecht) thanks for the patch. Latest master with it (I had to apply it manually, as it seems the patch was made on a branch?) gives this log on 3 consecutive renders:

And this is the log with the latest buildbot:

P556 seems to limit the slowdown to about 68 seconds from 48, while the latest buildbot 8a72be7 goes up to 78 seconds from 45 and its slowdown grows with each new render.

I rechecked with VS2013 builds. The system memory usage varies a bit (max 500 MB, compared to many GB with VS2015) and the performance is also more stable (max 35% variation over 10 renders).
Could someone confirm this behaviour on Windows and test on Linux?