[regression] OpenCL performance becomes very random with big scenes.
Open, Normal, Public

Description

System Information
Win7 x64, Vega64, driver 17.10.2

Blender Version
Broken: e8daf2e
Worked: 2.79

Short description of error
When rendering the same frame of the same scene multiple times with OpenCL in the same instance of Blender, render times vary from 100% to >300%. Restarting Blender and rerendering gives the same time as the first render.

Exact steps for others to reproduce the error

I did it on a small border render at low resolution to get about 20 tiles of 64x64 with fur and grass. The first render after (re)starting Blender always took 29 seconds; the following ones were random, between 45 and 102 seconds.
The same test with 2.79 gave 145.6 seconds ±1% on 3 consecutive renders without restart, so the bug appeared after 2.79. Perhaps something was changed in allocation or in the global size calculation?

Details

Type
Bug

Note that in all the above cases system memory is used, as the scene doesn't fit in the dedicated 8 GB of memory. So the problem doesn't seem to come from how the driver splits allocations between dedicated and system memory.
Also, the power limit of the GPU was reduced to ensure no throttling happens; frequencies were stable during all the tests.

mathieu menuet (bliblubli) renamed this task from "OpenCL performance becomes very random with big scenes." to "[regression] OpenCL performance becomes very random with big scenes.". (Sat, Nov 4, 9:49 AM)
Brecht Van Lommel (brecht) triaged this task as Normal priority. (Sat, Nov 4, 6:27 PM)

I thought Victor only rendered on AMD after rBec8ae4d, not yet in the 2.79 release?

In any case, we never added explicit support for using system memory, and leave it totally up to the driver to decide which memory to move where. If it decides to e.g. put image textures in VRAM and keep the BVH or tile render buffers in system memory, that could cause big performance differences.

I guess the first step would be to git bisect to where the problem started. I don't have an AMD card with HBCC support though. Testing with the OpenCL context cache disabled could give some clues as to whether it's something in the context that leaks or has an unintended lasting effect, or whether it's something else.

rBec8ae4d5e9f7 only added support for more than 4GB of textures iirc.
You don't need HBCC support. On Win7, even on Vega, there is no HBCC, and my RX480 has also been rendering the full Victor scene for a year on Windows and for some months on Linux.
@Brecht Van Lommel (brecht) Is there a simple command to disable the context caching?
I could try to bisect, but @Mai Lavelle (maiself) should have better guesses as to what could have introduced this bug. The scene preparation of Victor takes more than 2 minutes on my computer. With compile time on Windows on top, shooting in the dark to bisect would take a lot of time.

Just to give an idea of the mess to bisect:

  • CUDA completely disables OpenCL in the majority of revisions, so you have to rebuild without CUDA
  • device selection changed, so the user preferences have to be modified depending on the revision you test, and bisecting requires going back and forth in time
  • kernel compilation takes 1 min 50 s for Victor
  • scene preparation takes 2 min 04 s

So each step takes about 5 minutes of VS compile, then manual tweaks to the user preferences, then a 2-minute kernel compile, then 2 renders at 2 (scene prep) + 2 (render) = 8 minutes of rendering. That's a quarter of an hour with 4 user interventions between which you can't do much.

So here is my contribution after an hour of work: the bug was already there on 29.09.2017.

The bug was already there on 24.08.2017, so my guess is that rBec8ae4d5e9f7 is the commit we're looking for.

This comment was removed by mathieu menuet (bliblubli).

I got some explanations on IRC; sorry, I didn't know the whole story.

commit b53e35c655d4 already has the bug, so it's not due to the buffer patch.

Actually, 2.79 already has the bug; the official build just had the device selection bug and used the 1080 Ti instead, which doesn't use system memory.
So it may be a driver bug, but then why is the first render always 30 seconds?
After some renders I got up to 114 seconds to render, nearly 3x slower... At that point the GPU was idling a lot, maybe waiting all the time for system memory access?
Here is a picture of the task manager with 2 consecutive renders in the same instance of Blender.


It may be a coincidence, but VS2013 builds had only ±10% variation between the first and consecutive renders (I made 5 of them), while VS2015 builds go crazy with up to 3x the render time.
It would be good if someone could test on Linux with an RX480 to see whether GCC or the Linux driver handles this differently; as said before, the RX480 can render this scene. On Linux the Nvidia drivers break part of the AMD driver installation and I couldn't find a way to have both drivers side by side yet.

It could be interesting to look at the output of GPU-Z to see what the card's memory is actually doing between those two runs.

Thanks for the tests! To be clear I'm not expecting anyone to work on this bug, and if no one else does I'll probably do it at some point, but the work is certainly helpful.

The graph from the task manager is interesting. It is only showing host memory so it doesn't give the whole picture, but it does look like there is no significant host memory leak after the first render. The profiles are similar for both renders, only at the start of the second render there seems to be an extra bump. Perhaps we can spot a corresponding allocation in the output of running with --debug-cycles.

If not I guess it's something internal in the OpenCL driver, for which I don't think there is any debug output we can look at? Maybe it's the driver deciding to migrate some device memory back to host memory, possibly memory that we leaked from the previous render? I couldn't spot any memory leaks in the OpenCL device code, and it's not clear to me what exactly could be hanging around in the context.

Here's a patch to disable the context cache: P555.

Eventually we should probably use clEnqueueMigrateMemObjects to explicitly tell the drivers which buffers should go on the device and which on the host. But I expect there's some other issue going on here.
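
For reference, here is a minimal sketch of what such an explicit migration could look like with the standard OpenCL 1.2 API. The helper name, the buffer handles and the command queue are placeholders, not actual Cycles identifiers; this is only an illustration of the call, not a proposed patch.

#include <CL/cl.h>

/* Sketch only: ask the driver to keep the render buffer on the device and to
 * place a large scene buffer in host memory. The cl_mem handles and the
 * command queue are assumed to exist already. */
static cl_int migrate_buffers_example(cl_command_queue queue,
                                      cl_mem render_buffer,
                                      cl_mem scene_buffer)
{
	/* flags == 0 migrates the object to the device associated with the queue. */
	cl_int err = clEnqueueMigrateMemObjects(queue, 1, &render_buffer,
	                                        0, 0, NULL, NULL);
	if(err != CL_SUCCESS)
		return err;

	/* CL_MIGRATE_MEM_OBJECT_HOST asks for the object to live in host memory. */
	err = clEnqueueMigrateMemObjects(queue, 1, &scene_buffer,
	                                 CL_MIGRATE_MEM_OBJECT_HOST,
	                                 0, NULL, NULL);
	return err;
}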

@LazyDodo (LazyDodo) the GPU-Z log is wrong somehow; it ignores half of the memory. But it gives the impression that no memory leak happens on the GPU.

@Brecht Van Lommel (brecht) here is a log of 2 consecutive renders:

Contrary to GPU-Z, Cycles here reports a different amount of free memory and calculates a very different global size, which is known to impact performance a lot. The strange thing is that the second render reports more free memory (about 4 GB against 1 GB). This results in a bigger global size, which should speed up the rendering, but as most of the data is then in system memory, the GPU waits most of the time.

So it looks like the second time the driver decides, for some reason, to put some buffers in system memory. However, if we compare with the task manager graph, the difference in memory usage between the first and second render is more in the +8 GB range, while Cycles reports only 3 GB more (from 1 to 4 GB) as free on the GPU...

If someone has a direct line to the AMD driver team, it would be great to tell them about this bug.

@Brecht Van Lommel (brecht) thanks for P555. I tried it but the bug is still there.

Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by max_buffer_size and num_elements, which from the logs don't appear to change. Yet the global size is reported as being different.

In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see D2056 for difficulties with that, for the split kernel it gets more complicated).
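
For illustration only, the kind of calculation being described looks roughly like this. This is a simplified sketch, not the actual Cycles code; the names and the rounding step are assumptions.

/* Simplified sketch: derive the split kernel global size from the largest
 * single allocation the device allows and the per-thread state size, rounded
 * down to a multiple of the work-group size. Free device memory does not
 * enter this calculation. */
static size_t split_kernel_global_size_sketch(size_t max_buffer_size,
                                              size_t state_size_per_element,
                                              size_t local_size)
{
	size_t num_elements = max_buffer_size / state_size_per_element;
	num_elements -= num_elements % local_size;
	return num_elements;
}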

Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by max_buffer_size and num_elements, which from the logs don't appear to change. Yet the global size is reported as being different.

Yes, I used another version to get the free memory reported and tried to see if limiting the global size to make everything fit in memory would solve the problem, but it didn't. I can redo the log with vanilla master if you want. Here is the code:

		VLOG(1) << "Maximum device allocation size: "
		        << string_human_readable_number(max_buffer_size) << " bytes. ("
		        << string_human_readable_size(max_buffer_size) << ").";

		/* Limit to 2gb, as we shouldn't need more than that and some devices may support much more. */
		// max_buffer_size = min(max_buffer_size / 2, (cl_ulong)2l*1024*1024*1024);

		// size_t num_elements = max_elements_for_max_buffer_size(kg, data, max_buffer_size);
		cl_ulong free_mem_amd = 0;
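		/* CL_DEVICE_GLOBAL_FREE_MEMORY_AMD reports free device memory in KBytes,
		 * hence the conversion to bytes below before clamping max_buffer_size. */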
		if(clGetDeviceInfo(device->cdDevice, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, sizeof(cl_ulong), &free_mem_amd, NULL) == CL_SUCCESS) {
			free_mem_amd *= 1024;
			VLOG(1) << "Free mem AMD: "
			        << string_human_readable_number(free_mem_amd) << " bytes. ("
			        << string_human_readable_size(free_mem_amd) << ").";
			if(max_buffer_size > free_mem_amd) {
				max_buffer_size = free_mem_amd;
			}
		}

The code is from @Hristo Gueorguiev (nirved).

> In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see D2056 for difficulties with that, for the split kernel it gets more complicated).

Each scene will certainly have its own optimal memory layout. Some scenes have very simple materials/textures but a heavy BVH, some the opposite, etc. Wouldn't it be possible to do something a bit like PGO: have an "Optimize" button that renders the scene in a special mode, placing the different buffers/textures in different locations and saving the timings of the different layouts. The fastest layout would then be written somewhere in the scene's custom data and used for all subsequent renders of that scene. Of course it may have to be updated later if big changes are made, but most of the time it would be run once before sending the scene to the render farm.

Ok, if the code was modified then indeed a new log would be useful.

PGO gives a poor user experience and is impractical; there are too many combinations to test. We can almost certainly find automatic algorithms that are good enough, just no one has tried yet. For example something like P556 or some variation of it could help keep working memory on the device. Still, I don't think we understand the actual issue here, so it's difficult to know what the fix is.

It is not clear how to interpret CL_DEVICE_GLOBAL_FREE_MEMORY_AMD exactly. For example the OS or OpenGL might be using some device memory which the driver can migrate to the host (or discard) to make room for running the OpenCL kernel. So if the driver does that kind of thing, then the second run of the OpenCL kernel may report more free memory, after all memory from the first run was freed. But it doesn't necessarily mean that more memory is actually available.
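
As a side note, here is a minimal sketch of querying that counter defensively. This is not Cycles code; it only queries when the cl_amd_device_attribute_query extension is actually exposed, and it mirrors the KBytes interpretation used in the snippet above, keeping in mind that the result is a driver-side estimate rather than a guarantee of what can be allocated.

#include <string>
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* CL_DEVICE_GLOBAL_FREE_MEMORY_AMD */

/* Sketch only: query the AMD free-memory counter when the vendor extension
 * is present, converting the reported KBytes to bytes. */
static bool query_free_memory_amd(cl_device_id device, cl_ulong *r_free_bytes)
{
	size_t size = 0;
	if(clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &size) != CL_SUCCESS)
		return false;

	std::string extensions(size, '\0');
	if(clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, &extensions[0], NULL) != CL_SUCCESS)
		return false;

	if(extensions.find("cl_amd_device_attribute_query") == std::string::npos)
		return false;

	cl_ulong free_kb = 0;
	if(clGetDeviceInfo(device, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD,
	                   sizeof(free_kb), &free_kb, NULL) != CL_SUCCESS)
		return false;

	*r_free_bytes = free_kb * 1024;
	return true;
}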

@Brecht Van Lommel (brecht) thanks for the patch. Latest master with it (I had to apply it manually, as it seems the patch was made on a branch?) gives this log on 3 consecutive renders:

And this is the log with the latest buildbot:

P556 seems to limit the slowdown to about 68 seconds from 48, while the latest buildbot 8a72be7 goes up to 78 seconds from 45 and its slowdown grows with each new render.

I rechecked with VS2013 builds. The system memory usage varies a bit (max 500 MB, compared to many GB with VS2015) and the performance is also more stable (max 35% variation over 10 renders).
Could someone confirm this behaviour on Windows and test on Linux?