Cycles: Refactor split kernel and implement for CPU
AbandonedPublic

Authored by Mai Lavelle (maiself) on Nov 11 2016, 2:13 AM.

Details

Reviewers
Brecht Van Lommel (brecht)
Group Reviewers
Cycles
Summary

Turns out that this is more code than I thought it was, so submitting it for review now before things get further along.

This patch refactors/cleans up the split kernel to make it more generic and allow it to be implemented for other devices besides OpenCL. Implementation for CPU devices is included has the same functionality as the OpenCL split kernel. This mainly allows for better debugging of the split kernel, but also has other benefits such as build time compiler errors.

Main things changed:

  • Use one large buffer for all data shared between the kernels instead of many buffers that need to passed around everywhere
  • All kernels now have the same signature, and all kernel logic has been moved from *.cl files into reusable .h files in kernel/split/
  • New class DeviceSplitKernel that holds all host side logic for the split kernel
  • Automatic tile splitting has been removed. The code was a bit messy and couldn't be made generic easily. Unfortunately this means if the tile size is set too large the system can run out of memory (or even hang? need to investigate more)

Would appreciate any comments on style/naming, some of this could probably be rearranged better.

I mentioned this in the commits, but I'll mention it here as well, the split kernel on CPU is a bit slower than the mega kernel (~13% with BMW), this will be investigated later.

Please check the branch for further details.

Diff Detail

Mai Lavelle (maiself) retitled this revision from to Cycles: Refactor split kernel and implement for CPU.Nov 11 2016, 2:13 AM
Mai Lavelle (maiself) updated this object.
Mai Lavelle (maiself) added a reviewer: Cycles.
Brecht Van Lommel (brecht) requested changes to this revision.Nov 12 2016, 6:50 PM

To me this seems like the right direction, but I don't yet understand all the code or implications well enough. Is there a high level design document somewhere with the plan?

For now just minor comments from reading over the code.

intern/cycles/blender/blender_sync.cpp
646–657

Was this intended to be part of the patch? If so will it be removed or will there be a comment?

intern/cycles/device/device.h
216–217

Would use sizeof(global_size) and sizeof(local_size).

269–280

Could this be wrapped in a mem.resize(size) method on device_memory?

295

Can these be private methods? I would like it to be clear these are not part of the public device API.

intern/cycles/device/device_cpu.cpp
234

I rather have explicit Device::mem_alloc where it's needed instead of using.

707–730

Is there some way we can avoid this? From what I can tell we will not get any compile error now if these parameters are out of sync with the function definition.

intern/cycles/device/device_memory.h
216–220

Some of these lines can be removed now since device_memory already does the initialization.

intern/cycles/device/device_split_kernel.cpp
25

Can we put this in util_types.h as a round_up() function, next to align_up()?

intern/cycles/device/opencl/opencl.h
252

This seems rather confusing to me.

intern/cycles/kernel/split/kernel_split_data.h
157–158

Could we make this work like kernel_split_params.x and kernel_split_state. per_sample_output_buffers, similar to kernel_data?

intern/cycles/util/util_memory.h
24 ↗(On Diff #7783)

As I understand it @Sergey Sharybin (sergey) is not a proponent of using this:
https://lists.blender.org/pipermail/bf-committers/2016-October/047730.html

I think unique_ptr is ok, it's the other smart pointer types we need to be careful with, but also don't really care that much either way.

This requires building with C++11 also which is still not ready on Linux as far as I know.

This revision now requires changes to proceed.Nov 12 2016, 6:50 PM

Changing the behavior of atomic_inc is misleading, both CUDA & OpenCL return the old value. Further, using atomic_inc_uint32 and atomic_add_uint32 produces less readable code, since we have to subtract what was just added. Maybe there should be indication in atomic_ops.h, i.e. atomic_add_and_get_uint32.

intern/cycles/util/util_atomic.h
37

Should be

#define atomic_inc(p) (atomic_add_uint32((p), 1) - 1)
#define atomic_add((x) (atomic_add_uint32((p), (x)) - (x))

Changing the behavior of atomic_inc is misleading, both CUDA & OpenCL return the old value. Further, using atomic_inc_uint32 and atomic_add_uint32 produces less readable code, since we have to subtract what was just added. Maybe there should be indication in atomic_ops.h, i.e. atomic_add_and_get_uint32.

This is indeed a confusing situation, Blender's atomics return the new value while most other libraries return the old. Really we just need to choose which direction we want and keep consistent. I think the decision should probably be left to @Brecht Van Lommel (brecht) and @Sergey Sharybin (sergey). Changing the code again isn't a bit deal.

intern/cycles/blender/blender_sync.cpp
646–657

Currently the debug_tile_size is really large and without automatic tile splitting running out of memory is very likely, so for now we just use the user supplied tile size. This can be reenabled after we find a solution to the memory issues. I can add this as a comment.

intern/cycles/device/device.h
269–280

No, device_memory has no way of allocating the underlying buffer, we'd have to make device_memory aware of Device instances for that. I'm also not sure I like this method, but at the time I couldn't think of anything better. I'd be fine with alternative solutions if you have other suggestions.

295

These are called by DeviceSplitKernel so at the very least they need to be visible to that class.

Maybe things could be rearranged so that we have a virtual base class that has these methods, and then any device that implements them can derive from that class in addition to Device? Then you would have clear separation, and DeviceSplitKernel (maybe to be renamed?) could take a pointer of that new class instead of Device. What do you think?

intern/cycles/device/device_cpu.cpp
234

This is needed because mem_alloc is now overloaded and c++ rules for inheritance by default hide all overloads when one is redefined in a derived class. Probably things could be made more clear by choosing a better name for the overload and make it its own method. Maybe something like mem_alloc_device_only?

707–730

There's two things we could do, one is to create a new file just for declaring the function signature, the other is to use a macro to disable the function body in kernel/split/kernel_data_init.h and include that to get the signature. Any preference?

intern/cycles/device/opencl/opencl.h
252

Same as the other use of this, see other note.

intern/cycles/kernel/split/kernel_split_data.h
157–158

Sure, that looks much better.

intern/cycles/util/util_memory.h
24 ↗(On Diff #7783)

I'd rather have unique_ptr than the likely possibility of forgetting to free something. C++11 is working fine on Arch, maybe other distros still need work, @Sergey Sharybin (sergey) any update on this?

In Blender's atomic_ops.h we have two semantics: atomic_add_uint32()and atomic_fetch_and_add_uint32(). First one will perform addition and return new value, latter one will fetch the value and perform addition. We can perhaps make it more obvious name for the first one and call it atomic_add_and_fetch_uint32() (side note: we add and uint32 are just example,s this covers all operations and data types).

Will discuss it briefly with @Bastien Montagne (mont29) and probably just perform a rename in master.

New atomic naming and extra atomic utilities are in the master branch now.

In theory C++11 should be working fine on Linux, but keep in mind, master branch stays at C++0x and not moving to mandatory C++11 any time soon. Only blender2.8 branch will move to it sooner than later.

Something like this should solve the issue for the time being:

1​/*
2​ * Copyright 2011-2016 Blender Foundation
3​ *
4​ * Licensed under the Apache License, Version 2.0 (the "License");
5​ * you may not use this file except in compliance with the License.
6​ * You may obtain a copy of the License at
7​ *
8​ * http://www.apache.org/licenses/LICENSE-2.0
9​ *
10​ * Unless required by applicable law or agreed to in writing, software
11​ * distributed under the License is distributed on an "AS IS" BASIS,
12​ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13​ * See the License for the specific language governing permissions and
14​ * limitations under the License.
15​ */
16
17​#ifndef __UTIL_UNIQUE_PTR_H__
18​#define __UTIL_UNIQUE_PTR_H__
19
20​#if (__cplusplus > 199711L) || (defined(_MSC_VER) && _MSC_VER >= 1800)
21​# include <memory>
22​using std::unique_ptr;
23​#else
24​# include <boost/move/unique_ptr.hpp>
25​using boost::movelib::unique_ptr;
26​#endif
27
28​#endif /* __UTIL_FOREACH_H__ */

Now, about whether to use or not smart pointers. Keep in mind, Blender itself is C-based, with it's strict explicit memory management. Trying to wrap this data into smart points is worst idea ever and will be banned for the indefinite amount of time.

Here it's more tricky situation because it's a separate library/project which could have own rules. That being said, we do use scoped_ptr in Libmv. So here it's more a question about policy we accept in Cycles.

Personally, i'm not really fan of using unique_ptr just to avoid possible leak when forgetting to free the memory. In Blender we use guarded allocation for that and i would not mind if we were using it in more places in Cycles. Not only this gives better idea what the memory is spent to, but also gives nice
human readable report on blender exit. (Doing things like automatic memory managment makes it a bit more tricky to do proper guarded allocation btw. Maybe not in this instance, but in general).

Additionally my worry here is that kernel might need to be freed from within a given context which might be wrong in a destructor code.

intern/cycles/device/device_cpu.cpp
140

This is C++11 as far as i concerned.

669

Picky: kg != NULL.

Updated with a bunch of the requested changes:

  • redid atomics
  • bunch of style fixes and clean up
  • renamed split_state to kernel_split_state
  • removed Device::mem_alloc overload
  • ensured we get errors if data_init parameters fall out of sync
  • removed use of unique_ptr

From the code POV seems fine to me now.

Guess (apart from the already mentioned note about viewport's tile size) most important question to answer now is: what is the expected user-side difference?

intern/cycles/blender/blender_sync.cpp
646–657

Guess that would be a biggest concern for master. It's most likely fine, but IMO better to apply separately.

intern/cycles/device/device_cpu.cpp
161

Don't use empty line at the beginning/ending of the block.

272

Shouldn't we be using aligned alloc?

Aside from the viewport tile size there should be no difference for users from this diff, this is mostly for better debugging and as a cleaner foundation for future work.

I've modified the if statement for the viewport stuff so that it only affects the split kernel. A proper solution will come from the tile rework branch.

Is this finally ready for master?

Changes:

  • merged in current master
  • aligned alloc
  • changed viewport tile size condition for split kernel only
  • removed blank line

If some improvement visible to user (in speed or functionnality like SSS and volumetrics) can be reached for 2.79 with this new code, I would add it to master.
If benefits are only for programmers, I would add it to 2.8. Otherwise, it would just be risking breaking rendering on different configurations, cluttering the bug tracker more than it is already and maybe releasing a 2.79 (which may be used over a long period by many users) that will not be stable.

I would not scatter Cycles development at this point, either we move all Cycles development to the 2.8 branch (which I wouldn't do just yet) or we stick to master. Putting some things into the branch while keeping maintenance and smaller changes in master will only ask for trouble later. Also we still have a few months until a 2.79 release, enough time to fix issues when they come up.

Using the latest version in the branch ( 681089237e33d2fe4b1c46e4a8368c686e60be34 ) I get the following results on 1rx 480 on windows 7 with driver 17.2.1 (pure render times):

  • Barcelona scene at 100spp:
    • 105s compared to 103s with master (2% slower than master) compared to 73s with full selective node compilation (which is 43% faster).
  • BMW scene at 35²spp:
    • 228s compared to 223s on master (also 2% slower) compared to 190s with full selective node compilation (which is 20% faster)

Test with master where made with auto tile size (1 tile) and test with your build where made with 256x256 and 128x128 and I took the best time of both to reflect the performance we should get when auto tile size will be back.

Closing as these changes are already in master.