We assume rendering happens on a device where we can't directly manipulate the memory or call functions, so all communication needs to go through the Device interface.
We've got a few device backends:
- CPU: this device will render on the same CPU, with multithreading.
- CUDA: render on an NVIDIA GPU
- OptiX: render on an NVIDIA GPU, using hardware ray-tracing
- HIP: render on an AMD GPU
- oneAPI: render on an Intel GPU
- Metal: render on an Apple GPU (macOS)
- Multi: balance rendering on multiple devices (GPU+GPU or CPU+GPU)
These devices have methods to:
- Query device information
- Allocate, copy and free memory
- Build BVHs
- Execute kernels (in a queue)
- Display renders quickly via OpenGL interop
- Denoise with native APIs
There are a few differences between CPU and GPU devices:
- CPU devices do not have a kernel execution queue
- OpenShadingLanguage is only supported on CPUs currently
Different types of memory are allocated on devices using a few utility classes:
- device_only_memory: memory that resides only on the device and is never read by the CPU host, typically working memory for kernels.
- device_vector: equivalent of std::vector, for memory that is shared between CPU and GPU.
- device_texture: 2D or 3D image texture, using native GPU texture handles.
Memory must be explicitly copied to and from devices; unified memory is not currently used.
By default, memory operations are performed synchronously on the default GPU queue (or stream). This is used for allocating scene memory and render buffers.
For kernel scheduling, memory allocation and copying should be performed on the GPU queue used for kernel execution. This ensures operations are properly synchronized, and can be performed asynchronously for better performance.
For historical reasons, some memory is encoded in vectors with types like float4 even though a structure would be clearer. When refactoring an area, this can be changed to structs.
Host Memory Fallback
GPU devices typically have less memory than a CPU. Scene memory can be automatically moved to host memory for this reason, which allows rendering bigger scenes with slower memory access.
device_texture memory can be moved to the host. Other working memory is assumed to require fast access and must be in GPU memory.
GPUs have dedicated hardware for interpolated texture lookups. For this reason, each device implements its own texture sampling to take advantage of it. This mechanism is used for both 2D and dense 3D textures.
Sparse 3D textures are stored and sampled with NanoVDB, also with a device-specific implementation.
Multi device abstracts memory allocation and BVH building over multiple devices, multiplexing calls to all devices. Multiple GPUs of the same type can share memory, either with peer-to-peer access or using a host memory fallback.
Kernel execution, on the other hand, is not abstracted, and each device must be handled individually. The integrator/ module handles scheduling kernel execution, associated memory allocation and denoising over multiple devices.