We assume rendering happens on a device where we can't directly manipulate the memory or call functions, so all communication needs to go through the Device interface.
We've got a few device backends:
- CPU: this device will render on the same CPU, with multithreading.
- CUDA: render on an NVIDIA GPU
- OptiX: render on an NVIDIA GPU, using hardware ray-tracing
- HIP: render on an AMD GPU
- oneAPI: render on an Intel GPU
- Metal: render on an Apple GPU (macOS)
- Multi: balance rendering on multiple devices (GPU+GPU or CPU+GPU)
These devices have methods to:
- Query device information
- Allocate, copy and free memory
- Build BVHs
- Execute kernels (in a queue)
- Display renders quickly via OpenGL interop
- Denoise with native APIs
There are a few differences between CPU and GPU devices:
- CPU devices do not have a kernel execution queue
- OpenShadingLanguage is only supported on CPUs currently
Different types of memory are allocated on devices using a few utility classes:
- device_only_memory: memory that resides only on the device and is never read by the CPU host, typically working memory for kernels.
- device_vector: equivalent of std::vector, for memory that is shared between CPU and GPU.
- device_texture: 2D or 3D image texture, using native GPU texture handles.
Memory must be explicitly copied to and from devices; unified memory is not currently used.
By default, memory operations are performed synchronously on the default GPU queue (or stream). This is used for allocating scene memory and render buffers.
For kernel scheduling, memory allocation and copying should be performed on the GPU queue used for kernel execution. This ensures operations are properly synchronized, and can be performed asynchronously for better performance.
For historical reasons, some memory is encoded in vectors with types like float4 even though a structure would be clearer. When refactoring an area, this can be changed to structs.
Host Memory Fallback
GPU devices typically have less memory than a CPU. Scene memory can be automatically moved to host memory for this reason, which allows rendering bigger scenes with slower memory access.
device_texture memory can be moved to the host. Other working memory is assumed to require fast access and must be in GPU memory.
GPUs have dedicated hardware for interpolated texture lookups. For this reason, each device implements its own texture sampling to take advantage of it. This mechanism is used for both 2D and dense 3D textures.
Sparse 3D textures are stored and sampled with NanoVDB, also with a device-specific implementation.
Multi device abstracts memory allocation and BVH building over multiple devices, multiplexing calls to all devices. Multiple GPUs of the same type can share memory, either with peer-to-peer access or using a host memory fallback.
Kernel execution, on the other hand, is not abstracted, and each device must be handled individually. The integrator/ module handles scheduling kernel execution, associated memory allocation and denoising over multiple devices.