Cycles: Add support for bindless textures.

Authored by Thomas Dinges (dingto) on May 17 2016, 2:30 PM.



This adds support for CUDA Texture objects (also known as Bindless textures) for Kepler GPUs (Geforce 6xx and above).
This is used for all 2D/3D textures, data still uses arrays as before.

User benefits:

  • No more limits of image textures on Kepler. We had 5 float4 and 145 byte4 slots there before, now we have 1024 float4 and 1024 byte4. This can be extended further if we need to (just change the define).
  • Single channel textures slots (byte and float) are now supported on Kepler as well (1024 slots for each type).

ToDo / Issues:

  • 3D textures don't work yet, at least don't show up during render. I have no idea whats wrong yet.
  • Dynamically allocate bindless_mapping array?

I hope Fermi still works fine, but that should be tested on a Fermi card before pushing to master.

Part of my GSoC 2016.

Diff Detail

rB Blender
Thomas Dinges (dingto) retitled this revision from to Cycles: Add support for bindless textures..May 17 2016, 2:30 PM
Thomas Dinges (dingto) added a reviewer: Cycles.
Thomas Dinges (dingto) updated this object.
Sergey Sharybin (sergey) requested changes to this revision.May 17 2016, 5:07 PM

This is annoying to expose device internals to a Scene. Idea:

Have binding mapping in the DeviceCuda, and do something like:

class DeviceCuda : public Device {


  void tex_alloc(const char *name) {
     if (name.starts_with("__tex_image_")) {
       int slot = /* Parse thing like we do for 3d textures. */;
       bindless_mapping[slot] = tex;

  /* Alternatively we can have separate call to sync device-internal stuff to the actual device. */
  void path_trace() {
    /* Sync bindless_mapping to the device. */


  /* Mapping from image slot to a texture object. */
  CUtexObject bindless_mapping[4096];

Other issue here is that you're allocating texture objects, but i don't see where you're freeing them. You have to call cuTexObjectDestroy() when image texture is freeing.

The issue with 3D textures _might_ be related on the wrangler. When building without wrangler it all seems to be working correct. You can test that with WITH_CUDA_DYNLOAD set to OFF. From quick glance not sure what we might be doing wrong there, there's no device/compiler specific stuff in headers.

It might also be a red herring, since compiling in debug mode or with stack protector enabled also solves the issue here.


it's a bit unclear and misleading. Better to have has_bindless_textures or supports_bindless_textures. Otherwise:

a) In some time it'll be unclear why kepler is so special, you'll have constantly remember that it was first card which supported bindless textures.
b) Your Pascal card has bindless textures, but it's not a Kepler apparently.


Indentation here seems to be weird, don't use tabs to indent pass the parent line.


This have to be doublechecked actually.


Tried using CUDA_RESOURCE_DESC resDesc = {0};?

This revision now requires changes to proceed.May 17 2016, 5:07 PM

For the bindless mapping array, I think that can just be a global variable in the kernel like all other data arrays. Just one that isn't in DeviceScene, but internally set by the CUDA device. So basically just adding a device_vector<uint> bindless_mapping; member to CUDADevice.

157 ↗(On Diff #6678)

If we're making slots part of the device API, then I think it makes sense to not only do it for bindless CUDA textures, but for CPU textures too, which are now decoding the slot from the name string.

Doesn't have to be done as part of this patch or even soon, but would make the code nicer I think.

  • Merge branch 'master' into soc-2016-cycles_images
  • Fix some review comments (Variable naming and spacing).
  • Merge branch 'master' into soc-2016-cycles_images
  • Merge branch 'master' into soc-2016-cycles_images
  • Bindless Textures: Move bindless_mapping to CUDADevice.
Thomas Dinges (dingto) marked 2 inline comments as done.May 18 2016, 10:20 PM

Okay, some things here:

  • 3D Textures actually work, it seems the issue is (not bindless related) in the adaptive CUDA kernel.
  • 2D Images work fine too, except float4. This issue appeared after moving bindless_mapping to the CUDADevice. Strangely, float4 works for Smoke, but not for Image Textures.

I get several errors with a float4 image texture:

CUDA error: Illegal address in cuCtxSynchronize()
CUDA error: Illegal address in cuMemcpyDtoH((uchar*)mem.data_pointer + offset, (CUdeviceptr)(mem.device_pointer + offset), size)
CUDA error: Illegal address in cuMemFree(cuda_device_ptr(mem.device_pointer))
CUDA error: Illegal address in cuMemFree(cuda_device_ptr(mem.device_pointer))
I0518 22:17:41.698508 30649 light.cpp:583] Background MIS build time 0.0103688
I0518 22:17:41.698550 30649 device_cuda.cpp:483] Texture allocate: __light_background_marginal_cdf, 16392 bytes.
CUDA error: Illegal address in cuMemAlloc(&device_pointer, size)
CUDA error: Illegal address in cuModuleGetGlobal(&cumem, &cubytes, cuModule, bind_name.c_str())
CUDA error: Illegal address in cuMemcpyHtoD(cumem, (void*)&ptr, cubytes)
I0518 22:17:41.698585 30649 device_cuda.cpp:483] Texture allocate: __light_background_conditional_cdf, 33587208 bytes.
CUDA error: Illegal address in cuMemAlloc(&device_pointer, size)
CUDA error: Illegal address in cuModuleGetGlobal(&cumem, &cubytes, cuModule, bind_name.c_str())
CUDA error: Illegal address in cuMemcpyHtoD(cumem, (void*)&ptr, cubytes)
I0518 22:17:41.699040 30641 blender_session.cpp:550] Total render time: 0.321294
I0518 22:17:41.699064 30641 blender_session.cpp:551] Render time (without synchronization): 0.22552
CUDA error: Illegal address in cuMemFree(cuda_device_ptr(mem.device_pointer))
CUDA error: Illegal address in cuMemFree(cuda_device_ptr(mem.device_pointer))

Any idea?

ToDo: Free the CUtexObject's and check if for viewport render etc, re-uploading the mapping works fine.


Not sure whats wrong here?


This is giving me a compile error.

error: invalid conversion from ‘int’ to ‘CUresourcetype {aka CUresourcetype_enum}’ [-fpermissive]

Not sure what's going on with float4 textures, but some comments.


We don't need two arrays, we can have just a single device_vector<uint> bindless_mapping;.

And then to add slots:

if (flat_slot >= bindless_mapping.size())
    bindless_mapping.resize(max(bindless_mapping.size() * 2, 1));

bindless_mapping[flat_slot] = tex;

Push/pops don't seem to be matched anymore, or at least the logic is hard the follow. Better to do it more often than have confusing logic.


This architecture check is unnecessary as we should never have sync_bindless_mapping = true with other architectures.

And the copying code can be simplified if we have a single array.

Also, this code should be moved to a separate function and called also from shader(), since baking can access textures too.


This should only be added for if needed, and will also not require the number of image textures to be reduced on other cards.

Everything works now (hopefully) :) Also tested Fermi code path by switching defines, so that works as well.

  • Only have one bindless_mapping vector, and also load textures for shader() kernel.
  • Free the CUtexObjects properly.
  • Use some more cuda_pop_context(); calls to make following this easier.
  • Fixes for Fermi.
Thomas Dinges (dingto) marked 4 inline comments as done.May 19 2016, 2:47 AM
Thomas Dinges (dingto) added inline comments.

The removal of the 3 slots here is fine, the /* Kepler and above */ comment was outdated. Atm we have 5 float4 and 88 byte4 textures on Fermi, so 93 slots in total, starting with index 0 we are then at 092. :)

Thomas Dinges (dingto) marked an inline comment as done.
  • Little tweaks for ifdefs.
Brecht Van Lommel (brecht) requested changes to this revision.May 19 2016, 3:54 AM

Seems to be ready except for a few simple things to fix.


Call tex_free(bindless_mapping) here before tex_alloc(), so we don't leak memory on reallocation.


Just have a single cuda_pop_context() above the if().


Remove this, it doesn't correspond to any push.

This revision now requires changes to proceed.May 19 2016, 3:55 AM
  • Adress review comments.

Done, and removed in the else branch below too.

Do we really need to pass flat_slot to the device? Seems a bit redundant to me, especially since it only makes sense for image textures.

Some ideas how to make it more clear:

  • Deduct flat_slot from the name, similar to what's happening around string_printf("__tex_image_%s_3d_%s" ... )

Shouldn't be too bad, allocation is still much slower than parsing a string. And that bit around string_split() can be simplified with just string::rfind()

  • Have a dedicated Device::tex_alloc_image, within which having a slot argument will make much more sense.
  • Split name to name and id, so we'll do something like device->tex_alloc("__float_image", 100500, ...) and merge name and id in the tex_alloc() itself.

Not saying it's essential to do before initial commit an such a refactors can happen afterwards. Just such a particular-device-only-things pierced through generic API are a bit worrying to me. Any thoughts here?

  • Don't pass flat_slot to tex_free(), handle this inside the device.
  • Avoid passing flat_slot to tex_alloc() as well.
172 ↗(On Diff #6699)

This is effectively way to a memory leak, you modified tex_free() signature here, which is now different from base class, meaning this code will never run.


Eventually we can merge this two maps into one like this:

struct MemInfo {
   bool is_array_data;
   CUtexObject bindless_texture;
map<device_ptr, MemInfo> tex_meminfo_map;

This can be done later. For now i would only strongly suggest using CUtexObject instead of uint for non-kernel code.


Use need_ prefix to distinguish flag from method?


Will it work correct if the texture was never allocated?

Use parenthesis btw.


This we should add a check that tex fits into uint and print a warning or abort rendering otherwise. CUtexObject is unsigned long long after all which is not necessarily gonna to fit into a smaller type.


Screw it then for now.


Shall we do a call from thread_run() instead? path_trace() is run for each sample, and trying to load bindless mapping for each of them is a bit redundant.

  • Adress further review comments.
Thomas Dinges (dingto) marked 6 inline comments as done.May 19 2016, 12:48 PM

Yes, because in that case there is no valid mem.device_pointer and the function will do nothing.

This revision was automatically updated to reflect the committed changes.