[cycles] add an nvrtc based cubin cli compiler.
Accepted · Public

Authored by LazyDodo (LazyDodo) on Thu, Nov 9, 4:40 AM.

Details

Summary

First of all, this is nowhere near ready for master, but it is a working proof of concept I'd like some feedback on.

nvcc is notoriously picky about the version of the host compiler, which is annoying because most of the Cycles code has no dependency on the host headers at all, nor do we use mixed CUDA/native code; the host compiler should be mostly irrelevant for Cycles. On Windows things are worse, because NVIDIA tends to drag its heels on supporting newer Visual Studio versions (it took over 7 months to support MSVC 2017, and once support came out, it covered only the initial release, with Update 1 support 'in beta'; by the time CUDA 9 came out, MSVC was on Update 5), which pretty much makes it impossible for us to use a recent compiler.

This patch adds a CLI compiler called cycles_cubin_cc that is pretty much a replacement for nvcc, using the nvrtc library.

noteworthy things

  • I only tested it on Windows.
  • nvrtc is only available on x64; for x86 builds we'll have to build a 64-bit cycles_cubin_cc. However, since I was lazy, the proof of concept depends on cycles_util and OIIO's parameter parsing libs, and we can't expect users to check out two lib folders just to build this thing.
  • nvrtc is used through cuew; license-wise I think we should be able to ship it (if we choose to do so).
  • Currently it only supports build-time cubin generation. I guess it could support on-demand compilation, but I haven't implemented that yet.
  • The CUDA toolkit still needs to be installed completely for it to work, for the following reasons:
    • Cycles needs cuda.h.
    • nvrtc lacks a linker (it outputs PTX), so it shells out to ptxas to link the final cubin. (We could do this through the CUDA driver API, but ptxas works on hosts without the NVIDIA driver installed, like our buildbots, so it's the better choice.)
  • There's probably a ton of code style violations, sorry!

todo

  • Clean up the cycles_util/OIIO dependencies so it can be built for an x86 build.
  • Preferably take the same arguments as nvcc so it'll be a drop-in replacement.
  • NVIDIA added a version number to the nvrtc DLL filename; we should probably find a nicer way to deal with this inside cuew.
  • Test on Linux.
  • I would like this to become the standard compiler on Windows, so we are no longer held hostage by CUDA's MSVC support (or lack thereof).

performance

(on an sm_30 card)

scene               | nvcc     | cycles_cubin_cc
pabbelon_barcelona  | 23:22.23 | 22:42.62
koro                | 14:14.76 | 14:16.96
fishy_cat           | 09:59.92 | 09:56.98
classroom           | 12:52.15 | 12:48.05
bmw27               | 04:34.35 | 04:30.17

Given this is a rather polluted devbox, I'd say the performance is more or less identical to nvcc.

thoughts?

Diff Detail

Repository
rB Blender
Branch
arcpatch-D2913_1
Build Status
Buildable 959
Build 959: arc lint + arc unit
Brecht Van Lommel (brecht) requested changes to this revision. Thu, Nov 9, 1:58 PM

This is great. It also makes runtime/adaptive compilation for CUDA feasible I guess, if we want to officially support that at some point.

There are very few users with both 32 bit and an NVidia card supported by CUDA 9, so I wouldn't mind dropping support for that if it's a burden.

It worked for me on Linux after some tweaks. We could do this to get rid of the dependency on CUDA headers too:

diff --git a/intern/cycles/kernel/kernel_compat_cuda.h b/intern/cycles/kernel/kernel_compat_cuda.h
index fa512f8..c7a8f11 100644
--- a/intern/cycles/kernel/kernel_compat_cuda.h
+++ b/intern/cycles/kernel/kernel_compat_cuda.h
@@ -30,10 +30,20 @@
 #  define __NODES_FEATURES__ NODE_FEATURE_ALL
 #endif

-#include <cuda.h>
-#include <cuda_fp16.h>
-#include <float.h>
-#include <stdint.h>
+typedef unsigned short half;
+typedef unsigned long long CUtexObject;
+typedef unsigned int uint32_t;
+typedef unsigned long long uint64_t;
+
+#define FLT_MAX 340282346638528859811704183484516925440.0f
+#define FLT_MIN 1.175494350822287507969e-38f
+
+__device__ half __float2half(const float f)
+{
+       half val;
+       asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(val) : "f"(f));
+       return val;
+}

 /* Qualifier wrappers for different names on different devices */
intern/cycles/app/cycles_cubin_cc.cpp
64

These should use typedef I think.

65

This expands to #define FLT_MAX FLT_MAX here, and it would also be missing the f postfix if it worked. We can just hardcode it:

"#define FLT_MAX 340282346638528859811704183484516925440.0f\n#define FLT_MIN 1.175494350822287507969e-38f\n"
176–184

These environment variables are Windows-only. I had to set these manually to /usr/local/cuda/[include|bin], and also LD_LIBRARY_PATH to /usr/local/cuda/lib64.

273

This should be true?

This revision now requires changes to proceed. Thu, Nov 9, 1:58 PM

This is great. It also makes runtime/adaptive compilation for CUDA feasible I guess, if we want to officially support that at some point.

It's more of a bonus; the main motivation was to stop CUDA holding us hostage with regard to compiler versions.

There are very few users with both 32 bit and an NVidia card supported by CUDA 9, so I wouldn't mind dropping support for that if it's a burden.

Yeah, while I do think it's OK not to support runtime/adaptive compilation on x86, I can imagine a lot of upset users if we stopped shipping cubins with x86 Windows builds. On the upside, once ironed out I expect this code to change rarely, so we could just toss a precompiled version in the x86 libdir.

It worked for me on Linux after some tweaks. We could do this to get rid of the dependency on CUDA headers too:

+typedef unsigned short half;
+typedef unsigned long long CUtexObject;
+typedef unsigned int uint32_t;
+typedef unsigned long long uint64_t;
+
+#define FLT_MAX 340282346638528859811704183484516925440.0f
+#define FLT_MIN 1.175494350822287507969e-38f
+

This would pretty much eliminate the hacky defines in the code, I like it.

intern/cycles/app/cycles_cubin_cc.cpp
65

The definitions for MSVC are different (FLT_MAX 3.402823466e+38F, FLT_MIN 1.175494351e-38F). I initially hardcoded those values but figured that might cause incompatibility with GCC, so I figured let's just inject the compiler's defines, because those surely can't be wrong. I have no strong opinion here, whatever works.

176–184

Yeah, it is probably a better idea to pass these as parameters; I'm not overly comfortable system()-ing an environment variable in the first place.

273

Yes, it should be.

Even more so, we could use this for the on-demand kernel compilation in device_cuda.cpp. Since it is possible to load PTX directly from the CUDA runtime (with the driver taking over the ptxas step), this should enable on-demand compilation on any operating system without any dependencies other than the CUDA driver.

Even more so, we could use this for the on-demand kernel compilation in device_cuda.cpp. Since it is possible to load PTX directly from the CUDA runtime (with the driver taking over the ptxas step), this should enable on-demand compilation on any operating system without any dependencies other than the CUDA driver.

Not entirely true: the nvrtc libraries don't ship with the driver for some idiotic reason, and due to our license we can't include the needed shared libs in our package.

  • Cleaned up the target_link_libraries for cycles_cubin_cc so it only depends on the needed libraries.
  • Removed the use of environment variables; you can now specify the CUDA toolkit root dir with a parameter.
  • Added support for CUDA 8 nvrtc in cuew.
  • Applied patch by @Brecht Van Lommel (brecht) to lose the dependency on cuda.h.
  • Added a verbose flag in CMakeLists.txt so we can see the register spilling information again.
  • Flipped the compilation order of kernel and filter, since filter builds in seconds and I got tired of waiting on kernel to finish when testing small changes to cycles_cubin_cc.

The exit(0) for debug builds is still a mystery to me; it's like OIIO managed to allocate some memory bypassing the guarded allocator?

  • Automatically enable cycles_cubin_cc when WITH_CYCLES_CUDA_BINARIES is on and MSVC 2017 is detected.

Looks generally fine, I'll update the diff with some code style and Linux fixes if you don't mind.

I can remove the dependency on cycles_util to avoid the issue with the debug build.

intern/cycles/app/cycles_cubin_cc.cpp
181

Is this to make it find the nvrtc-builtins library?

On Linux the best solution I could find so far is to set LD_LIBRARY_PATH, seems there is no way to specify it at runtime from within cycles_cubin_cc.

Looks generally fine, I'll update the diff with some code style and Linux fixes if you don't mind.

Don't mind at all, go at it!

intern/cycles/app/cycles_cubin_cc.cpp
181

cuew will try to call LoadLibrary on nvrtc64_90.dll; for it to find the DLL there are three options (ignoring a bunch of crazy registry hacks I really don't want to do):

  1. It's in the path or the system32 folder. We have no LD_LIBRARY_PATH; on Windows there's a single 'Path' environment variable for all the things, with a neat limitation of 1024 characters (no hard limit, but when you go over, Windows gets very 'unhappy'), so I don't like adding things to that. Adding things to system32 is also frowned upon nowadays.
  2. The DLL sits in the same folder as the executable requiring it. We'd have to copy the DLLs to our output folder; we could probably do this in CMake if we wanted to, but we might get the desire to ship these with Blender, which license-wise we really can't do. Best to avoid temptation here and not have them in the same folder.
  3. We hint the loader with a single additional path to look in via SetDllDirectory (XP SP1 and up; you can't add multiple paths until Windows 8, where they introduced AddDllDirectory).

I guess we could add a parameter to cuew_init with the base path to the CUDA toolkit (I don't want to rely on the environment vars since that makes switching between toolkits difficult), do some work there to locate the library, and call LoadLibrary with a full path. I assume that loading a library with a full path will also sidestep the LD_LIBRARY_PATH issue on Linux?

intern/cycles/app/cycles_cubin_cc.cpp
181

A full path doesn't work on Linux; there, libnvrtc.so seems to internally dlopen() libnvrtc-builtins.so without a full path. I thought you had the same issue on Windows.

The current solution for Windows seems ok to me. I wish we had an equivalent of SetDllDirectory on Linux, but it doesn't seem to exist.

intern/cycles/app/cycles_cubin_cc.cpp
181

I think calls to setenv only last for the lifetime of the process; maybe something like this might do the trick for you?

setenv("LD_LIBRARY_PATH", (std::string(getenv("LD_LIBRARY_PATH")) + ":" + settings.cudatoolkitdir + "/bin").c_str(), 1);
intern/cycles/app/cycles_cubin_cc.cpp
181

It doesn't work; the LD_LIBRARY_PATH environment variable is read once, when the process starts.

intern/cycles/app/cycles_cubin_cc.cpp
181

Not out of ideas yet!

  1. Would detecting that it's missing from LD_LIBRARY_PATH, setting it, and fork()/exec()ing work?
  2. Could we set it in CMake before the call to cycles_cubin_cc?

Option 2 seems the neater solution...

Brecht Van Lommel (brecht) edited the summary of this revision. (Show Details)
  • Fix missing -D flags compared to nvcc
  • Set LD_LIBRARY_PATH to find libnvrtc-builtins on Linux
  • Use vector<> instead of malloc()
  • Remove dependency on cycles_util
  • Code style tweaks

LD_LIBRARY_PATH is set automatically in CMake now. It's not too bad; it's just not ideal if we do runtime compilation with libnvrtc in the future.
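
A minimal sketch of what that CMake side could look like, assuming a custom command drives cycles_cubin_cc; the variable names and the cycles_cubin_cc flags below are hypothetical, not the actual build code:

```cmake
# Hypothetical sketch: run cycles_cubin_cc via `cmake -E env` so
# libnvrtc-builtins.so can be found on Linux without a global
# LD_LIBRARY_PATH. CUDA_TOOLKIT_ROOT_DIR comes from FindCUDA; the
# cycles_cubin_cc arguments are made up for illustration.
add_custom_command(
	OUTPUT ${cubin_file}
	COMMAND ${CMAKE_COMMAND} -E env
	        "LD_LIBRARY_PATH=${CUDA_TOOLKIT_ROOT_DIR}/lib64"
	        $<TARGET_FILE:cycles_cubin_cc>
	        -target sm_30 -i ${kernel_file} -o ${cubin_file}
	DEPENDS cycles_cubin_cc ${kernel_file})
```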

This revision is now accepted and ready to land. Sun, Nov 12, 6:52 PM

Fix build with nvcc.