- User Since
- Mar 31 2015, 9:29 AM (120 w, 5 d)
Thu, Jul 20
Should be working in c1ca3c8.
Thanks, will take another look. Give me a couple of minutes.
@Maxime Michel (maximemichel) if you could please verify that this is fixed with the upcoming automated builds.
Wed, Jul 19
@Brecht Van Lommel (brecht) You're probably right. I think it needs to use the same texture IDs for all CUDA architectures and then map back to SM_2x on the kernel side. I'll go dig out my GTX 460...
The attached scene is missing its textures. Can you please pack them and upload it again?
Tue, Jul 4
One more round of improvements. A few optimisations, and a change to the heuristic for switching between sampling strategies. It now looks at the triangle's edge lengths instead of its area, which should hopefully help with long, thin triangles.
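To illustrate why edge lengths can be a more telling measure than area for long, thin triangles, here is a minimal sketch. It is not the heuristic from the patch; the helper names and the two example triangles are made up for illustration only.

```python
import math

def sub(a, b):
    # Component-wise difference of two 3D points.
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    # Cross product of two 3D vectors.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def length(v):
    return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)

def triangle_area(v0, v1, v2):
    # Half the magnitude of the cross product of two edges.
    return 0.5 * length(cross(sub(v1, v0), sub(v2, v0)))

def longest_edge(v0, v1, v2):
    return max(length(sub(v1, v0)), length(sub(v2, v1)), length(sub(v0, v2)))

# Hypothetical examples: a sliver and a compact triangle with the same area.
sliver = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.001, 0.0))
compact = ((0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0))

for name, tri in (("sliver", sliver), ("compact", compact)):
    print(name, triangle_area(*tri), longest_edge(*tri))
```

Both triangles have the same area, but the sliver spans a far larger extent, so a test based on area alone cannot tell them apart while one based on edge lengths can.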
Mon, Jul 3
Changing the sample and pdf calls to forceinline made things work on CUDA.
Sun, Jul 2
This one's not quite ready yet. I'm seeing odd artifacts when using this on CUDA hardware that don't show up when rendering on the CPU. I can't say yet what's causing them; this needs some investigation.
Sat, Jul 1
Another update to stay in sync with master.
Fri, Jun 30
A new update, taking Brecht's comments into account and a few other improvements.
Thu, Jun 29
For such a simplistic approach, it works surprisingly well, I have to admit. However, it is very dependent on tile size: if neighbouring tiles have very different numbers of samples, you can end up with visible squares in your render, see the attached example:
Wed, Jun 28
Deformation motion blur is now supported too. It makes quite a difference, since objects with deformation motion blur were previously not sampled as light sources.
I added support for instancing and object motion blur. Deformation motion blur is still missing, and I don't think it was supported before either.
Tue, Jun 27
I just noticed a bug in this myself - it doesn't work properly with instanced triangles, as it does not apply the transform. Will try and fix this.
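As a rough illustration of the missing step, the sketch below moves a triangle's object-space vertices into world space with the instance's transform before anything is sampled from it. The 3x4 matrix layout and the function names are assumptions for this example, not the Cycles code.

```python
def transform_point(m, p):
    # Apply a 3x4 affine matrix (three rows of [rx, ry, rz, t]) to a point.
    return tuple(m[i][0] * p[0] + m[i][1] * p[1] + m[i][2] * p[2] + m[i][3]
                 for i in range(3))

def instanced_triangle(object_to_world, verts):
    # World-space vertices of a triangle that is stored in object space.
    return tuple(transform_point(object_to_world, v) for v in verts)

# Hypothetical instance: the same mesh triangle, moved 5 units along X.
translate_x5 = ((1.0, 0.0, 0.0, 5.0),
                (0.0, 1.0, 0.0, 0.0),
                (0.0, 0.0, 1.0, 0.0))
local = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
world = instanced_triangle(translate_x5, local)
print(world)
```

Without this step, every instance would be sampled at the location of the original mesh data rather than where the instance actually sits in the scene.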
Jun 19 2017
Looks good to me. This is similar to how we handled the situation with the "Physical Root Node" in Poser.
Apr 27 2017
Apr 26 2017
Addressed DingTo's comments.
When building using Xcode 7.2.1, your code throws compiler warnings. Details in the inline code comments.
Addressed Tod's comments, magic number is now replaced with a more descriptive #define.
Apr 25 2017
Addressed Sergey's comments and made changes that should (hopefully) make OpenCL work.
I just noticed that this patch doesn't work with OpenCL, so I'll need to address that too.
For what it's worth, here's a Python script that will create thousands of Suzannes and put individual 1x1 pixel textures on them: http://pasteall.org/371017/python
Apr 20 2017
Apr 7 2017
While I'm at it, shouldn't we also replace the various STREQ(snode->tree_idname, ...) calls in node_group.c with the ED_node_is_*() calls for improved readability?
Apr 5 2017
What will it take to get this and D2444 into master? I'd love to be able to have the foundation for many other passes (light groups, Cryptomatte, eventually even LPEs) to be added to Blender and Cycles.
Mar 29 2017
This is also very useful for 3rd party integrations. I've been in a situation before where there were two headers called node.h in the header search path.
Mar 23 2017
For support in OSL, it should only take an update to OpenImageIO 1.7 or newer:
Mar 22 2017
This is an update of the patch against the latest master. It still needs to be changed to share host memory between GPUs where possible instead of creating duplicate allocations.
Feb 17 2017
Oct 6 2016
I have experimented in the same direction. My code was using shared memory, which should in theory be faster than global/local memory. As Brecht mentions, moving through shadow in steps of N intersections at a time could make this work for arbitrary shadow depths. One would need to find the N closest intersections then, which I would try by keeping the intersections in a sorted heap (similar to Jensen's photon mapping code). Then sorting and volumes should also be doable.
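The idea of keeping only the N closest intersections in a heap can be sketched as follows. This is a CPU-side illustration using Python's `heapq`, not the kernel code; `n_closest_hits` and the plain list of hit distances are hypothetical stand-ins for the real intersection records.

```python
import heapq

def n_closest_hits(hit_distances, n):
    """Return the n smallest hit distances in ascending order.

    A bounded max-heap is emulated by storing negated distances, so
    heap[0] always corresponds to the farthest hit currently kept.
    """
    if n <= 0:
        return []
    heap = []
    for t in hit_distances:
        if len(heap) < n:
            heapq.heappush(heap, -t)
        elif t < -heap[0]:
            # The new hit is closer than the farthest kept hit: swap it in.
            heapq.heapreplace(heap, -t)
    return sorted(-x for x in heap)

print(n_closest_hits([5.0, 1.0, 3.0, 9.0, 2.0], 3))
```

Memory use stays bounded by N regardless of how many intersections a ray finds, which is what would make stepping through the shadow ray N hits at a time feasible.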
Jul 21 2016
Here's an updated patch.
Jun 30 2016
Looks like cuMemAllocHost() is not the same as malloc() followed by mlock(). While malloc/mlock easily allows me to get 14 out of 16GB on my machine, cuMemAllocHost() freezes the entire machine at 12GB.
Jun 29 2016
Right now ray traversal is not making any use of shared memory at all. There may be an opportunity here, either for storing a short stack or the bit mask from the Barringer paper in shared memory. If I understood the Barringer paper correctly, they compared a CPU implementation only. The differences between global/local memory access and shared memory access could give different results in a GPU implementation.
The change to SHADOW_RECORD_ALL wasn't meant as a permanent solution, but to compare how GPU performance would be in the largest Blender scene available to me (will happily render any other scenes). When SHADOW_RECORD_ALL is different for CPU and GPU, we're not tracing the same number of rays, and benchmarks with high levels of transparent shadows will always perform better on the CPU.
Jun 22 2016
My computer just finished rendering the Victor/Gooseberry benchmark scene on two GPUs, coming in at about 1h:36m (conservative tile size of 64x64). So I think my patch is pretty solid.
Jun 19 2016
Jun 16 2016
There is no swapping with pinned memory. That's the whole point about pinned memory - it always remains fixed in physical memory, at the same address, no matter how severe the OS' memory pressure is. That said, it was an SSD.
Octane exposes what they call "out-of-core" textures, which I assume is textures in pinned memory, as a feature that the user has to enable and to pick how much memory it is supposed to use: https://docs.otoy.com/Standalone_2_0/?page_id=3216
Different operating systems may handle this differently, but for what it's worth, I ran a simple test on OS X. cuMemAllocHost() in a loop, allocating 200MB chunks, 100 times - which should try to allocate ~20GB in total. My machine has 16GB of physical memory. At just under 10GB allocated, the machine froze completely. That is, not even the mouse pointer would move and the machine no longer responded to network requests. After about one or two minutes of being frozen, the machine rebooted, the "crash" log pointing to the Watchdog task.
Jun 15 2016
From the CUDA documentation it sounds like pinning all available memory may not be a good idea:
It only pins the amount of memory required, no more. The user preference would only set an upper bound to that.
A couple of ideas for improvement:
This should now include all changes, squashed into a single commit.
Yes. I'm still trying to figure out how this system and git patches work. It looks like when I try to upload my diff file with multiple commits in it, it takes only the first one.
This seems to be solid enough to allow me to launch a render of the Gooseberry benchmark scene on my 4GB GPU with TDR turned off. However, some tiles render extremely slowly (counting seconds per sample instead of samples per second!), and I haven't found out yet what crazy things are happening in them.
OK, here you are: https://developer.blender.org/D2056
I have attached a patch that should hopefully take care of the Linux memory query. Apply this on top of the first patch.
The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.
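For reference, one common way to query physical memory on Linux is `sysconf` with the page size and the number of physical pages. The sketch below shows that approach through Python's `os.sysconf`; it is only an illustration and not the code from the patch, and `physical_memory_bytes` is a hypothetical helper name.

```python
import os

def physical_memory_bytes():
    """Return total physical memory in bytes, or None if it cannot be queried."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or one of the names is unavailable on this platform.
        return None
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count

total = physical_memory_bytes()
print("unknown" if total is None else f"{total / 2**30:.1f} GiB")
```

Returning `None` instead of guessing lets the caller fall back to a conservative limit when the query is not supported.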
Jun 14 2016
I can get the Victor scene to start rendering on the GPU, but it fails with a kernel timeout after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?
PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.
Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it - my laptop has an i7 quad core and a 750M, there the GPU/CPU difference is not as strong. I know, K5000 sounds like a monster card, but for this purpose, it's more or less a GTX 670 with twice the memory.
Jun 13 2016
May 24 2016
May 19 2016
May 18 2016
May 17 2016
This updated patch removes the now unused ray_quad_intersect() and ray_triangle_intersect().
May 16 2016
Here is a before/after comparison with an area light:
The same shader on a point light:
May 15 2016
Nov 27 2015
Sorry for not responding earlier, there were a number of other things to work on. Sergey, is this still relevant or do your latest changes to SSS take care of this?
Nov 20 2015
Here we go. This would be my proposed fix. I hope my code style isn't too far from your standards.
Nov 19 2015
Querying for CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (via cuFuncGetAttribute) on a machine with SM 5.2 returns 89024 bytes before the patch and 75056 after the patch. So it's about 15% less local memory per kernel thread.
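The saving can be checked directly from the two reported values. The variable names below are chosen for this example only.

```python
# Local memory per kernel thread, as reported above (bytes).
before = 89024
after = 75056

saved = before - after
reduction_percent = 100.0 * saved / before

print(f"saved {saved} bytes per thread ({reduction_percent:.1f}% less)")
```

The exact reduction comes out a little under 16%, in line with the rough figure quoted.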