Cycles Networking support
Needs Revision · Public

Authored by Lukas Stockner (lukasstockner97) on Aug 24 2017, 10:26 PM.

Details

Summary

This patch contains a number of changes to the Cycles Network rendering code that make it actually work.

Supported are:

  • Rendering on another machine
  • Rendering on multiple machines using the MultiDevice
  • CPU and CUDA devices on servers
  • Viewport rendering (although latency is an issue here)
  • Hiding network latency by keeping a tile queue on each server, so tiles can be released/acquired in a background thread while the worker threads keep rendering (see the sketch after this list)
  • Automatic detection of servers using either UDP broadcast or, if broadcasting is not supported on the network, making the servers announce their presence to a certain host
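
As a rough illustration of the tile queue mentioned above (this is not the patch's code; the Tile struct and the acquire callback are placeholders): a background thread keeps a small local queue of pre-acquired tiles topped up over the network, while the worker threads only ever pop from that queue, so the network round trip stays hidden as long as the queue never runs dry.

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Tile { int x, y, w, h; };

    /* Keeps a few tiles pre-acquired over the network so that the worker
     * threads only ever pop from a local queue and never wait on a round trip. */
    class TileQueue {
    public:
      TileQueue(size_t target_size, std::function<bool(Tile &)> acquire_from_host)
          : target_size_(target_size), done_(false), stopped_(false)
      {
        filler_ = std::thread([this, acquire_from_host] {
          Tile tile;
          while (acquire_from_host(tile)) { /* Network round trip happens here. */
            std::unique_lock<std::mutex> lock(mutex_);
            space_.wait(lock, [this] { return queue_.size() < target_size_ || stopped_; });
            if (stopped_)
              break;
            queue_.push(tile);
            available_.notify_one();
          }
          std::lock_guard<std::mutex> lock(mutex_);
          done_ = true;
          available_.notify_all();
        });
      }

      /* Called by the render worker threads; only touches the local queue. */
      bool pop(Tile &tile)
      {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !queue_.empty() || done_; });
        if (queue_.empty())
          return false;
        tile = queue_.front();
        queue_.pop();
        space_.notify_one();
        return true;
      }

      ~TileQueue()
      {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          stopped_ = true;
        }
        space_.notify_all();
        filler_.join();
      }

    private:
      size_t target_size_;
      bool done_, stopped_;
      std::queue<Tile> queue_;
      std::mutex mutex_;
      std::condition_variable available_, space_;
      std::thread filler_;
    };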

Not supported yet are:

  • Denoising. It could be added easily, but I think latency would make tile mapping a significant bottleneck with multiple servers. Since denoising is usually much faster than rendering, a reasonable fallback might be to have the host machine denoise locally.
  • Cross-platform networking. At least in my tests, Boost archives don't work between Linux and Windows. An interesting alternative might be Protocol Buffers, especially since C++11 will allow us to get rid of the other parts of Boost.
  • Multicasting. Currently the data is sent to each server individually.
  • Error handling. Right now it usually just crashes when something goes wrong.
  • OpenCL. It currently renders a black image.
  • MultiDevices on servers, so only a single GPU can be used per server. To support that, we'd need to move away from passing Subdevice pointers to acquire_tile towards a device ID system that supports nested MultiDevices (one on the host, one on each server) - a rough illustration follows below.
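
Purely as an illustration of that last point (nothing like this exists in the patch, and all names are made up): a device ID could be a path of sub-device indices that each MultiDevice level resolves one step at a time, which naturally lets a MultiDevice on the host address a sub-device of a MultiDevice on a server.

    #include <cstddef>
    #include <vector>

    /* A device ID as a path of sub-device indices: {} means the device itself,
     * {2} means sub-device 2 of a host MultiDevice, {2, 1} means sub-device 1
     * of the MultiDevice running on server 2, and so on. */
    struct DeviceID {
      std::vector<int> path;
    };

    struct Device {
      std::vector<Device *> subdevices; /* Empty for non-Multi devices. */

      /* Each MultiDevice level consumes one index and forwards the rest,
       * so MultiDevices can nest (host MultiDevice -> server MultiDevice -> GPU). */
      Device *resolve(const DeviceID &id, size_t depth = 0)
      {
        if (depth == id.path.size())
          return this;
        int index = id.path[depth];
        if (index < 0 || size_t(index) >= subdevices.size())
          return nullptr;
        return subdevices[index]->resolve(id, depth + 1);
      }
    };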

Future improvements:

  • Loading images locally on the servers. While the actual rendering generally runs at native speed thanks to the tile queue system, the sync stage takes a long time, especially with low bandwidth. The largest part of the data in most scenes is images, which also tend to be the most static data across frames/rerenders. Therefore, it makes sense to copy them to the servers once and then load them from local disk.
  • Cascading data distribution. In networks that don't support multicast, the sync time currently is O(n) in the number of nodes. By implementing a cascading system where the host first sends the data to server 0, then to server 1 while server 0 sends it to server 2, and so on, that could be reduced to O(log n) as long as the switches etc. hold up to the traffic (see the first sketch after this list).
  • Servers joining/leaving mid-render. In theory, that wouldn't be too hard: for joining, the host remembers which buffers are currently allocated, and when a server joins it is sent that data followed by the current task; for leaving, the host would need to keep a list of the tiles currently on each server and redistribute them to others when a server disconnects. That would allow for great flexibility - when you notice that the render is too slow, you can just add instances on demand. Supporting this is also important for e.g. using AWS spot instances effectively.
  • Adaptive queues. Currently, a rather simple heuristic determines the number of tiles queued on each server, but by measuring tile times and latency the number of tiles actually needed could be determined.
  • Networking protocols. Currently the code is written for TCP/IP, but only a very small part of the code actually cares about that, so it should be easy to add support for things like MPI or even InfiniBand in the future.
  • Message queues. The current code disables the Nagle algorithm since it caused huge latency (100ms roundtrip on localhost) in code that sends a packet and then waits for a reply (since the OS waits for a while before actually sending). However, the general idea behind it makes sense, and the communication pattern indeed shows sequences of sent packets that don't actually require waiting for a reply. Therefore, it might make sense to implement a local message queue that is simply flushed as soon as a receive call happens, or explicitly after sending a packet that requires an answer (see the second sketch after this list).
  • Instead of making network rendering an option like CPU and GPU, make it a checkbox that enables Network Devices in addition to the selected local device.
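
To make the O(log n) claim from the cascading-distribution point concrete, here is a toy round-based plan (node numbering and output are purely illustrative, not the patch's behaviour): in every round, each node that already holds the scene data sends it to one node that doesn't, so the number of nodes with the data roughly doubles per round.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main()
    {
      const int n = 9;              /* number of render servers */
      std::vector<int> have = {-1}; /* nodes that already hold the data; -1 is the host */
      int next = 0;                 /* next server that still needs the data */

      for (int round = 1; next < n; round++) {
        size_t senders = have.size();
        for (size_t i = 0; i < senders && next < n; i++) {
          printf("round %d: node %d sends scene data to server %d\n", round, have[i], next);
          have.push_back(next);
          next++;
        }
      }
      /* The number of nodes holding the data doubles every round, so all n
       * servers are reached after about ceil(log2(n + 1)) rounds instead of
       * n sequential sends from the host. */
      return 0;
    }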
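
And a bare-bones sketch of the message-queue idea (the class and its interface are invented here, not taken from device_network.cpp): sends are only buffered locally, and the buffer is flushed explicitly after a packet that expects an answer or implicitly before any blocking receive, which gives the batching benefit of Nagle's algorithm without its delay.

    #include <cstdint>
    #include <functional>
    #include <vector>

    /* Buffers outgoing packets and only hands them to the socket when flushed,
     * so a burst of small RPCs that need no reply goes out as a single write.
     * The actual socket write is abstracted as a callback in this sketch. */
    class SendQueue {
    public:
      explicit SendQueue(std::function<void(const std::vector<char> &)> write_to_socket)
          : write_(write_to_socket) {}

      /* Queue a packet: 4-byte length prefix (host byte order, good enough for
       * a sketch) followed by the payload. */
      void send(const void *data, uint32_t size)
      {
        const char *size_bytes = reinterpret_cast<const char *>(&size);
        buffer_.insert(buffer_.end(), size_bytes, size_bytes + sizeof(size));
        const char *payload = static_cast<const char *>(data);
        buffer_.insert(buffer_.end(), payload, payload + size);
      }

      /* Called explicitly after a packet that expects an answer, and implicitly
       * by any blocking receive. */
      void flush()
      {
        if (!buffer_.empty()) {
          write_(buffer_);
          buffer_.clear();
        }
      }

    private:
      std::function<void(const std::vector<char> &)> write_;
      std::vector<char> buffer_;
    };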

How to use it:

  • Enable WITH_CYCLES_NETWORK (no need for standalone)
  • If you're on Windows (and OSX I guess), you need to get libboost-serialization - for now, downloading the appropriate Boost 1.60 binary release from http://boost.org and just copying the two libboost-serialization libraries into the Blender lib folder seems to work.
  • Build
  • Run cycles_server (--help shows the options; the default is CPU rendering with automatic thread count). Optional: set the environment variable CYCLES_PORT to listen on a non-standard port
  • Run Blender
  • Set the device to Network Rendering
  • Set the server option below to either a semicolon-separated list of IPs (or <ip>:<port> if you use a non-standard port), e.g. 192.168.0.10;192.168.0.11:7000, or "WAIT<x>" where x specifies how long you want to wait for autodetection (e.g. WAIT10)
  • Hit F12 or enable viewport rendering

Network rendering obviously still needs a lot of work to be usable in production, but for testing purposes it works well - in a test on AWS with 9 instances, the full BMW29 benchmark rendered in 17 seconds with near-perfect scaling and no tile queue underruns.

Diff Detail

Repository: rB Blender
Branch: network_master (branched from master)
Build Status: Buildable 776 - Build 776: arc lint + arc unit

Very cool!

I like the progress and QBVH changes, I guess those could be committed separately. The changes in device_network.cpp seem pretty safe to commit as well since that is not compiled by default anyway.

Generally agree with your other comments, some notes:

  • Boost archives: would be happy to get rid of that. I'm not really familiar with Protocol Buffers, it seems a little complicated to me? I'm wondering if there isn't something simpler we can implement ourselves in a few hundred lines of code without introducing library dependencies (rough sketch after this list). The other big dependency is Boost.Asio, and I'm not sure what the good alternatives are there. Networking might become part of C++20 or something, but that's still a long time away.
  • Loading images: this is indeed a concern, especially when image caching comes into it. The most generic solution I guess would be to still pass everything the same way as other memory buffers, but then do some automatic caching to disk on the server side, maybe even with compression or rsync-like partial updates. If the servers already have a copy of the image or there's a fast network drive set up, it would be faster to use that, but that has its own kinds of failure cases too.
  • Cascading data distribution: yes, I guess this would be the common method for rendering over the internet. Honestly, I was mainly expecting this to be for LAN initially, but if it works on AWS or similar as well, that's great.
  • Checkbox that enables Network Devices in addition to the selected local device: maybe this is best as a user preference too? If not, I guess we will want to make CPU and GPU checkboxes as well eventually.
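
For what it's worth, the "few hundred lines" replacement mentioned in the first point could start out roughly like this (a sketch under the assumption of a hand-rolled wire format, not a concrete proposal): fixed-width little-endian integers plus length-prefixed strings, so the layout doesn't depend on compiler or platform.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    /* Minimal hand-rolled serialization: everything is written as fixed-width
     * little-endian values, so the layout does not depend on compiler or OS. */
    class Writer {
    public:
      void write_uint32(uint32_t value)
      {
        for (int i = 0; i < 4; i++)
          data_.push_back(char((value >> (8 * i)) & 0xff));
      }

      void write_float(float value)
      {
        uint32_t bits;
        memcpy(&bits, &value, 4);
        write_uint32(bits);
      }

      void write_string(const std::string &value)
      {
        write_uint32(uint32_t(value.size()));
        data_.insert(data_.end(), value.begin(), value.end());
      }

      const std::vector<char> &data() const { return data_; }

    private:
      std::vector<char> data_;
    };

    class Reader {
    public:
      explicit Reader(const std::vector<char> &data) : data_(data), offset_(0) {}

      uint32_t read_uint32()
      {
        uint32_t value = 0;
        for (int i = 0; i < 4; i++)
          value |= uint32_t(uint8_t(data_[offset_++])) << (8 * i);
        return value;
      }

      float read_float()
      {
        uint32_t bits = read_uint32();
        float value;
        memcpy(&value, &bits, 4);
        return value;
      }

      std::string read_string()
      {
        uint32_t size = read_uint32();
        std::string value(data_.begin() + offset_, data_.begin() + offset_ + size);
        offset_ += size;
        return value;
      }

    private:
      const std::vector<char> &data_;
      size_t offset_;
    };
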
intern/cycles/app/CMakeLists.txt
92

This won't work for the cycles standalone repository.

intern/cycles/app/cycles_server.cpp
48–59

We could install cycles_server and cycles into 2.xx/scripts/addons/cycles/bin folder, so all the files are together and this hack can be avoided.

intern/cycles/app/cycles_standalone.cpp
379

Should document here to use semicolons for multiple servers.

intern/cycles/blender/addon/properties.py
167–172

I think this should be a user preference; it doesn't make much sense to save it in each .blend file.

Potentially you could have a way to save multiple named servers or combinations with local devices in the user preferences for easy switching, though an addon could do that as well.

intern/cycles/device/device.h
330

Use const string& for passing strings here and in other places.
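
For illustration only (the function name here is made up, not the one at this line):

    #include <string>
    using std::string;

    /* Before: copies the string argument on every call. */
    /* void set_server_address(string address); */

    /* After: passes by const reference, avoiding the copy. */
    void set_server_address(const string &address);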

intern/cycles/device/device_network.cpp
83

Would be good to comment why this is done.

90

No need for \n here I think.

100–104

We already need to keep these struct layouts compatible between CPU and GPU, so it's kind of surprising to me that this would be an issue. Or are we talking here about different WITH_CYCLES_XXX compilation options?

I guess in general we need to be careful with different Blender/Cycles versions and build options.

intern/cycles/device/device_network.h
90

Convenient for debugging, but I guess we don't want to commit this.

Sergey Sharybin (sergey) requested changes to this revision. Aug 25 2017, 9:16 AM

Oh, nice to see progress in this area!

Some general notes:

  • The configuration interface is rather clumsy for users. I would think having a discovery tool, where the client sends a request for all alive servers and shows them in a list, is more friendly (see the sketch after this list). Surely there are cases when you can't do this, but that's where you fall back to manual configuration when needed.
  • Is this implementation robust enough for network hiccups?
  • How does the server handle concurrent render requests from multiple artists?
  • How do you handle versioning of Cycles itself? Do you allow rendering with any Cycles version?
  • Are you transferring compressed data? Or at least over a compressing tunnel?
  • Multicast is kind of essential. Without these last two points, the network here in the studio will surely die from transferring gigabytes of data to each Cycles server...
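
The discovery flow suggested in the first point could look roughly like this sketch, assuming Boost.Asio stays in use (the port number and the message string are placeholders, not the patch's actual protocol): the client broadcasts a short query and collects replies for about a second to build the server list.

    #include <boost/asio.hpp>
    #include <chrono>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    using boost::asio::ip::udp;

    /* Broadcast a discovery query and collect replies for about one second.
     * Duplicate replies are not filtered in this sketch. */
    std::vector<std::string> discover_servers(unsigned short port)
    {
      boost::asio::io_service io_service;
      udp::socket socket(io_service, udp::endpoint(udp::v4(), 0));
      socket.set_option(boost::asio::socket_base::broadcast(true));

      const std::string query = "cycles_discover";
      socket.send_to(boost::asio::buffer(query),
                     udp::endpoint(boost::asio::ip::address_v4::broadcast(), port));

      socket.non_blocking(true);
      std::vector<std::string> servers;
      char reply[256];
      udp::endpoint sender;

      auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(1);
      while (std::chrono::steady_clock::now() < deadline) {
        boost::system::error_code ec;
        size_t len = socket.receive_from(boost::asio::buffer(reply), sender, 0, ec);
        if (!ec && len > 0) {
          /* Remember who answered; the reply could carry port/device info. */
          servers.push_back(sender.address().to_string());
        }
        else {
          std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
      }
      return servers;
    }

    int main()
    {
      for (const std::string &server : discover_servers(5120))
        std::cout << "found server: " << server << std::endl;
      return 0;
    }
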
intern/cycles/app/CMakeLists.txt
39

This shouldn't be needed.

intern/cycles/app/cycles_server.cpp
48–59

Either that, but I'm skeptical about putting a random binary into an addons folder (even though this happens at compile time). This is also something distro packages will forbid you from doing.

Also, I don't want to start asking users "hey, you should run the binary from this addon folder". So either it's a part of Blender itself (see below), or the whole Cycles server is distributed on its own, where you don't need any Blender-specific paths.

I think ideally the server should be started from inside Blender, similar to the framebuffer. You'd then have a nice interface to configure all server-side aspects without mucking around with command line arguments.

88

This should be broadcasting, not announcing to a single address. I don't see how announcing to a single address is helpful.

A setting for which interface to bind to also seems to be missing?

intern/cycles/device/device.h
58

Ouch, why are we putting device-specific settings into a base class?

273

I don't like this at all. If you're refactoring this, make it more generic rather than QBVH-specific. Otherwise yet another refactor will be needed.

intern/cycles/device/device_network.h
136

Seems you're refactoring this file as well. Please move all implementation out of the header file into the implementation file.

This revision now requires changes to proceed. Aug 25 2017, 9:16 AM

Here's a video of the amazing network render working on 640 threads like magic:

https://youtu.be/kOyamxpD_r0

This looks cool. Unfortunately the patch doesn't apply cleanly for me. Does this Differential system make it possible to tell what commit the patch is based on?

I'm testing a network/cluster rendering solution in an environment that doesn't allow broadcast or multicast. Please keep this type of environment in mind as this work progresses. My solution uses Docker-based render nodes that register themselves with a manager web app.

> Does this Differential system make it possible to tell what commit the patch is based on?

I see it now. I don't know how I missed it before.

Hello,

I gave this a try and it works really well.

I only found a couple of issues:

  • Baking with one local cycles_server is quite a bit slower than using the CPU (or GPU) directly.
  • Baking with n cycles_server instances is even slower: maybe the baking process generates a lot of network traffic? I wasn't sure how to interpret the results.
  • The "preview" render only works when there is one cycles_server running. As soon as there are more, the preview screen gets split horizontally and of course is not correct.

I can't wait to see this going into production!