This patch contains a number of changes to the Cycles network rendering code that make it actually work. Supported so far:
- Rendering on another machine
- Rendering on multiple machines using the MultiDevice
- CPU and CUDA devices on servers
- Viewport rendering (although latency is an issue here)
- Hiding network latency by keeping a tile queue on each server, so that tiles can be released and acquired in a background thread while the worker threads continue rendering (see the first sketch after this list)
- Automatic detection of servers using either UDP broadcast or, if broadcasting is not supported on the network, having the servers announce their presence to a specific host (see the second sketch after this list)
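To illustrate the tile queue mechanism, here is a minimal sketch - all names (TileQueue, request_tile_from_host etc.) are illustrative, not the actual patch code. A background thread on each server keeps the queue filled, so the worker threads only block if the queue underruns:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

struct Tile {
  int x, y, w, h;
};

class TileQueue {
 public:
  explicit TileQueue(int target_depth) : target_depth_(target_depth) {}

  /* Called by worker threads; only blocks if the queue underruns. */
  Tile acquire()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    nonempty_.wait(lock, [this] { return !tiles_.empty(); });
    Tile tile = tiles_.front();
    tiles_.pop();
    refill_.notify_one(); /* Wake the background thread to fetch more. */
    return tile;
  }

  /* Runs in a background thread, overlapping network latency with rendering. */
  void prefetch_loop()
  {
    for (;;) {
      std::unique_lock<std::mutex> lock(mutex_);
      refill_.wait(lock, [this] { return (int)tiles_.size() < target_depth_; });
      lock.unlock();
      Tile tile = request_tile_from_host(); /* Blocking network round trip. */
      lock.lock();
      tiles_.push(tile);
      nonempty_.notify_one();
    }
  }

 private:
  /* Stand-in for the actual network request to the host. */
  Tile request_tile_from_host() { return Tile{0, 0, 64, 64}; }

  std::mutex mutex_;
  std::condition_variable nonempty_, refill_;
  std::queue<Tile> tiles_;
  int target_depth_;
};
```

Releasing finished tiles works the same way in the other direction, so the render threads never have to wait for a full network round trip.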
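The autodetection could look roughly like this on the server side - a POSIX sockets sketch, with the payload format and port handling made up for illustration. The server periodically sends a small announcement packet, either to the broadcast address or to a specific host:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Announce this server's presence via UDP. With dest = "255.255.255.255"
 * this is a broadcast; on networks without broadcast support, dest can be
 * the address of the host machine instead. */
static void announce_server(const char *dest, int port)
{
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  int yes = 1;
  setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));

  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  inet_pton(AF_INET, dest, &addr.sin_addr);

  const char msg[] = "CYCLES_SERVER"; /* Made-up payload format. */
  sendto(sock, msg, sizeof(msg), 0, (sockaddr *)&addr, sizeof(addr));
  close(sock);
}
```

The host simply listens on the same port and collects the sender addresses until the autodetection timeout expires.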
Not supported yet are:
- Denoising. It could be added easily, but I think latency would make tile mapping a significant bottleneck across multiple servers. Since denoising is usually much faster than rendering, a reasonable fallback might be to denoise locally on the host machine.
- Cross-platform networking. At least in my tests, Boost Archives don't work between Linux and Windows. An interesting alternative might be Protocol Buffers, especially since C++11 will allow us to get rid of the other parts of Boost.
- Multicasting. Currently the data is sent to each server individually.
- Error handling. Right now, it usually just crashes when something goes wrong.
- OpenCL. It currently renders a black image.
- MultiDevices on servers, so currently only a single GPU can be used per server. To support them, we'd need to move away from passing SubDevice pointers to acquire_tile and towards a device ID system that supports nested MultiDevices (one on the host, one on each server); see the device-ID sketch after this list.
- Loading images locally on the servers. While the actual rendering generally runs at native speed thanks to the tile queue system, the sync stage takes very long, especially with low bandwidth. The largest part of most scenes is image data, which coincidentally is also generally the most static data across frames/rerenders. Therefore, it makes sense to copy the images to the servers once and then load them from local disk.
- Cascading data distribution. In networks that don't support multicast, the sync time currently is O(n) in the number of nodes. By implementing a cascading system where the host first sends the data to server 0, then to server 1 while server 0 sends it to server 2, and so on, the number of nodes that have the data doubles every round, which reduces the sync time to O(log n) as long as the switches etc. hold up to the traffic (see the cascading sketch after this list).
- Servers joining/leaving mid-render. In theory, that wouldn't be too hard: for joining, the host remembers which buffers are currently allocated, and when a server joins, it is sent the data followed by the current task; for leaving, the host would need to keep a list of the tiles currently on each server and redistribute them to the others when a server disconnects (see the tracking sketch after this list). That would allow for great flexibility: when you notice that the render is too slow, you can just add instances on demand. Supporting this is also important for e.g. using AWS spot instances effectively.
- Adaptive queues. Currently, a rather simple heuristic is used to determine the number of tiles queued on each server, but by measuring tile times and latency, the number of tiles actually needed could be determined (see the queue-depth sketch after this list).
- Networking protocols. Currently the code is written for TCP/IP, but only a very small part of the code actually cares about that, so it should be easy to add support for things like MPI or InfiniBand in the future.
- Message queues. The current code disables Nagle's algorithm, since it caused huge latency (100 ms round trips on localhost) in code that sends a packet and then waits for a reply: the OS waits for a while before actually sending the packet. However, the general idea behind the algorithm makes sense, and the communication pattern indeed shows sequences of sent packets that don't actually require waiting for a reply. Therefore, it might make sense to implement a local message queue that is simply flushed as soon as a receive call happens, or explicitly after sending a packet that requires an answer (see the message-queue sketch after this list).
- Instead of making network rendering an option like CPU and GPU, make it a checkbox that enables Network Devices in addition to the selected local device.
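Regarding the device ID system mentioned above: a minimal sketch of the idea (the Device stand-in and resolve helper are illustrative, not actual Cycles code) would be to identify a device by its index path through the hierarchy instead of by a SubDevice pointer:

```cpp
#include <vector>

/* Minimal stand-in for a device that may contain sub-devices. */
struct Device {
  std::vector<Device *> sub_devices; /* Empty for leaf devices. */
};

/* {1, 0} would mean "sub-device 0 of the MultiDevice on server 1", which
 * nests cleanly across the host MultiDevice and the server MultiDevices. */
struct DeviceID {
  std::vector<int> path;
};

static Device *resolve(Device *root, const DeviceID &id)
{
  Device *device = root;
  for (int index : id.path)
    device = device->sub_devices[index];
  return device;
}
```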
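For the cascading distribution, the schedule itself is simple; this sketch (hypothetical, just to show the doubling) prints which node sends to which server in each round:

```cpp
#include <cstdio>
#include <vector>

static void plan_cascade(int num_servers)
{
  std::vector<int> have = {-1}; /* Nodes that have the data; -1 is the host. */
  int next = 0, round = 0;
  while (next < num_servers) {
    size_t senders = have.size(); /* Everyone with the data sends in parallel. */
    for (size_t i = 0; i < senders && next < num_servers; i++, next++) {
      printf("round %d: node %d sends to server %d\n", round, have[i], next);
      have.push_back(next);
    }
    round++;
  }
}
```

With 8 servers this takes 4 rounds instead of 8 sequential transfers, since the number of senders doubles each round.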
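For servers leaving mid-render, the host-side bookkeeping could look like this sketch (TileTracker is hypothetical): the host records which tiles each server currently owns and requeues them when the connection drops:

```cpp
#include <map>
#include <utility>
#include <vector>

struct Tile {
  int x, y, w, h;
};

class TileTracker {
 public:
  void assign(int server, const Tile &tile)
  {
    owned_[server].push_back(tile);
  }

  void finish(int server, const Tile &tile)
  {
    std::vector<Tile> &tiles = owned_[server];
    for (size_t i = 0; i < tiles.size(); i++) {
      if (tiles[i].x == tile.x && tiles[i].y == tile.y) {
        tiles.erase(tiles.begin() + i);
        break;
      }
    }
  }

  /* On disconnect, return the unfinished tiles so they can be requeued. */
  std::vector<Tile> release_server(int server)
  {
    std::vector<Tile> orphaned = std::move(owned_[server]);
    owned_.erase(server);
    return orphaned;
  }

 private:
  std::map<int, std::vector<Tile>> owned_;
};
```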
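For the adaptive queues, the required depth follows from how many tiles the server finishes during one network round trip. A sketch of such a heuristic (the safety margin is an assumption, not something measured in the patch):

```cpp
#include <cmath>

/* With num_threads workers each taking avg_tile_seconds per tile, the server
 * consumes tiles at num_threads / avg_tile_seconds per second; the queue must
 * cover at least one round trip at that rate to avoid underruns. */
static int needed_queue_depth(double avg_tile_seconds,
                              int num_threads,
                              double roundtrip_seconds)
{
  double tiles_per_second = num_threads / avg_tile_seconds;
  int in_flight = (int)std::ceil(roundtrip_seconds * tiles_per_second);
  /* Keep one extra tile per worker as a safety margin against variance. */
  return in_flight + num_threads;
}
```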
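For the message queues, a sketch of the proposed batching (the socket helpers are stubs): sends only fill a local buffer, and the buffer is flushed in a single write whenever a reply is actually needed:

```cpp
#include <string>
#include <vector>

class MessageQueue {
 public:
  /* Queue a packet locally without touching the socket. */
  void send(const std::string &packet)
  {
    pending_.insert(pending_.end(), packet.begin(), packet.end());
  }

  /* A blocking receive implies the peer must see our packets first. */
  std::string receive()
  {
    flush();
    return read_from_socket();
  }

  /* One write per batch instead of one per packet, without relying on
   * Nagle's algorithm and its send delays. */
  void flush()
  {
    if (!pending_.empty()) {
      write_to_socket(pending_.data(), pending_.size());
      pending_.clear();
    }
  }

 private:
  std::string read_from_socket() { return ""; } /* Stand-in for recv(). */
  void write_to_socket(const char *, size_t) {} /* Stand-in for send(). */

  std::vector<char> pending_;
};
```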
How to use it:
- Enable WITH_CYCLES_NETWORK (no need for standalone)
- If you're on Windows (and OSX, I guess), you need to get libboost-serialization. For now, downloading the appropriate Boost 1.60 binary release from http://boost.org and copying the two libboost-serialization libraries into the Blender lib folder seems to work.
- Run cycles_server (--help shows the options; the default is CPU rendering with an automatic thread count). Optional: set the environment variable CYCLES_PORT to listen on a non-standard port
- Run Blender
- Set the device to Network Rendering
- Set the server option below to either a semicolon-separated list of IPs (or <ip>:<port> if you use a non-standard port, e.g. 192.168.0.10;192.168.0.11:7000) or "WAIT<x>", where x specifies how long you want to wait for autodetection
- Hit F12 or enable viewport rendering
Network rendering obviously still needs a lot of work to be usable in production, but for testing purposes it works well: in a test on AWS with 9 instances, the full BMW27 benchmark rendered in 17 seconds with near-perfect scaling and no tile queue underruns.