Page MenuHome

Cycles: Add support of processor groups

Authored by Sergey Sharybin (sergey) on Jun 3 2016, 6:02 PM.



Currently for windows only, this is an initial commit towards native
support of NUMA.

Current commit makes it so Cycles will use all logical processors on
Windows running on system with more than 64 threads.

Diff Detail

rB Blender

Event Timeline

Couple of notes:

  • This bumps up requirement to WIndows7+, but perhaps we can solve that b doing dlopen() type of thing.
  • Seems it doesn't work really well on a systems with uneven separation of threads between groups, need to doublecheck that.
LazyDodo (LazyDodo) added inline comments.

this overflows when you go over 31 threads per group. 1ull works, but overflows at 64 threads per group.

group_affinity.Mask = (num_threads == 64) ? -1 :  (1ull << num_threads) - 1;

I'd ideally see this patch extended to offer the option of setting a render threads affinity to a specific core in a group (with the idea being, if a thread runs on the same core , it can re-use the l1/l2 caches on the cpu), I ran some tests on an i7-3770 recently and got some small but measurable gains. (Regular being a master build, TA same build but with each render thread assigned a specific core)

bmw 27 cpu
Regular 17:00.12 1020.12s 100%
TA      16.33.25 993.25s  97.4%

Regular 34.52.51 2092.51s 100%
TA      32:37.37 1957.37s 93.5%

Regular 24:29:23 1469.23s 100%
TA      24:03.96 1443.96s 98.3%

could be interesting to re-run a test like that on a larger box to see if it was a fluke or not.

Brecht Van Lommel (brecht) requested changes to this revision.

@LazyDodo (LazyDodo): those are very nice speedups, would be great to see this confirmed. Getting a 6% speedup in a scene like that is usually hard work.

Agreed about the overflow.


Would prefer this code to be structured a bit simpler:

size_t i = 0;

int num_groups = system_cpu_group_count();
for (int group = 0; group < num_groups && i < threads.size(); group++) {
   int system_cpu_group_thread_count(group);
   for (int thread = 0; thread < threads_per_group && i < threads.size(); thread++) {
        threads[i] = new thread(...);
This revision now requires changes to proceed.Jun 4 2016, 12:24 AM

Wow, that's quite measurable speedup!

However, a bit skeptical to just enable such affinity without extra tests. Will do it on various hardware in the studio and if nothing bad happens will enable that.


That was mainly a safety feature to prevent possible regression on nn-NUMA machine, to which i did not have access to when was working on orignial patch..

But now i tested it on laptop and it all works. Will submit simplified patch soon.


That's a good catch!


I just realized that my code doesn't handle the more threads than processors case, but still might be simpler with that I guess.


Not sure what you mean. Thread is a logical processor an included into system_cpu_group_thread_count(). So should be fine.


Sorry I wasn't clear, I meant the user specified num_threads bigger than the number of system threads. Not that this is very useful.


Ok, got it now. Yes, that's something i'm addressing atm. Think it's fine to do some simple-to-read but not very efficient way of dealing with this and just use residual number of threads for the last group (instead of the group size).

Sergey Sharybin (sergey) updated this revision to Diff 6855.

Address review feedback.

Constant thread affinity i'll test later this weekend,
would be better to commit it separately anyway.

Forgot to mentionm, inbalanced groups are now solved as well. So currently my TODO list about this patch is empty.

Seems fine except the Windows 7 minimum requirement, we still support Vista so I guess this needs a solution before it can be committed.

Eeeeh, sorry of the spam...

One thing which is separate form code and everything: would need to clarify whether this breaks Vista, whether we support Vista or whether we need to dlopen something to prevent compatibility issues.

LazyDodo (LazyDodo) added a comment.EditedJun 4 2016, 2:22 AM

I don't have a vista box to test it on, but i'm about 99.9999% sure if one would run this code on vista ,at process creation time you'd get a popup telling you
"Procedure entry point procedure entry point getactiveprocessorgroupcount could not be located in the dynamic link library kernel32.dll" (or any of the other win7+ apis) witht the process terminating after you click ok.

While there might not be many vista users, Vista shares the kernel with windows server 2008, which is probably worth supporting (as in groups don't work, but atleast blender should start) so yeah lower _WIN32_WINNT back to 0x600 and get the functions you need using LoadLibrary/GetProcAddress.

Sergey Sharybin (sergey) updated this revision to Diff 6857.

Use GetProcAddress() to get Group/NUMA specified functions

(well, it's only Group for now, NUMA is on TODO)

Seems fine except the Windows 7 minimum requirement, we still support Vista so I guess this needs a solution before it can be committed.

Well it is not really on topic here stats indicate < 0.7% marketshare for vista.
I personally find testing vista timeconsuming and would really like to see it dropped as a platform. would make us support it till next summer not sure I am up for that

@Martijn Berger (juicyfruit), Current version of patch should keep Vista working.

Brecht Van Lommel (brecht) accepted this revision.
Brecht Van Lommel (brecht) added inline comments.


This revision is now accepted and ready to land.Jun 4 2016, 11:34 PM

This should be wrapped with #if _WIN32_WINNT < 0x0601 as well, otherwise will get compile error once we've switched to Win7+.

This revision was automatically updated to reflect the committed changes.