
BLI_task: Add new generic `BLI_task_parallel_iterator()`.
ClosedPublic

Authored by Bastien Montagne (mont29) on Jul 30 2019, 3:30 PM.

Details

Summary

This new function is part of the 'parallel for loops' family of functions. It
takes an iterator callback that generates the items to be processed, in
addition to the usual 'process' callback.

This allows common code from BLI_task to be reused for a wide range of custom
iterators, without having to re-invent the wheel of the whole task & data
chunk handling.

This supports all the settings and features of BLI_task_parallel_range(),
including dynamic and static scheduling (the latter when the total number of
items is known), TLS data and its finalize callback, etc.

One question here is whether we should provide user code with a spinlock by
default, or require it to always handle its own synchronization. I kept the
spinlock, since imho it will be needed very often, and generating one is
pretty cheap even if unused...
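
For illustration, here is a minimal, self-contained sketch of that pattern, assuming plain POSIX threads rather than the real BLI_task scheduler; the names (SharedState, iter_cb, process_cb, CHUNK_SIZE) are invented for the example and are not the actual API. A shared lock guards the iterator state while a worker pulls a whole chunk of items; the per-item callback then runs with no lock held.

```c
/* Illustrative sketch only -- not the actual BLI_task API. Assumes POSIX threads. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define CHUNK_SIZE 8 /* items handed out per lock acquisition */

typedef struct Node {
  struct Node *next;
  int value;
} Node;

typedef struct SharedState {
  Node *cursor;         /* next item the iterator callback will hand out */
  pthread_mutex_t lock; /* stands in for the lock provided by default */
} SharedState;

/* 'Iterator' callback: fills `chunk` with up to `max` items, returns the count.
 * Must be called with the lock held, since it mutates shared state. */
static int iter_cb(SharedState *state, Node **chunk, int max)
{
  int n = 0;
  while (n < max && state->cursor != NULL) {
    chunk[n++] = state->cursor;
    state->cursor = state->cursor->next;
  }
  return n;
}

/* 'Process' callback: works on one item, no synchronization needed because
 * every item is handed to exactly one worker. */
static void process_cb(Node *item)
{
  item->value *= 2;
}

static void *worker(void *arg)
{
  SharedState *state = arg;
  Node *chunk[CHUNK_SIZE];
  for (;;) {
    pthread_mutex_lock(&state->lock);
    const int n = iter_cb(state, chunk, CHUNK_SIZE); /* lock held only here */
    pthread_mutex_unlock(&state->lock);
    if (n == 0) {
      break;
    }
    for (int i = 0; i < n; i++) {
      process_cb(chunk[i]); /* unlocked per-item work */
    }
  }
  return NULL;
}

int main(void)
{
  enum { TOT = 100 };
  Node *nodes = calloc(TOT, sizeof(Node));
  for (int i = 0; i < TOT; i++) {
    nodes[i].value = 1;
    nodes[i].next = (i + 1 < TOT) ? &nodes[i + 1] : NULL;
  }

  SharedState state = {.cursor = nodes};
  pthread_mutex_init(&state.lock, NULL);

  pthread_t threads[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; i++) {
    pthread_create(&threads[i], NULL, worker, &state);
  }
  for (int i = 0; i < NUM_THREADS; i++) {
    pthread_join(threads[i], NULL);
  }

  long sum = 0;
  for (int i = 0; i < TOT; i++) {
    sum += nodes[i].value;
  }
  printf("sum = %ld (expected 200)\n", sum);

  pthread_mutex_destroy(&state.lock);
  free(nodes);
  return 0;
}
```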


Additionally, this commit converts the (currently unused)
BLI_task_parallel_listbase() to use that generic code. This was done mostly
as a proof of concept, but performance-wise it shows some interesting data,
roughly:

  • Very light processing (that should not be threaded anyway) is several times slower, which is expected due to more overhead in loop management code.
  • Heavier processing can be up to 10% quicker (probably thanks to the switch from dynamic to static scheduling, which greatly reduces the locking needed to fill in the per-task chunks of data; see the sketch after this list). A similar speed-up in the non-threaded case comes as a surprise though, not sure what explains that.
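
As a rough sketch of what "static scheduling" means here, assuming the total item count is known up front (the names and splitting scheme are illustrative, not the actual BLI_task implementation): each task gets one precomputed contiguous range, so no lock is needed while iterating, whereas dynamic scheduling has workers repeatedly take a shared lock to grab the next chunk.

```c
/* Illustrative sketch only: precompute one contiguous range per task. */
#include <stdio.h>

#define NUM_TASKS 4

int main(void)
{
  const int tot_items = 103; /* total known in advance */
  const int base = tot_items / NUM_TASKS;
  const int extra = tot_items % NUM_TASKS;
  for (int t = 0; t < NUM_TASKS; t++) {
    /* Split as evenly as possible; the remainder goes to the first tasks. */
    const int start = t * base + (t < extra ? t : extra);
    const int size = base + (t < extra ? 1 : 0);
    printf("task %d: items [%d, %d)\n", t, start, start + size);
  }
  return 0;
}
```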

While this conversion is not really needed, imho we should keep it (instead
of the existing code for that function): it is easier to maintain and improve
complex handling logic when it lives in as few places as possible.

Note: this work was initially done to make D5372 possible... Unfortunately, that one proved to be no better than the original code performance-wise.

Diff Detail

Repository
rB Blender

Event Timeline

One question here is whether we should provide user code with a spinlock by default, or require it to always handle its own synchronization. I kept the spinlock, since imho it will be needed very often, and generating one is pretty cheap even if unused.

Keep it as simple as possible for the user callback. No need to require them to worry about locks. People will forget to do it, use the wrong synchronization mechanisms, and so on.

Very light processing (that should not be threaded anyway) is several times slower, which is expected due to more overhead in loop management code.

If it is not to be threaded, where is the overhead coming from?


What is unclear to me is how the user's iterator callback is supposed to return values. Is the callback responsible for partitioning and buffering multiple iterator values into one array?

source/blender/blenlib/BLI_task.h
209

That or this data?

215

Task or thread?

source/blender/blenlib/intern/task.c
1064

Ellipsis -> full stop.

1294

Maybe user_tls ?

If you are trying to parallelize operations over a data structure like a listbase, and you need some spinlock to iterate to the next item, then I think you have a deeper problem. That is never going to work well on something like a 64 core machine. The data structure should be changed if you want to parallelize operations over it.

So I'm not sure it's a good idea to provide abstractions to support that case, unless I'm misunderstanding the purpose of this patch.

It depends on the complexity of the function which handles each element.

This will not be a magic bullet, but it might help in some use cases.
To play devil's advocate: TBB has such a concept, which they call parallel_do. It is worth studying what's under the hood there.

https://software.intel.com/en-us/node/506158

Bastien Montagne (mont29) marked 4 inline comments as done. Jul 30 2019, 5:12 PM

Very light processing (that should not be threaded anyway) is several times slower, which is expected due to more overhead in loop management code.

If it is not to be threaded, where is the overhead coming from?

No, the case here is e.g. a dummy function just adding a single int value, on a listbase of 10k items. BLI_task code would thread this, because of the number of items, but performance would obviously be rather bad. So the "that should not be threaded anyway" here meant that a dev should never call the threaded iterator for such a quick task. ;)

What is unclear to me is how the user's iterator callback is supposed to return values. Is the callback responsible for partitioning and buffering multiple iterator values into one array?

If the iterator callback needs to return values, then yes, that is its own responsibility (though typically, to avoid locking overhead, the best thing to do is to keep one's own 'result' struct in the TLS and, if needed, concatenate the results from all processed chunks in the finalize callback, which is always run in the calling thread; see the sketch below).
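
A minimal sketch of that "result in TLS, merge in finalize" shape, with threading omitted and all names (UserTLS, process_cb, finalize_cb) invented for the example; this is not the BLI_task API. In the real code each TLS struct would be filled by a different worker thread, and the finalize step runs single-threaded in the calling thread.

```c
/* Illustrative sketch only -- per-"thread" partial results merged in a finalize step. */
#include <stdio.h>
#include <string.h>

#define MAX_HITS 16

typedef struct UserTLS {
  int hits[MAX_HITS]; /* per-thread partial result, filled without locking */
  int num_hits;
} UserTLS;

/* 'Process' callback: appends matching items to its own TLS, no sync needed. */
static void process_cb(UserTLS *tls, int item)
{
  if (item % 3 == 0 && tls->num_hits < MAX_HITS) {
    tls->hits[tls->num_hits++] = item;
  }
}

/* 'Finalize' callback: concatenates all per-thread results, single-threaded. */
static void finalize_cb(const UserTLS *all_tls, int num_tls, UserTLS *result)
{
  result->num_hits = 0;
  for (int t = 0; t < num_tls; t++) {
    for (int i = 0; i < all_tls[t].num_hits && result->num_hits < MAX_HITS; i++) {
      result->hits[result->num_hits++] = all_tls[t].hits[i];
    }
  }
}

int main(void)
{
  UserTLS tls[2];
  memset(tls, 0, sizeof(tls));
  /* Pretend thread 0 processed items 0..9 and thread 1 items 10..19. */
  for (int i = 0; i < 10; i++) process_cb(&tls[0], i);
  for (int i = 10; i < 20; i++) process_cb(&tls[1], i);

  UserTLS merged;
  finalize_cb(tls, 2, &merged);
  for (int i = 0; i < merged.num_hits; i++) {
    printf("%d ", merged.hits[i]); /* 0 3 6 9 12 15 18 */
  }
  printf("\n");
  return 0;
}
```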

If you are trying to parallelize operations over a data structure like a listbase, and you need some spinlock to iterate to the next item, then I think you have a deeper problem. That is never going to work well on something like a 64 core machine. The data structure should be changed if you want to parallelize operations over it.
So I'm not sure it's a good idea to provide abstractions to support that case, unless I'm misunderstanding the purpose of this patch.

The whole point of these iterators is to iterate over chunks (of tens of items by default) at once, precisely to avoid locking on each and every item. We only lock once to acquire a whole chunk, and then happily work on it in an unlocked state.

source/blender/blenlib/BLI_task.h
209

Eeeeeh... Does it matter at all here?

Bastien Montagne (mont29) marked an inline comment as done.

Updated from review.

  • Simplify iteration handling for users of the new BLI_task_parallel_iterator().

Updated against latest master.

Don't mind this at all.
Eventually we would switch this custom code to some C++ API-ification and use TBB's parallel_do, but the code for the current task scheduler is already here, and it seems useful for achieving the new parallel_range functionality we've discussed here with Bastien.

P.S. The idea of the change to parallel_for is to "merge" separate invocations of BLI_task_parallel_range into a single pool. See BKE_subdiv_foreach_subdiv_geometry(), where we can achieve an extra speedup by traversing verts/faces/etc. in the same pool, increasing thread occupancy.

This revision is now accepted and ready to land. Oct 21 2019, 10:55 AM