This code allows pushing a set of different operations, all based on
iterations over a range of indices, and then processing them all at once
over multiple threads.
This is mainly interesting for a relatively low number of individual
tasks, as expected.
E.g. performance tests on a 32-thread machine, for a set of 10
different tasks, show the following improvements when using the pooled
version instead of ten sequential calls to BLI_task_parallel_range():
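| Num items | Sequential | Pooled  | Speed-up |
| --------- | ---------- | ------- | -------- |
| 10K       | 365 us     | 138 us  | 2.5 x    |
| 100K      | 877 us     | 530 us  | 1.66 x   |
| 1000K     | 5521 us    | 4625 us | 1.25 x   |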
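
The push-then-process pattern described in this message can be sketched
roughly as follows. This is only an illustration using plain pthreads
and C11 atomics, not the actual BLI_task implementation: every name in
it (RangePool, range_pool_push(), range_pool_work_and_wait(), the chunk
size, and so on) is hypothetical and chosen for the example. The idea
shown is that several index ranges are queued first, then split into
chunks that all worker threads pull from a single shared counter, so
threads are started and joined once instead of once per range.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* All names below are hypothetical; this is NOT the BLI_task API. */
#define POOL_MAX_TASKS 16
#define POOL_MAX_THREADS 64
#define POOL_CHUNK 256 /* Indices handed to a thread at once (arbitrary). */

typedef void (*RangeFunc)(void *userdata, int index);

typedef struct RangeTask {
  int start, stop; /* Half-open range [start, stop). */
  void *userdata;
  RangeFunc func;
} RangeTask;

typedef struct RangePool {
  RangeTask tasks[POOL_MAX_TASKS];
  int num_tasks;
  int chunk_offsets[POOL_MAX_TASKS + 1]; /* Prefix sums of chunks per task. */
  atomic_int next_chunk;                 /* Shared across all workers. */
} RangePool;

static void range_pool_init(RangePool *pool)
{
  pool->num_tasks = 0;
  atomic_init(&pool->next_chunk, 0);
}

/* Queue one range operation; nothing runs yet. Returns 0 if the pool is full. */
static int range_pool_push(RangePool *pool, int start, int stop, void *userdata, RangeFunc func)
{
  if (pool->num_tasks >= POOL_MAX_TASKS) {
    return 0;
  }
  pool->tasks[pool->num_tasks++] = (RangeTask){start, stop, userdata, func};
  return 1;
}

/* Each worker repeatedly grabs the next global chunk, whichever task owns it. */
static void *range_pool_worker(void *arg)
{
  RangePool *pool = arg;
  const int total = pool->chunk_offsets[pool->num_tasks];
  for (;;) {
    const int chunk = atomic_fetch_add(&pool->next_chunk, 1);
    if (chunk >= total) {
      break;
    }
    int t = 0;
    while (chunk >= pool->chunk_offsets[t + 1]) {
      t++; /* Find the task owning this chunk (empty tasks are skipped). */
    }
    const RangeTask *task = &pool->tasks[t];
    const int begin = task->start + (chunk - pool->chunk_offsets[t]) * POOL_CHUNK;
    const int end = (begin + POOL_CHUNK < task->stop) ? begin + POOL_CHUNK : task->stop;
    for (int i = begin; i < end; i++) {
      task->func(task->userdata, i);
    }
  }
  return NULL;
}

/* Run all queued ranges over `num_threads` threads, then reset the pool. */
static void range_pool_work_and_wait(RangePool *pool, int num_threads)
{
  assert(num_threads >= 1);
  if (num_threads > POOL_MAX_THREADS) {
    num_threads = POOL_MAX_THREADS;
  }
  pool->chunk_offsets[0] = 0;
  for (int t = 0; t < pool->num_tasks; t++) {
    const RangeTask *task = &pool->tasks[t];
    const int len = (task->stop > task->start) ? task->stop - task->start : 0;
    pool->chunk_offsets[t + 1] = pool->chunk_offsets[t] + (len + POOL_CHUNK - 1) / POOL_CHUNK;
  }
  atomic_store(&pool->next_chunk, 0);

  pthread_t threads[POOL_MAX_THREADS];
  int started = 0;
  for (int i = 0; i < num_threads - 1; i++) {
    if (pthread_create(&threads[started], NULL, range_pool_worker, pool) == 0) {
      started++;
    }
  }
  range_pool_worker(pool); /* The calling thread takes part in the work too. */
  for (int i = 0; i < started; i++) {
    pthread_join(threads[i], NULL);
  }
  pool->num_tasks = 0;
}

/* Example callback: store `index * scale` into an output array. */
typedef struct ScaleData {
  int *out;
  int scale;
} ScaleData;

static void scale_cb(void *userdata, int index)
{
  ScaleData *data = userdata;
  data->out[index] = index * data->scale;
}
```

A caller would initialize a pool, push each range with its own userdata
and callback, and then make a single range_pool_work_and_wait() call.
With small ranges, the cost of starting and synchronizing threads is
paid once rather than once per range, which matches the trend in the
figures above, where the gain shrinks as the ranges grow.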