
Particle threading/task optimization
Closed, Public

Authored by Juan Gea (juang3d) on May 21 2019, 1:39 PM.

Details

Summary

While checking the particle system code, I located a hardcoded value that limited the maximum number of particles handled per task to 256.

Suspecting that this was problematic and probably an old hard limit, I decided to explore it, with great success.
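Roughly speaking, the old behaviour amounts to something like the following sketch (illustrative only, not the exact code in master):

/* Illustrative sketch: with a fixed batch of 256 particles per task, the
 * number of tasks grows linearly with the particle count, so millions of
 * particles produce thousands of tiny tasks and the scheduling overhead
 * dominates. The helper name below is a placeholder. */
#define MAX_PARTICLES_PER_TASK 256

static int old_numtasks(int totpart)
{
  return (totpart + MAX_PARTICLES_PER_TASK - 1) / MAX_PARTICLES_PER_TASK;
}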

In Blender master as it is right now, creating 4 million particles on an i7-7700HQ took 31 seconds; with the initial modifications in this patch it took 4 seconds.

@Oscar 'Nebe' Abad (OscarNebeAbad) helped me with the tests and @SavMartin and Zebus3D helped me with the first concepts of this patch.
So far I have implemented some logic to balance the number of particles per task that different CPUs handle, and it is working well on the different systems we used to test it:

i7-950 with 48 GB of RAM
i7-7700HQ with 24 GB of RAM
Threadripper 2990WX with 64 GB of RAM

There is a notable speed increase. At first I thought it would only affect particle creation time, but it seems to affect simulation baking time as well. I have not explored that part of the code yet, but I wanted to keep this patch simple, since it already improves particle creation a lot.

So far it is pretty stable in our tests, but any further testing would be very welcome.

This is my first diff, so allow me to express my happiness with it, even if in the end it is not included in master.

I hope this is included even taking into account that particles are going to be rewritten; for the time being it is a huge boost in particle performance. @Sebastián Barschkis (sebbas) tested this concept and also noticed a speed improvement with Mantaflow, and @Martin Felke (scorpion81) will test it as well, but our first impression is that, as it stands, this is a huge improvement over the hardcoded value we had in the past.

Diff Detail

Repository
rB Blender

Event Timeline

Jacques Lucke (JacquesLucke) requested changes to this revision. May 21 2019, 1:52 PM
  • Generally seems like a good idea to use another number than 256. Can you provide a test file that allows us to easily verify the performance increase?
  • Please remove the old comment, it does not add any value. Also remove the timing code which should not be part of this patch.
  • I think it should not be a define anymore after this change. Just use a normal function call.
  • Please run clang-format (make format) on the changes to ensure that the code style is correct.

Generally I wonder why the max number should be higher when there are more cores.
Can anyone explain why this makes sense? Maybe it's just a coincidence because more cores could mean that the CPU cache is larger?

This revision now requires changes to proceed. May 21 2019, 1:52 PM
  • Can you provide a test file that allows us to easily verify the performance increase?

Yes, I will, but so far the setup is simple: create a sphere, add a particle system, and emit 4 million particles on frame 1; you will see the speed increase immediately. I will still prepare a .blend so anyone can use the same file.

  • Please remove the old comment, it does not add any value. Also remove the timing code which should not be part of this patch.

Ok

  • I think it should not be a define anymore after this change. Just use a normal function call.

I left the define because I was not sure whether it was used somewhere else; I'm new to modifying this kind of code. So you mean that I should just create a function "static int MAX_PARTICLES_PER_TASK(void)" and call that function instead of using the define?

  • Please run clang-format (make format) on the changes to ensure that the code style is correct.

Ok

  • Generally I wonder why the max number should be higher when there are more cores.

Can anyone explain why this makes sense? Maybe it's just a coincidence because more cores could mean that the CPU cache is larger?

We are also not sure, but we reached the same conclusion: the CPU cache. However, our tests gave better performance depending on the CPU and the maximum number of particles per task when working with large particle counts.

I'll do the changes ASAP and upload them.

Thanks!

I left the define because I was not sure whether it was used somewhere else; I'm new to modifying this kind of code. So you mean that I should just create a function "static int MAX_PARTICLES_PER_TASK(void)" and call that function instead of using the define?

You could just do a simple search for that define on the entire code base. It is only used in this one place.
When making it a function, it should have a name that is not all upper case.
Maybe you don't even need this function when you implement the simplification you talked about in blender-chat.
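For illustration only, a plain function in place of the define might look like the sketch below (the return value is a placeholder, not the logic from the patch, and as noted it may not be needed at all after the simplification):

/* Sketch only: a lower-case function replacing the old all-caps define.
 * The returned value is a placeholder. */
static int max_particles_per_task(void)
{
  return 256;
}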

As discussed on blender.chat, for the 4 million particles case and the tested CPUs, I think this logic boils down to:

int numtasks = min_ii(max_ii(BLI_system_thread_count(), 16), endpart - startpart);

Basically, change the strategy from batches of 256 particles per task to about one task per thread.

This would help avoid the locking overhead that you get with many tasks. The downside is that with an uneven distribution of work some threads may be idle, but I'm not sure that is much of a concern for particles. And you could probably increase the number of tasks a bit, e.g. by 4x, without much overhead.

int numtasks = min_ii(BLI_system_thread_count() * 4, endpart - startpart);
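For context, a sketch of how such a task count would split the particle range into roughly even slices (the helper is hypothetical, not the actual Blender scheduling code; min_ii() and BLI_system_thread_count() are the existing utilities used above):

static void distribute_particle_tasks(int startpart, int endpart)
{
  int totpart = endpart - startpart;
  if (totpart <= 0) {
    return;
  }
  /* About 4 tasks per thread, but never more tasks than particles. */
  int numtasks = min_ii(BLI_system_thread_count() * 4, totpart);
  int particles_per_task = (totpart + numtasks - 1) / numtasks;

  for (int i = 0; i < numtasks; i++) {
    int task_start = startpart + i * particles_per_task;
    int task_end = min_ii(task_start + particles_per_task, endpart);
    /* Each [task_start, task_end) slice would be submitted as one task. */
    (void)task_start;
    (void)task_end;
  }
}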

Following Brecht's suggestion, the patch has been greatly simplified, and its performance is even a bit better.

I just tested the patch on Windows 10 64-bit, with a 16-thread i9 and 32 GB of RAM, and the difference with 10,000,000 particles is huge. Current master takes more than 1.5 minutes, but with the new patch it takes roughly 5 seconds.

This revision was not accepted when it landed; it landed in state Needs Review. May 21 2019, 4:57 PM
This revision was automatically updated to reflect the committed changes.