
CUDA: use streams and async to avoid busy-waiting
Needs Revision · Public

Authored by Martijn Berger (juicyfruit) on Jan 27 2014, 11:19 PM.


Event Timeline

I also tested with 2 and 3 cards, and in both cases Blender uses close to no CPU time during rendering.

I still need to test on Windows, but I am confident about that.
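(For context, the pattern under review is roughly the following sketch. cuPathTrace, the launch dimensions and args are placeholders, and cuda_assert is the existing Cycles error-check macro; this illustrates the idea, not the diff itself.)

    CUstream stream;
    CUevent event;

    cuda_assert(cuStreamCreate(&stream, CU_STREAM_DEFAULT));
    /* CU_EVENT_BLOCKING_SYNC makes cuEventSynchronize() sleep the calling
     * thread instead of spinning until the GPU is done. */
    cuda_assert(cuEventCreate(&event, CU_EVENT_BLOCKING_SYNC));

    /* Issue the work asynchronously on the explicit stream. */
    cuda_assert(cuLaunchKernel(cuPathTrace,
                               xblocks, yblocks, 1,
                               xthreads, ythreads, 1,
                               0, stream, args, NULL));

    /* Record an event behind the launch and block on it; the host thread is
     * suspended by the driver until the kernel finishes. */
    cuda_assert(cuEventRecord(event, stream));
    cuda_assert(cuEventSynchronize(event));

    cuda_assert(cuEventDestroy(event));
    cuda_assert(cuStreamDestroy(stream));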

Brecht Van Lommel (brecht) requested changes to this revision. Jan 28 2014, 2:15 PM
Brecht Van Lommel (brecht) added inline comments.
intern/cycles/device/device_cuda.cpp:660

Is this still needed? It seems like this synchronizes things twice now?

Is render time affected, or is performance the same? I would expect it to be OK, but it's good to check anyway.

Martijn Berger (juicyfruit) updated this revision. Jan 28 2014, 3:06 PM

@Brecht Van Lommel (brecht) you were right about the double synchronization.

Windows 7 x64, i7 quad core, GeForce 540M

Vanilla:
  CPU usage: ~15%
  Render time: 1:37 min

With patch:
  CPU usage: < 5%
  Render time: 1:41 min

bmw.blend, 128×128 tile size.

The change is not entirely free of a performance impact, though it is very minor.

When the tile size is very small (around 1000 tiles) I measure ~3 to ~5 ms of extra latency per tile; I think this is due to how the async code is implemented. On a benchmark that takes 10 minutes with a 16x16 tile size, I measured a 4.23 second loss.

I think this is worth it: 4.23 s out of 10 minutes is about 0.7%, and a typical CUDA setup would use large tiles, where this cost amounts to well under 1%.

intern/cycles/device/device_cuda.cpp:660

I guess that "cuda_assert(cuCtxSynchronize())" can just be eliminated.
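I.e. the code around device_cuda.cpp:660 currently ends up doing something like the following (hypothetical shape, not the literal diff; cuDone and cuStream are made-up names), where the second, context-wide wait is redundant:

    /* With the explicit event wait in place, the kernel has already
     * completed by the time we get past this point... */
    cuda_assert(cuEventRecord(cuDone, cuStream));
    cuda_assert(cuEventSynchronize(cuDone));

    /* ...so this pre-existing context-wide synchronization waits a second
     * time for work that is already finished and can simply be dropped. */
    cuda_assert(cuCtxSynchronize());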

From my preliminary testing I could not measure a performance impact (I only tested BMW with large tiles).
I could try rendering a large image with very small tiles to hit the synchronization point many times per second.

Ok, that seems acceptable.

Probably what we have to do to avoid this performance loss (and likely improve performance) is to queue multiple tiles to render on the same GPU, using multiple streams. Doug Gale and I made a patch for this some time ago but never finished it. Here's the old diff in case anyone wants to experiment with it; it won't apply cleanly to the current code, but it gives an idea.
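Roughly, queueing tiles over multiple streams could look like this (an illustration under assumed names — tiles, num_tiles and launch_path_trace_for_tile are hypothetical — not Doug's patch):

    const int num_streams = 2;  /* placeholder count */
    CUstream streams[num_streams];

    for(int i = 0; i < num_streams; i++)
        cuda_assert(cuStreamCreate(&streams[i], CU_STREAM_NON_BLOCKING));

    /* Round-robin the tiles over the streams so the GPU always has work
     * queued while the host prepares the next launch. */
    for(int t = 0; t < num_tiles; t++) {
        CUstream stream = streams[t % num_streams];
        launch_path_trace_for_tile(tiles[t], stream);  /* hypothetical helper */
    }

    /* Wait for all queued tiles to finish. */
    for(int i = 0; i < num_streams; i++)
        cuda_assert(cuStreamSynchronize(streams[i]));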

I think we can also use the same stream for all operations: right now it seems we have two streams, the implicit one used by the blocking API and an explicit stream.
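A sketch of what routing everything through one explicit stream could look like (d_input, host_input, the buffer sizes and the launch parameters are assumed names, not the current code):

    /* One explicit stream per device: all copies and kernel launches are
     * issued on it, so nothing falls back to the implicit blocking
     * default stream. */
    CUstream stream;
    cuda_assert(cuStreamCreate(&stream, CU_STREAM_DEFAULT));

    cuda_assert(cuMemcpyHtoDAsync(d_input, host_input, input_size, stream));
    cuda_assert(cuLaunchKernel(cuPathTrace,
                               xblocks, yblocks, 1,
                               xthreads, ythreads, 1,
                               0, stream, args, NULL));
    cuda_assert(cuMemcpyDtoHAsync(host_output, d_output, output_size, stream));

    /* Single synchronization point, on that same stream. */
    cuda_assert(cuStreamSynchronize(stream));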

I'll also look at Doug's work and take some things from there. From memory, the multiple-tiles-per-device approach also had other advantages.

Hi, testing with the patch gives small performance differences, depending on the scene:

Scene       Vanilla    With patch
BMW         30.36      29.22
My Bench    1:12.44    1:05.35
The Tee     2:25.95    2:12.34

openSUSE 13.1 x64
Intel i5 3770K
GTX 760
GTX 560 Ti (448 cores)
Driver 331.20

CPU usage dropped from 400% to 15-20%.

Cheers, mib.

@Brecht Van Lommel (brecht) should I just merge this as is, and is that allowed with the current bcon level?

You can merge it, it's almost a bugfix.

reopened as of rB6b1a4fc66e

Cycles CUDA: revert the f1aeb2ccf4 and 84f958754 busy-wait fixes for now.
It's unclear what kind of impact they have on performance at the moment, so I'd
rather play it safe and postpone this for 2.71. T38679, T38712

Martijn Berger (juicyfruit) updated this revision. Feb 19 2014, 4:49 PM
Martijn Berger (juicyfruit) updated this revision. Feb 19 2014, 10:04 PM

I changed the code so that after 10 samples it calculates how many samples it could do in 1000 milliseconds, then forces a sync, evaluates again how far it is from the target value, and so on. It should sync about once per second, but we might change that.
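In code the idea is roughly the following sketch (not the literal patch; launch_sample_kernel is a hypothetical helper and time_dt() stands for any wall-clock timer returning seconds):

    const double target_interval = 1.0;  /* aim to sync roughly once per second */
    int per_sync = 10;                   /* initial batch size, re-evaluated below */

    int sample = 0;
    while(sample < num_samples) {
        int batch = per_sync;
        if(batch > num_samples - sample)
            batch = num_samples - sample;

        double start = time_dt();

        for(int i = 0; i < batch; i++)
            launch_sample_kernel(sample + i, stream);  /* hypothetical helper */

        /* One blocking sync per batch instead of one per sample. */
        cuda_assert(cuStreamSynchronize(stream));
        sample += batch;

        /* Scale the batch so the next one takes about target_interval seconds. */
        double elapsed = time_dt() - start;
        if(elapsed > 0.0) {
            per_sync = (int)(batch * target_interval / elapsed);
            if(per_sync < 1)
                per_sync = 1;
        }
    }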

For real scenes I get a minor speedup over master, and tiles still update, though perhaps a bit less often.

This revision is now accepted and ready to land. Nov 21 2014, 5:05 PM
Martijn Berger (juicyfruit) edited edge metadata.

Cleaned up the patch, as it was reverted for good reasons.

Todo:

  • Clean up the whole sync notion and maybe add an explicit device toggle

Update the patch to work with current master

Brecht Van Lommel (brecht) requested changes to this revision. Edited Nov 5 2016, 9:47 PM
Brecht Van Lommel (brecht) edited edge metadata.

I'm wondering if this is actually thread safe. We are writing multiple samples to the same pixels in parallel, without any atomic operations. I would think that can fail, especially with small tiles? Or are these kernel launches already guaranteed to run one after the other on the GPU?

Maybe we should add atomic operations for writing to passes in the kernel; I think we'll need to do this sooner or later anyway.
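On the kernel side that would look something like this sketch (assuming the pass write helper keeps the kernel_write_pass_float shape used in the Cycles kernel headers; this is not the actual change):

    ccl_device_inline void kernel_write_pass_float(ccl_global float *buffer, float value)
    {
        /* An unsynchronized "*buffer += value" can lose samples when two
         * launches accumulate into the same pixel concurrently; a float
         * atomicAdd() (available on sm_20 and newer) makes the
         * read-modify-write safe. */
        atomicAdd(buffer, value);
    }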

intern/cycles/device/device_cuda.cpp:95

Would make this a double value in seconds for simplicity.

intern/cycles/device/device_cuda.cpp:220

Use CU_EVENT_BLOCKING_SYNC rather than 0x1.
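I.e. roughly:

    /* 0x1 is CU_EVENT_BLOCKING_SYNC; the named flag documents the intent. */
    cuda_assert(cuEventCreate(&event, CU_EVENT_BLOCKING_SYNC));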

This revision now requires changes to proceed. Nov 5 2016, 9:47 PM

Thinking this needs further updates:

  • either use streams properly or revert to default-stream semantics
  • explore using a stream callback (see the sketch below).
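A sketch of the stream-callback option (RenderTile here is just the tile handle and update_tile_progress() is a hypothetical notification hook; not an actual implementation):

    /* Invoked by the driver from one of its own threads once all work queued
     * on the stream before cuStreamAddCallback() has finished. The callback
     * must not make CUDA API calls itself. */
    static void CUDA_CB tile_done_callback(CUstream stream, CUresult status, void *user_data)
    {
        RenderTile *tile = (RenderTile *)user_data;
        update_tile_progress(tile);  /* hypothetical: notify the tile/progress manager */
    }

    /* After queueing the tile's kernel on the stream: */
    cuda_assert(cuLaunchKernel(cuPathTrace,
                               xblocks, yblocks, 1,
                               xthreads, ythreads, 1,
                               0, stream, args, NULL));

    /* Flags must be 0; the callback fires once the launch above completes. */
    cuda_assert(cuStreamAddCallback(stream, tile_done_callback, (void *)&tile, 0));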