nvcc doesn't optimize clamp(x, 0.0f, 1.0f) into a single instruction and emits four instructions instead. Using the built-in saturate function makes the code a little cleaner and better optimized.
Typical diff in PTX (the same change shows up in SASS), 92 replacements:
< mov.f32 %f6781, 0f00000000;
< max.ftz.f32 %f6782, %f1500, %f6781;
< mov.f32 %f6783, 0f3F800000;
< min.ftz.f32 %f10110, %f6782, %f6783;
---
> cvt.ftz.sat.f32.f32 %f10029, %f1500;
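For reference, a minimal sketch of the two source-level forms behind this diff. The helper names `clamp_generic` and `saturate_f` are hypothetical and not taken from the patch; the only CUDA call used is the `__saturatef` intrinsic, which clamps its argument to [0, 1]. The host fallback is there only so the snippet also builds with a plain C++ compiler.

```cpp
#include <math.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical generic clamp written as min(max(x, lo), hi).
// With lo = 0 and hi = 1 this is the form that shows up as
// mov/max/mov/min in the PTX above.
HD inline float clamp_generic(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);
}

// Hypothetical wrapper around the saturate intrinsic.
HD inline float saturate_f(float x)
{
#ifdef __CUDA_ARCH__
    // Device code: CUDA intrinsic, clamps x to [0, 1].
    return __saturatef(x);
#else
    // Host code: plain fallback with the same result for finite inputs.
    return fminf(fmaxf(x, 0.0f), 1.0f);
#endif
}
```

Both helpers return the same value for finite inputs, so replacing `clamp_generic(x, 0.0f, 1.0f)` with `saturate_f(x)` should not change results; the difference is only in the generated instructions.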
Generally looks fine; I left an inline question about the uchar conversion.
It's also a good idea to include some numbers in the commit message, such as the percentage speedup you got on a benchmark scene.
Did you test if saturating the whole vector and then converting to uchar gives better performance?
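To make the question concrete, here is a rough sketch of the two variants. All function names are hypothetical, and the `* 255.0f + 0.5f` rounding is an assumption rather than the conversion used in the patch. On the host, small stand-in structs replace CUDA's built-in `float4` and `uchar4` so the snippet compiles without nvcc.

```cpp
#include <math.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
// Host-only stand-ins for CUDA's built-in vector types.
struct float4 { float x, y, z, w; };
struct uchar4 { unsigned char x, y, z, w; };
#endif

HD inline float saturate_scalar(float x)
{
#ifdef __CUDA_ARCH__
    return __saturatef(x);               // CUDA intrinsic, clamps to [0, 1]
#else
    return fminf(fmaxf(x, 0.0f), 1.0f);  // host fallback
#endif
}

// Assumed conversion from a value already in [0, 1] to 8 bits.
HD inline unsigned char unit_to_uchar(float x)
{
    return (unsigned char)(x * 255.0f + 0.5f);
}

// Variant A: saturate and convert one channel at a time.
HD inline uchar4 convert_per_channel(float4 c)
{
    uchar4 r;
    r.x = unit_to_uchar(saturate_scalar(c.x));
    r.y = unit_to_uchar(saturate_scalar(c.y));
    r.z = unit_to_uchar(saturate_scalar(c.z));
    r.w = unit_to_uchar(saturate_scalar(c.w));
    return r;
}

// Variant B: saturate the whole vector first, then convert.
HD inline float4 saturate_vector(float4 c)
{
    float4 s;
    s.x = saturate_scalar(c.x);
    s.y = saturate_scalar(c.y);
    s.z = saturate_scalar(c.z);
    s.w = saturate_scalar(c.w);
    return s;
}

HD inline uchar4 convert_after_vector_saturate(float4 c)
{
    float4 s = saturate_vector(c);
    uchar4 r;
    r.x = unit_to_uchar(s.x);
    r.y = unit_to_uchar(s.y);
    r.z = unit_to_uchar(s.z);
    r.w = unit_to_uchar(s.w);
    return r;
}
```

The two variants produce identical output, so any difference would come from how the compiler schedules the instructions; only a benchmark can answer that.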