
Cycles optimization: move srgb/alpha conversion out of cycles kernel
Closed, ResolvedPublic

Description

Looking at the top 5 functions in the profiler for pavillon_barcelone_v1.2 (Ubuntu 14.04, CPU Intel Core i7 4771, compiled with gcc -march=native):

Function Name | CPU Time by Utilization | Instructions Retired | CPI Rate | CPU Frequency Ratio
ccl::bvh_intersect_instancing | 9987.16s | 26759572000000 | 1.38332 | 1.05899
>__ieee754_powf | 1933.63s | 6645649500000 | 1.0751 | 1.05571
>ccl::svm_image_texture | 1078.16s | 1737641500000 | 2.29578 | 1.05716
ccl::kernel_path_integrate | 824.122s | 2536936500000 | 1.20303 | 1.0581
ccl::shader_setup_from_ray | 809.146s | 1627696000000 | 1.84461 | 1.06019

As the table shows, powf calls are too expensive even on Haswell.

Each time the Cycles kernel fetches an interpolated color for pixel (x, y), it applies alpha (if the use_alpha flag from the SVM stack is set) and converts the result from sRGB to linear (if the srgb flag from the SVM stack is set) -- see svm_image_texture. The kernel therefore makes billions of color_srgb_to_scene_linear calls, each of which uses powf. As far as I can see, both the use_alpha and srgb flags are effectively constants: only EnvironmentTextureNode::compile/ImageTextureNode::compile set them, and only svm_node_tex_environment, svm_node_tex_image_box and svm_node_tex_image decode them from the SVM stack.
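For reference, a minimal sketch of the per-channel conversion being discussed, assuming the standard sRGB decoding curve. The helper name srgb_to_linear_sketch is hypothetical, and Cycles' own color_srgb_to_scene_linear may differ in detail; the point is only that every color sample above the linear toe pays for one power call.

```cpp
#include <cmath>

// Standard sRGB decoding for a single channel: a linear toe near black,
// then a power curve with exponent 2.4.
// Hypothetical helper for illustration, not the Cycles implementation.
inline float srgb_to_linear_sketch(float c)
{
    if (c < 0.04045f) {
        return (c < 0.0f) ? 0.0f : c * (1.0f / 12.92f);
    }
    // This is the expensive call that shows up as __ieee754_powf in a profile.
    return std::pow((c + 0.055f) * (1.0f / 1.055f), 2.4f);
}
```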

If Cycles internals work only in linear space, can we convert images to linear space before starting the raytracer? This could give a noticeable boost for textured objects.

A few notes:

  1. In theory, interpolation between pixels gives different results in linear space (right now cycles interpolates in srgb space). This difference is tiny and only noticeable for extremely lowres textures.
  2. interpolate(premultiply(image), x, y) ≡ premultiply(interpolate(image, x, y)), AFAIK
  3. If user places the same image in the node tree, but with different settings (e. g. Color and Non-color data), then a copy of image should be created.
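Note 1 can be illustrated with a small sketch that compares the two orders of operation for one interpolated sample. It assumes the standard sRGB decoding curve and plain linear interpolation between two neighbouring texels; all function names are hypothetical.

```cpp
#include <cmath>

// Standard sRGB decoding for one channel (hypothetical helper).
inline float srgb_to_linear_sketch(float c)
{
    if (c < 0.04045f) {
        return (c < 0.0f) ? 0.0f : c * (1.0f / 12.92f);
    }
    return std::pow((c + 0.055f) * (1.0f / 1.055f), 2.4f);
}

inline float lerp_sketch(float a, float b, float t)
{
    return a + (b - a) * t;
}

// Order described in the task: interpolate the sRGB values, then decode.
inline float interp_then_decode(float a, float b, float t)
{
    return srgb_to_linear_sketch(lerp_sketch(a, b, t));
}

// Order after a pre-conversion pass: decode each texel, then interpolate.
inline float decode_then_interp(float a, float b, float t)
{
    return lerp_sketch(srgb_to_linear_sketch(a), srgb_to_linear_sketch(b), t);
}
```

For a black texel next to a white one, the midpoint comes out near 0.214 in the first order and 0.5 in the second. For close neighbours such as 0.50 and 0.52 the two orders differ by roughly 1e-4, which fits the observation that the difference matters mostly for very low-resolution or high-contrast textures.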

No patch yet, waiting for Brecht's comment.

Details

Type
Patch

Event Timeline

Sv. Lockal (lockal) set Type to Patch.
Sv. Lockal (lockal) created this task.
Sv. Lockal (lockal) claimed this task.
Sv. Lockal (lockal) raised the priority of this task from to Needs Triage by Developer.

It's impossible to store linear colors in 8 bits without artifacts. Storing it in floats or half-floats would be possible but takes more memory and image textures are already the biggest memory user in many scenes. Interpolation in linear space would in fact be more accurate so that's no problem.
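The 8-bit limitation can be checked with a short sketch: decode every 8-bit sRGB code to linear, re-quantize to 8 bits, and count how many distinct codes survive. This assumes the standard sRGB curve and round-to-nearest quantization; the function names are hypothetical.

```cpp
#include <cmath>
#include <set>

// Standard sRGB decoding for one channel (hypothetical helper).
inline float srgb_to_linear_sketch(float c)
{
    if (c < 0.04045f) {
        return (c < 0.0f) ? 0.0f : c * (1.0f / 12.92f);
    }
    return std::pow((c + 0.055f) * (1.0f / 1.055f), 2.4f);
}

// Map all 256 sRGB byte codes to 8-bit linear codes and count the distinct results.
inline int distinct_linear_codes_8bit()
{
    std::set<int> codes;
    for (int i = 0; i < 256; i++) {
        const float lin = srgb_to_linear_sketch(i / 255.0f);
        codes.insert(static_cast<int>(std::lround(lin * 255.0f)));
    }
    return static_cast<int>(codes.size());
}
```

The count is below 256 because several dark sRGB codes collapse onto the same linear code (sRGB codes 0 and 1 both round to linear 0, for example), which is where the banding in shadows would come from.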

It would be possible to use a lookup table for the values you read from the texture, that's 12 table lookups. That may be faster, I guess it depends a bit on the scene because such a table might easily stay in the cache on simple scenes but not always for more complex scenes.
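A minimal sketch of the lookup-table idea, assuming 8-bit texel channels and the standard sRGB curve. The figure of 12 lookups presumably corresponds to 4 texels times 3 color channels for one bilinearly interpolated sample. The type and member names are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Standard sRGB decoding for one channel (hypothetical helper).
inline float srgb_to_linear_sketch(float c)
{
    if (c < 0.04045f) {
        return (c < 0.0f) ? 0.0f : c * (1.0f / 12.92f);
    }
    return std::pow((c + 0.055f) * (1.0f / 1.055f), 2.4f);
}

// 256-entry table (1 KiB of floats) built once, so the power function is
// never called while rendering. Hypothetical type, not part of Cycles.
struct SrgbByteLut {
    std::array<float, 256> table;

    SrgbByteLut()
    {
        for (int i = 0; i < 256; i++) {
            table[i] = srgb_to_linear_sketch(i / 255.0f);
        }
    }

    // One table lookup per channel of each texel read.
    float decode(std::uint8_t v) const
    {
        return table[v];
    }
};
```

Whether this beats a fast pow depends on cache behaviour, as noted above: the table is small, but it competes with BVH and texture data in complex scenes.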

@Sv. Lockal (lockal) how good or bad is powf, and how much error could we tolerate in the desired range? And how does this translate into possible faster variations of powf?

I know that some implementations of functions like this tend to be slower than you would want or expect, for accuracy and/or legacy reasons.

But is there any speed to be gained from using a powf that is just good enough but faster?

Brecht Van Lommel (brecht) triaged this task as Normal priority.

Test patch for powf replacement. It uses a speculative initial guess based on the float representation and improves the result with three iterations of the Newton-Raphson method. The code is uncommented for now and can be improved with blendv (SSE4) and FMA intrinsics. Gives a 7% speedup on i7-4771 (Haswell) for pavillon_barcelone_v1.2.blend, and 4% on a simple plane with a texture.
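The general technique can be sketched in scalar code as follows. This is not the attached patch: it is an illustration under the assumption that x^2.4 is computed as (x^(4/5))^3, with x^(4/5) found as the 5th root of x^4. All names are hypothetical, and the sketch is only meant for positive, normal floats in a moderate range.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Rough guess for x^0.8: the bit pattern of a positive float behaves
// approximately like a fixed-point log2(x), so scaling its offset from the
// pattern of 1.0f scales the logarithm.
inline float guess_pow_4_5(float x)
{
    std::int32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::int32_t one = 0x3f800000;  // bit pattern of 1.0f
    bits = one + static_cast<std::int32_t>(0.8 * (bits - one));
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// One Newton-Raphson step for y^5 = a: y <- (4y + a / y^4) / 5.
inline float newton_5th_root_step(float y, float a)
{
    const float y2 = y * y;
    const float y4 = y2 * y2;
    return (4.0f * y + a / y4) * 0.2f;
}

// x^2.4 = (x^(4/5))^3, refined with three Newton-Raphson iterations.
inline float pow24_sketch(float x)
{
    const float x2 = x * x;
    const float x4 = x2 * x2;
    float y = guess_pow_4_5(x);
    for (int i = 0; i < 3; i++) {
        y = newton_5th_root_step(y, x4);
    }
    return y * y * y;
}
```

The initial guess is only accurate to a few percent, but each Newton-Raphson step roughly squares the relative error, so three steps are enough to reach single-precision territory. Because x^4 is formed first, very small inputs underflow, which limits the usable domain.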

Nice patch.
On Ivy Bridge I get 30% speedup in images.blend (from test suite) with 100 Samples. (20.54s >> 15.52s).

The attached test program was used for testing the precision and robustness of the optimized pow. The optimized function gives better precision than the original powf from glibc/eglibc by approximately one decimal digit. However, the optimized pow is less robust: it works only for positive numbers in the range 1e-10 to 1e+10 (which should be enough for sRGB->linear conversion). The original powf from glibc works for numbers up to 10e16.
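A minimal sketch of how such a precision comparison could be set up, assuming relative error against a double-precision reference as the metric (the metric used by the actual test file is not stated here). The struct and function names are hypothetical.

```cpp
#include <cmath>

// Maximum and average signed relative error of a candidate x^2.4.
struct Pow24ErrorStats {
    double max_abs_rel;
    double avg_rel;
};

// Samples [lo, hi] geometrically, so every decade gets the same number of
// points, and compares the candidate against std::pow evaluated in double.
template<typename Candidate>
Pow24ErrorStats measure_pow24_error(Candidate candidate, float lo, float hi, int samples)
{
    double max_abs_rel = 0.0;
    double sum_rel = 0.0;
    const double ratio = static_cast<double>(hi) / static_cast<double>(lo);
    for (int i = 0; i < samples; i++) {
        const double t = (i + 0.5) / samples;
        const float x = static_cast<float>(lo * std::pow(ratio, t));
        const double ref = std::pow(static_cast<double>(x), 2.4);
        const double rel = (static_cast<double>(candidate(x)) - ref) / ref;
        max_abs_rel = std::fmax(max_abs_rel, std::fabs(rel));
        sum_rel += rel;
    }
    return {max_abs_rel, sum_rel / samples};
}
```

Any callable taking and returning a float can be passed in, for example a lambda wrapping the library powf or an optimized replacement, so both can be measured over the same sub-ranges.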

Tested with pavillon_barcelone_v1.2, scene "CPU Benchmark"

Ivy Bridge Quad Core (3.4 GHZ)
Ubuntu Linux 12.10, x64
gcc 4.7.2

Vanilla master: 08:40min
With patch: 8:00 min

So I can confirm your 7% here @Sv. Lockal (lockal), nice work!

Nice work!

  • Could you make a color_srgb_to_scene_linear that takes a float4, so svm_image.h just calls this function and the rest is hidden in util_color.h?
  • This code assumes that pow with constant arguments will be constant folded. Can we trust Visual Studio 2008 to do this? You could make that value a template parameter to be sure.
  • We don't currently have unit tests. If you want, you could create a test directory with a C++ file that includes util_color.h, but it's up to you whether you want to do this.

  1. We already have color_srgb_to_scene_linear(float3 c), but why does this function sit inside #ifndef __KERNEL_OPENCL__? Is something wrong with float vectors in OpenCL? Also note that svm_image_texture is third in the profiler list: there are obvious candidates there, such as vectorized alpha multiplication and _mm_min_ps. For now I just want to see the result of the pow changes in this patch.
  2. Good idea. C++ templates do not support float as a template parameter, so I'll fold the float into a hex constant and add a comment.
  3. It's OK as long as we can attach files here. It would be better to make the code less cryptic by moving the common SSE block into util_simd.h (I'll move blend(mask, a, b) for now).
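The hex-constant idea from reply 2 can be sketched like this: the float is passed as its IEEE-754 bit pattern through an integer template parameter, so no run-time pow or constant folding is needed to obtain it. The helper name float_from_bits is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a compile-time 32-bit pattern as a float. Non-type template
// parameters could not be floating point at the time, so the constant
// travels as an integer.
template<std::uint32_t Bits>
inline float float_from_bits()
{
    const std::uint32_t bits = Bits;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Example: 0x3F4CCCCD is the single-precision encoding of 0.8f (4/5).
inline float four_fifths()
{
    return float_from_bits<0x3F4CCCCDu>();
}
```

A comment next to each hex literal stating the decimal value keeps the code readable, which is what the reply proposes.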

Sandy Bridge hardware gives me about 7% on Barcelona and some other archviz scenes. Higher resolutions seem to give more speedup, and more texture-heavy scenes also gain more.

Barcelona gives a 7.01% improvement, comparing 5 runs with the patch against 5 runs without.

New version of this patch: added comments, moved blend() to util_simd.h (SSE4.1 gives 2 instructions fewer), and replaced the exp2(... * pow(...)) expressions with precalculated constants.

blender trunk 3 versions

I compiled trunk, trunk + your patch and trunk + patch + sse41 kernel in one 7z

I'll test tomorrow

@Sv. Lockal (lockal)

win64 release mode:
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 2.0467e-019
error max = 0.333336 avg = -0.00114124 |avg| = 0.0083194 to 1.34133e-014
error max = 3.09128e-006 avg = -2.51558e-006 |avg| = 2.51558e-006 to 8.79053e-010
error max = 2.04248e-006 avg = -1.45791e-006 |avg| = 1.45791e-006 to 5.76096e-005
error max = 9.82767e-007 avg = -4.00258e-007 |avg| = 4.16499e-007 to 3.7755
error max = 1.21638e-006 avg = 6.57398e-007 |avg| = 6.57398e-007 to 247431
error max = 2.29059e-006 avg = 1.71506e-006 |avg| = 1.71506e-006 to 1.62157e+010
error max = 3.33717e-006 avg = 2.77272e-006 |avg| = 2.77272e-006 to 1.06271e+015
error max = 3.55752e-006 avg = 3.41284e-006 |avg| = 3.41284e-006 to 1.13483e+016

win32 release mode
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 1.5333e-019
error max = 0.999992 avg = 0.0112573 |avg| = 0.0207178 to 1.00486e-014
error max = 3.11894e-006 avg = -2.54687e-006 |avg| = 2.54687e-006 to 6.58546e-010
error max = 2.07021e-006 avg = -1.48921e-006 |avg| = 1.48921e-006 to 4.31584e-005
error max = 1.01037e-006 avg = -4.31553e-007 |avg| = 4.41073e-007 to 2.82843
error max = 1.20056e-006 avg = 6.26103e-007 |avg| = 6.26103e-007 to 185364
error max = 2.26305e-006 avg = 1.68376e-006 |avg| = 1.68376e-006 to 1.2148e+010
error max = 3.30964e-006 avg = 2.74142e-006 |avg| = 2.74142e-006 to 7.96133e+014
error max = 3.55752e-006 avg = 3.3973e-006 |avg| = 3.3973e-006 to 1.13483e+016

flags used:
cl /arch:SSE /arch:SSE2 -D_CRT_SECURE_NO_WARNINGS /fp:fast /Ox /Gs- pow_precision_test.cpp

Hi, I tested the patch against a trunk build from juicyfruit (VS 2013) with my benchmark file, using 32x32 tiles.

http://www.blenderartists.org/forum/showthread.php?303832-New-Cycles-Benchmark

http://martijnberger.nl/file/win64-vc12_Lockal.7z

Trunk 07:44.73
With patch 07:36.33

Intel i5 3770K
8GB
Windows 8 Ultimate

BTW. Trunk on Linux 06:12.52
Cheers, mib.

Regarding the #ifndef __KERNEL_OPENCL__ around color_srgb_to_scene_linear: that's because OpenCL doesn't support function overloading. If you give the function a different name it should be OK.

Otherwise this looks good to me; if you want to commit the patch, go ahead.

Pow patch committed in rB96903508bc3faec99bac8007e344016698630aae.

I'll commit one other simple patch for ccl::svm_image_texture and then will close this task.

Sv. Lockal (lockal) closed this task as Resolved. Jan 6 2014, 7:34 PM

Committed the ccl::svm_image_texture code as rBacc90b40bff5a15604c4d98692ff3ba32fe44603. There is no big reason to optimize the interpolation itself: 90% of its time is the actual texture read (lea, movq).

Here's a slightly enhanced converter. The ^2.4 function is still rather extravagant as it only needs to work in the range 0-1.

and here's the complete VS2013 project / solution for those who want to optimize further.

fastpow24 pow:
Domain from 1.39e-014 in 7.15sec
error max = 0.89 avg = -0.454 |avg| = 0.465 to 9.1e-010 in 6.92 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 5.96e-005 in 1.97 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 3.91 in 1.98 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 2.56e+005 in 1.97 sec
error max = 6.1e-007 avg = 5.21e-008 |avg| = 1.08e-007 to 4.29e+009 in 1.73 sec

fasterpower24 pow:
Domain from 5.14e-012 in 7.77sec
error max = 0.7 avg = 0.0015 |avg| = 0.00641 to 3.37e-007 in 6.21 sec
error max = 9.5e-007 avg = 3.39e-010 |avg| = 1.52e-007 to 0.0221 in 1.73 sec
error max = 1e-006 avg = 5.76e-010 |avg| = 1.52e-007 to 1.45e+003 in 1.72 sec
error max = 1e-006 avg = 7.89e-010 |avg| = 1.52e-007 to 9.49e+007 in 1.73 sec
error max = 1e-006 avg = 6.33e-010 |avg| = 1.52e-007 to 2.87e+009 in 0.53 sec

Classic powf:
Domain from 1.53e-019 in 2.09sec
error max = 1 avg = 0.0113 |avg| = 0.0207 to 1e-014 in 2.96 sec
error max = 3.1e-006 avg = -2.55e-006 |avg| = 2.55e-006 to 6.59e-010 in 2.03 sec
error max = 2.1e-006 avg = -1.49e-006 |avg| = 1.49e-006 to 4.32e-005 in 2.04 sec
error max = 1e-006 avg = -4.32e-007 |avg| = 4.41e-007 to 2.83 in 2.03 sec
error max = 1.2e-006 avg = 6.26e-007 |avg| = 6.26e-007 to 1.85e+005 in 2.04 sec
error max = 2.3e-006 avg = 1.68e-006 |avg| = 1.68e-006 to 1.21e+010 in 2.04 sec
error max = 3.3e-006 avg = 2.74e-006 |avg| = 2.74e-006 to 7.96e+014 in 2.03 sec
error max = 3.6e-006 avg = 3.4e-006 |avg| = 3.4e-006 to 1.13e+016 in 0.491 sec

And here's a patch that includes the slightly faster robust power function (fasterpower24) but also has another (approxpow24) that just uses a polynomial approximation, which is adequate for the limited range needed.

jrp, are you sure that 2 iterations of Halley's method are faster than 3 iterations of the Newton-Raphson method? A single iteration of Halley's method has 6*, 4+ and 1/, while Newton-Raphson has only 4*, 1+ and 1/. Halley's method is also less robust because it calculates approx^5, so the working domain will be smaller.
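For comparison, here is a sketch of one step of each method for the 5th-root problem y^5 = a, written in double precision to keep rounding out of the picture. These are the textbook update formulas, not the code from either patch, and the Halley step is not arranged to minimize operations; the function names are hypothetical.

```cpp
#include <cmath>

// Newton-Raphson step for y^5 = a: y <- (4y + a / y^4) / 5.
// As written: 4 multiplications, 1 addition, 1 division.
inline double newton_5th_root_step(double y, double a)
{
    const double y2 = y * y;
    const double y4 = y2 * y2;
    return (4.0 * y + a / y4) * 0.2;
}

// Halley step for y^5 = a: y <- y * (4 y^5 + 6a) / (6 y^5 + 4a).
// Needs y^5, so it overflows or underflows earlier than the Newton step.
inline double halley_5th_root_step(double y, double a)
{
    const double y2 = y * y;
    const double y5 = y2 * y2 * y;
    return y * (4.0 * y5 + 6.0 * a) / (6.0 * y5 + 4.0 * a);
}
```

Newton-Raphson roughly squares the relative error per step, while Halley roughly cubes it. Starting from a guess that is a few percent off, two Halley steps and three Newton-Raphson steps both end up far below single-precision resolution, so the choice comes down to instruction count and usable domain.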

The input of color_srgb_to_scene_linear is not limited to 1 in case of EXR images. However it is possible to call a specialized function in svm_image_texture for byte images and a generic function for float images.

The max error of polynomial is too big (approxpow24(1.0) = 0.994522324). One may use minimax approximant to achieve better results. 0.951542769e-3+(-0.3117281851e-1+(.5386576039+(.6188134751-.1274088440*x)*x)*x)*x has max error of 0.0001590453, but I think it is still too big.

Halley does seem to be a fraction faster, as the previous post illustrates. I've included the complete project file so that you can check that I am timing the right thing. In the grand scheme of things the classic powf doesn't do too badly.

EXR images should be linear already, but life being what it is, I can see that you may want to correct them anyway.

Here are a couple of other approximations:

Optimizing conversion between sRGB and linear

sRGB Approximations for HLSL

and I expect that you will have seen

Optimizations for pow() with const non-integer exponent?

A further poke through the Blender code reveals that it seems to have at least one other sRGB-to-linear conversion, never mind the ones in OpenImageIO, etc.