Cycles Image Reconstruction kernel

Description

Hi,

these are the Image Reconstruction kernel modifications.

included:

  • a weight pass, used to weight other "marked" passes
  • more filters: Mitchell, Sinc, Triangle
  • as this is only the Cycles part, here is how I write the image from my side:
	BufferParams& params = rtile.buffers->params;
	int x = params.full_x - options.session->tile_manager.params.full_x;
	int y = params.full_y - options.session->tile_manager.params.full_y;
	int w = params.width;
	int h = params.height;

	RenderBuffers *buffers = rtile.buffers;

	/* copy data from device */
	if (!buffers->copy_from_device())
		return;

	float exposure = scene->film->exposure;
	std::vector<float> pixels(params.width*params.height * 4);
	std::vector<float> pixels_weight(params.width*params.height);

	buffers->get_pass_rect(PASS_WEIGHT, 1.f, rtile.sample, 1, &pixels_weight[0]);
	if (buffers->get_pass_rect(PASS_COMBINED, 1.f, rtile.sample, 4, &pixels[0]))
	{
		for (size_t i = 0; i < pixels_weight.size(); ++i)
		{
			if (pixels_weight[i] != 0.f)
			{
				float invWt = 1.f / pixels_weight[i];
				pixels[i * 4] *= invWt;
				pixels[i * 4 + 1] *= invWt;
				pixels[i * 4 + 2] *= invWt;
				pixels[i * 4 + 3] *= invWt;
				pixels[i * 4 + 3] = saturate(pixels[i * 4 + 3]);
			}
			else
			{
				pixels[i * 4] = 0.f;
				pixels[i * 4 + 1] = 0.f;
				pixels[i * 4 + 2] = 0.f;
				pixels[i * 4 + 3] = 1.f;
			}
		}
	update(&pixels[0], x, y, w, h); /* function callback */
	}

Note: I disabled the scale inside get_pass_rect().
  • my test case: background = true, filename.empty() = true, progressive_refine = false
  • ImageReconstructionSampleBoundaries needs a tiny redesign: it is 33 variables, possibly remove one or pad it
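A tiny sketch of the padding idea (the struct name matches the patch, but the member layout here is purely hypothetical): 33 floats are 132 bytes, which is not a multiple of 16, so GPU loads straddle alignment boundaries; padding to 36 floats restores 16-byte alignment.

```cpp
#include <cstddef>

/* Hypothetical layout: 33 floats = 132 bytes straddles 16-byte
 * alignment boundaries, so pad to 36 floats = 144 bytes. */
struct ImageReconstructionSampleBoundaries {
    float data[33];
    float pad[3]; /* explicit padding to a multiple of 16 bytes */
};

static_assert(sizeof(ImageReconstructionSampleBoundaries) % 16 == 0,
              "expected a 16-byte aligned size");
```

Removing one variable instead (32 floats = 128 bytes) would achieve the same alignment without wasting the padding bytes.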

Known issues that need a design decision (possibly Sergey can continue from this point):

  • as the tile data is kinda private (not writing to a global buffer), there are artifacts between tile borders (I'm clamping edges)
    • solution: the tile buffer can be a little larger (old range: (sx -> sx + sw, sy -> sy + sh), new range: (sx - filter_width -> sx + sw + filter_width, sy - filter_width -> sy + sh + filter_width))
    • and use this range inside the kernel call (sx, sy, sw, sh)
    • in the write function (to Blender), there will be a global buffer accumulating data, as the extended tile buffer ranges will be overlapping
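The padded range idea above could look roughly like this (a sketch with hypothetical names; Cycles' real tile structures differ). The buffer grows by filter_width on each side, clamped to the full frame, while the kernel still iterates the inner (sx, sy, sw, sh) range.

```cpp
// Hypothetical tile rectangle; not the actual Cycles BufferParams.
struct TileRect { int x, y, w, h; };

// Enlarge a tile by filter_width pixels on every side so the filter
// footprint of edge pixels stays inside the buffer, clamping against
// the full frame so we never address outside the image.
TileRect pad_tile(const TileRect &t, int filter_width,
                  int full_w, int full_h)
{
    TileRect p;
    p.x = t.x - filter_width;
    p.y = t.y - filter_width;
    p.w = t.w + 2 * filter_width;
    p.h = t.h + 2 * filter_width;
    if (p.x < 0) { p.w += p.x; p.x = 0; }
    if (p.y < 0) { p.h += p.y; p.y = 0; }
    if (p.x + p.w > full_w) p.w = full_w - p.x;
    if (p.y + p.h > full_h) p.h = full_h - p.y;
    return p;
}
```

Because neighboring padded tiles overlap, the write side then needs the accumulating global buffer described in the last bullet.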

I tested on CPU and CUDA (3.5), Windows 8, i7 3930K, GTX 780. Should work fine on OpenCL (not tested).

cheers,
Mohamed Sakr

Mohamed Sakr (msakr) set Type to Patch.

Oops, I missed and put it in BF Blender; should have chosen BF Cycles.

Forgot to add the diff.


sorry for this, I didn't know where to add the diff, my first time here.

Maybe an admin can recreate the whole task or update it.

  • also disabled Saturate inside get_pass_rect() for alpha; saturation is done after weight denormalizing

Could you please fix the patch ?

It is pretty much unreadable and very very hard to spot what you are changing.

sorry for being a noob, my first time making a diff.
I had many commits, squashed them into one, and did a diff from master.

what am I missing?

ok, I figured out how, thanks Martijn :)
here is the correct diff,

General code feedback:

  • Please follow the existing code style
  • Please do not add your name in the comments; if the patch is accepted, authorship of the code is easy to find via git blame

Regarding this particular change:

  • It's not really clear how to check it and what to expect. After applying this patch default cube scene is becoming more and more red with more samples
  • I'm not sure the Mitchell filter is used correctly here
  • Atomics are in fact quite bad. They stall all the pipelines, introducing really bad latency issues with lots of threads
  • Not even sure why you need to do it in the kernel itself
  • I'm not sure why it is getting red in your case (it is fine here); it is related to the write part on your side. As I changed some stuff in get_pass_rect(), some stuff inside the write_cb() and update_cb() functions should be adjusted accordingly
  • I used the Mitchell filter function from PBRT as-is
  • not sure if there is an alternative to atomics (without using CUDA dynamic parallelism)
  • why it is done in the kernel, let me compare (pseudocode):

a) done as a filter (compositing):

	pixel_value = sum(neighbor pixel_value * neighbor pixel_weight) / neighbor_pixel_count

the weight here is calculated from the pixel center

b) done inside the kernel:

	for (i in samples) {
		pixel_value  += sum(neighbor pixel_value * neighbor pixel_weight)
		pixel_weight += sum(neighbor pixel_weight)
	}
	pixel_value /= pixel_weight  // normalizing, if weight != 0

here it depends on the per-sample position (weight) and color; you can see the difference from the equations.

conclusion: the compositing filter works on pixel centers and averaged values (using sums), while the kernel filter works on float pixel positions and exact sample values.
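For reference, the PBRT Mitchell-Netravali 1D filter mentioned above looks like this (a sketch following PBRT's formulation; x is the sample offset normalized by the filter radius, so the support is |x| <= 1):

```cpp
#include <cmath>

// Mitchell-Netravali 1D filter, PBRT-style. B = C = 1/3 are the
// values recommended by Mitchell and Netravali. The function has a
// negative lobe around |x| ~ 0.75 and falls to zero at |x| = 1.
float mitchell_1d(float x, float B = 1.0f / 3.0f, float C = 1.0f / 3.0f)
{
    x = std::fabs(2.0f * x);
    if (x > 1.0f)
        return ((-B - 6.0f * C) * x * x * x + (6.0f * B + 30.0f * C) * x * x +
                (-12.0f * B - 48.0f * C) * x + (8.0f * B + 24.0f * C)) *
               (1.0f / 6.0f);
    else
        return ((12.0f - 9.0f * B - 6.0f * C) * x * x * x +
                (-18.0f + 12.0f * B + 6.0f * C) * x * x + (6.0f - 2.0f * B)) *
               (1.0f / 6.0f);
}
```

That negative lobe is the source of the ringing and negative-value concerns raised later in the thread.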

Just tell me what you want me to update; I will do the general code feedback. What else?

I see the filter result is not correct; the only place I suspect is the filter table itself.
In PBRT the filter table is just the function; in Cycles the filter table is built through a long process with MIS stuff.

I will create another filter table similar to PBRT's and re-test.

One note: the kernel filter enhances AA; the compositing filter doesn't affect AA that much.

I strongly suggest to not add any filters with negative weights to Cycles, like Mitchell or sinc. In my opinion they only add problems and I would not suggest anyone to use them in practice. The triangle filter is also not useful. Blackman–Harris could be good to add, but it's the only one I can think of.

With a path tracer like Cycles there's enough AA samples that you don't need to artificially sharpen the image, and deal with problems like ringing or clamping negative values.

Mohamed Sakr (msakr) added a comment. Edited Jul 24 2015, 4:04 AM

@Brecht Van Lommel (brecht), what do you think about the kernel filter vs the compositing filter? (I prefer the kernel filter, as stated above)
also, you have a good point about the negative weights of Mitchell
what about atomics? any idea to avoid them? The only two things I can think of are:
1- another buffer to write to (bad for memory, and bad as it is a global memory write), or
2- shared memory atomics (faster on Maxwell, slower on Kepler)
or anything else that you may suggest.

The "compositing filter" looks like a blur operation you could do in the compositor, there's no point doing that in Cycles.

My understanding is that this is an alternative sampling method to the filter importance sampling that we use, and the only way to do that correctly is to work on individual AA samples. Both this code and the existing method should converge to the same result. There are some pros and cons as discussed in the filter importance sampling paper, which converges better depends on the scene. Also interesting to note filter importance sampling is required for denoising as added in PRMan.

In any case, I think that before accepting this patch there should be evidence that it's actually significantly better for some example scenes, and the extra render time of padded tiles is worth it. Also because the implementation adds significant complexity to the code. I have a feeling that with the high number of AA samples typically used in Cycles scenes (and more and more in production scenes elsewhere), filter importance sampling might be pretty competitive.

I don't know what the best way is to implement this. If you "scatter" using atomics you will suffer from contention but the code stays relatively simple. The alternative is to split it into two stages, one where you write unfiltered samples into a buffer and then a second where you "gather" the samples and filter them, so that each thread can write to its own pixel without any contention. But the first thing for me would be to prove that this method is helpful and then worry about optimization.
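The two-stage "gather" formulation Brecht describes could be sketched like this (all names hypothetical; a real implementation would bin samples spatially rather than scan them all). Stage 1 stores raw samples with their subpixel positions; stage 2 lets each pixel gather the samples inside its filter footprint, so each thread writes only its own pixel and no atomics are needed.

```cpp
#include <cmath>
#include <vector>

// A raw, unfiltered sample: subpixel position plus its value
// (one float channel here for brevity).
struct RawSample { float x, y, value; };

// Stage 2: one call per pixel. Gathers every sample whose position
// falls inside this pixel's filter footprint and normalizes by the
// accumulated weight. A box filter is used for brevity; any filter
// (Mitchell, Gaussian, ...) would replace the constant weight.
float gather_pixel(const std::vector<RawSample> &samples,
                   int px, int py, float filter_radius)
{
    float accum = 0.0f, weight_sum = 0.0f;
    for (const RawSample &s : samples) {
        float dx = s.x - (px + 0.5f);
        float dy = s.y - (py + 0.5f);
        if (std::fabs(dx) > filter_radius || std::fabs(dy) > filter_radius)
            continue;
        float w = 1.0f; /* box weight; evaluate the real filter here */
        accum += w * s.value;
        weight_sum += w;
    }
    return weight_sum > 0.0f ? accum / weight_sum : 0.0f;
}
```

The trade-off versus the atomic "scatter" version is the extra memory for the raw sample buffer, which is why Brecht suggests proving the method's benefit before optimizing.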

Mohamed Sakr (msakr) added a comment. Edited Jul 24 2015, 9:10 AM

I fixed the problem now, added another table without MIS, and the results seem correct.
Here are 4 images: 16 spp and 64 spp, with and without the filter; the filter is Gaussian, width = 1.5


The render time difference is nothing!! (but as this is only for one pass, it needs testing with all passes active and using the filter, and possibly with the 2-pixel border extension to avoid bucket seams)

@Brecht Van Lommel (brecht), thanks for the reply and explanation; I don't have much time for the tracker these days..

@Mohamed Sakr (msakr),

I don't think the implementation totally matches the explanation (unless I'm missing something). Filtering is still happening purely inside the pass buffers, which don't have spatial information from the sample. So unless you're re-generating the filter kernel for each AA sample (which is going to be rather bad), I don't see how exactly you're exploiting spatial information?

By not doing the filter in the kernel I didn't mean doing it in post-pro. This should have been read as: don't do this in the path tracing kernel; it's better to have a dedicated kernel for this, which will avoid the use of atomics. Should be easy to do with the current formulation of the kernel, and shouldn't give much overhead even on CUDA with dynamic kernel launching.

As for the example images, if I decoded what's on them correctly (proper labeling would have helped..), then the filtered result is far too fuzzy, not really acceptable.


But now, I'm not convinced this is the way we should go at all. To me it seems that instead of doing all sorts of weird and wonderful stuff in Cycles itself, it should be done as post-pro, and Cycles should just provide the required information, like sample spatial information, variances and so on.

That being said, the simplest ever thing to do is to support FSAA in Cycles. It would improve the artistic pipeline, especially with hairy characters in the scene.

With some additional thought and generalization, it shouldn't be rocket science to get to a design where Cycles provides the required information and all actual filtering happens outside of the render kernel (in Blender it could happen in the compositor, but the actual filtering logic could still be implemented on the Cycles side).

@Sergey Sharybin (sergey), inside kernel_camera.h, camera_sample() is where the weights are generated (and that's why I couldn't move them to another kernel): each pixel gives 25 weights to neighboring pixels, so it doesn't depend only on a single weight and a single L, it depends on L and 25 weights per pixel.

this test was just a rough test to show the idea; a 1.5 filter width is quite large (0.75 should give great results)
about the image names: images named _64 are 64 spp, names with _filter use the filter, the other images are 16 spp

My understanding was that this is an alternative importance sampling method for pixel filters basically, which can be less noisy especially with low AA samples. FSAA would indeed be great and give some of the same advantages, though I guess for the case of e.g. 1024 AA samples you can't store them all and would still need to do some filtering immediately.

The example renders do not look like they would converge to the same result though? The ones with this patch look much blurrier.

True, but I think the trick to try is something in between real FSAA and the current approach, like storing several (8, 16, 32?) buffers with samples which are distributed close to each other. Roughly speaking, render at a higher resolution with fewer AA samples per pixel (probably some weighting tricks are needed as well). Then the compositor either deals with each individual buffer extracted from the bigger one, or just downsamples with a nice kernel.

This would also help denoising, by giving extra information about whether noise is caused by bad convergence or is just a high-frequency texture applied to the mesh. Or even help with importance map calculation for adaptive sampling, things like that.

And in any case, even if we apply the filter to each of those AA samples, I'm not really sure the patch will give a measurable improvement.
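The "render higher resolution, then downsample with a nice kernel" idea above can be sketched minimally (hypothetical helper; a 2x supersampled buffer reduced with a plain box kernel, where the "nice" version would use overlapping tent or Blackman-Harris footprints):

```cpp
#include <vector>

// Reduce a 2x supersampled single-channel buffer to final resolution.
// Each output pixel averages its 2x2 block of input samples; a nicer
// kernel would weight a wider, overlapping footprint instead.
std::vector<float> downsample_2x(const std::vector<float> &src,
                                 int src_w, int src_h)
{
    std::vector<float> dst((src_w / 2) * (src_h / 2));
    for (int y = 0; y < src_h / 2; y++) {
        for (int x = 0; x < src_w / 2; x++) {
            float sum = src[(2 * y) * src_w + 2 * x] +
                        src[(2 * y) * src_w + 2 * x + 1] +
                        src[(2 * y + 1) * src_w + 2 * x] +
                        src[(2 * y + 1) * src_w + 2 * x + 1];
            dst[y * (src_w / 2) + x] = 0.25f * sum;
        }
    }
    return dst;
}
```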

here is a test from PBRT2, with and without filter, Mitchell 1.5

What does "without filter" mean, a box filter? The difference between those two images is too big to be explained by just a different pixel filter, did these renders use the same number of samples?

The edges seem much sharper in the PBRT example. Either the number of samples is different there, or there is some rather extreme edge-preserving filtering going on (which is less likely, because of the high-frequency noise appearing in the filtered image).

Please make sure you're comparing apples to apples: the exact same scene, the exact same sampling/light path settings, just a different filter size, and no extra biasing settings.

Mohamed Sakr (msakr) added a comment. Edited Jul 24 2015, 6:34 PM

same number of samples (32 spp); the noisy one is using a very small filter, which effectively contributes nothing to neighboring pixels through the filter function, as the filter size in the noisy one is 0.4 and in the good one 1.5; both are using Mitchell

everything in that scene is exactly the same; the only difference is the filter radius (0.4 vs 1.5)

and by the way, this is the only reason why I started this patch, as I saw how effective it is in PBRT2

also, I have seen this filtering style in many renderers, like V-Ray and Arnold

Ah, but that's not a valid comparison. You're making the filter smaller than a pixel, but PBRT is still sampling the entire pixel, which means most of the samples fall outside of it (or at least have a very small weight). So you're effectively rendering with fewer samples and it gets more noisy.

In fact this is a case where the filter importance sampling as used by Cycles would render with a lot less noise than PBRT, though a filter size of 0.4 is not something you would use in practice.

I see, but what if we use both?
filter MIS reduces noise (variance),
but it doesn't address AA (you still see very sharp diagonal lines even if you are using 10000 spp);
here this filter can give you the interpolation between pixels (more blurry), so it enhances the AA look.

maybe use a Gaussian filter width of 2 for MIS, and a width of 1 for the kernel filter? That way the blur will only cover a 3x3 pixel area, which gives less blurry results, but more visually pleasing than zero inter-pixel color interpolation.

Well there's two things that should be considered separately. One is the importance sampling method which affects how quickly the image converges. If it looks visually pleasing or not does not matter, because if the implementation is correct it will look exactly the same with many AA samples.

The other is the choice of pixel filter which affects how it looks when converged. For that I think it's worth trying Blackman-Harris, which is like Gaussian but supposedly has less aliasing.
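The Blackman-Harris filter Brecht mentions could be evaluated as a pixel filter roughly like this (a sketch using the standard 4-term window coefficients; the helper name and the mapping of offset to window position are assumptions, not Cycles code):

```cpp
#include <cmath>

// 4-term Blackman-Harris window used as a pixel filter. x is the
// offset from the filter center, width the total filter width, so
// the support is |x| <= width / 2. The window peaks at 1 in the
// center and falls to ~6e-5 at the edges, and is non-negative.
float blackman_harris(float x, float width)
{
    if (std::fabs(x) > 0.5f * width)
        return 0.0f;
    const float a0 = 0.35875f, a1 = 0.48829f, a2 = 0.14128f, a3 = 0.01168f;
    const float pi = 3.14159265358979f;
    float u = 2.0f * pi * (x / width + 0.5f); /* map [-w/2, w/2] to [0, 2*pi] */
    return a0 - a1 * std::cos(u) + a2 * std::cos(2.0f * u) -
           a3 * std::cos(3.0f * u);
}
```

Because the weights are never negative, it avoids the ringing and clamping problems of Mitchell and sinc discussed above.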

so what shall I do now? (if the MIS sampling filter is better, then this task is useless)
I guess I'll remove the filter from inside the kernels, and just add a Blackman-Harris filter function to the code.

what I want to implement inside Cycles:
1- http://sglab.kaist.ac.kr/~bcmoon/ which has http://sglab.kaist.ac.kr/LinearPrediction/ (the first paper on the page right now)
I think this is the paper that you referenced that is used in PRMan.

I already checked the previous paper (Adaptive Rendering Based on Weighted Local Regression, "WLR") and actually improved it inside PBRT2. Moon et al. released the code for PBRT2, and it was really a good filter, but it was only usable for uniform noise distributions + direct diffuse.

so if you have an interior scene, where PT will be horrible, this filter will generate artifacts due to bad variance / lack of correct lighting.
I improved it to take into account more data, like shadows, specular, and some glossiness.

2- https://mediatech.aalto.fi/publications/graphics/GPT/
this one gives less variance with fewer samples, using the gradient domain, but the cost per sample is about 2.5x that of PT (it does the sample plus 4 other samples in -x, +x, -y, +y). The result is way better, so 1000 spp in GPT is better than 4000 spp in PT; the improvement, as stated in the paper, is about 1.6x.

3- http://jo.dreggn.org/home/, Johannes Hanika, Manifold Next Event Estimation (MNEE)
a decent paper which suits Cycles PT very well, as it will help to catch specular rays, so it will help a lot with all glass shaders, caustics, ...

4- https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/70/asset/Sorted_Deferred_Shading_For_Production_Path_Tracing.pdf
as the title says :) , and this type of work suits the GPU as it contains a lot of sorting and coherence, but it needs very tight texture pixel usage (like Ptex). I tested the main concept of this in PBRT2 and it gave some improvement, but only a small one due to the use of UV texturing: instead of sorting by Ptex index, I was sorting by UV coordinate.

5- Embree
I checked Cycles against Corona, Arnold, V-Ray, ....
in a complex scene, all of the others start in around 6~8 seconds; Cycles takes 18 seconds.

Such discussions are better happening outside of the patch review, because they're not really related to the code and will only confuse reviewers.


Some quick feedback; the rest I'd prefer to happen on the bf-cycles ML.

  1. Didn't check the paper in detail yet, but which exact filtering are you referring to? Since it's an adaptive sampling technique, it could only mean additional samples are not distributed in noisy areas. Or are you referring to artifacts in the importance map?
  2. Gradient domain rendering is indeed something to consider looking into. It will also be interesting to investigate which filters we can exploit the gradient input for.
  3. That's quite a specific case of participating media rendering; not sure it can be integrated in a really clean way. Maybe it could, though..
  4. Isn't that something which helps out-of-core rendering, by basically reducing re-loading of the texture cache and avoiding mutex locks? We don't do that kind of rendering or tile caching in Cycles, so it might not be really relevant.
  5. It's not really clear what exactly "starting in around 6-8 seconds" means. Starting to render? The preparation step in Cycles is not really as fast as it could be. It's also good to know which exact scene you've been rendering and how you set up your materials.

Maybe @Brecht Van Lommel (brecht) has some additional feedback to confirm/bust my understanding of the papers, but as said, prefer to switch to a ML discussion for such questions.

  1. I don't know if this is the paper that PRMan uses or if what they do has even been published anywhere. But certainly there is a lot of interesting research going on in this area, and this is a great paper.
  2. Also a very good paper. The reconstruction artifacts look pretty bad to me though. I guess that's a big question with all these fancy reconstruction methods: how to make them conservative enough to avoid bad artifacts, while still keeping the speedup. I think the feedback loop with adaptive sampling as in 1. is important for that.
  3. Great paper, but it's only for certain types of caustics, and it's only making rendering of some very difficult effects bearable, but still very slow. I think there are other rendering methods that will help Cycles users a lot more in typical scenes, so I wouldn't give this high priority. I imagine manifold exploration is quite tricky to implement for general scenes and difficult for users to understand when it fails.
  4. There are so many things that can be optimized in Cycles in terms of being able to handle big scenes; this paper seems like a lot of complexity for both users and developers. For example, it won't work well in interactive rendering. My opinion is still that the focus should be on making data structures more compact, doing good adaptive subdivision, and making use of the OIIO texture cache (works mostly for OSL, not currently supported in SVM due to lack of derivatives).

If all that has been done and it's still not enough, only then should something like this be tried. But note that nearly every studio is doing ok without a sorting/coherence system like this, there's not going to be many Cycles users that do things at Disney scale.

  5. Not sure what the starting refers to; do you mean those other renderers have a faster BVH build?

about 5., yep, they get something like a 2.5x speedup in BVH build.

about 1., Disney released the paper, and in PRMan they said they used the recent Disney paper, so I think this is it.

I think 1 and 2 can be mixed, as they both depend on the scene gradient data in general!

about 3., this is for luxury :D , maybe an option for the user; it helps a lot to catch details. Why did I mention it? Paper 1 is mainly used to significantly reduce samples, so instead of rendering 4k samples you can be OK with 150~200 samples; if these samples are catching details, they will solve a huge problem the paper suffers from (glossy/specular surfaces, caustic definition)

about 4., it is all about texture access and ray intersection coherence. With Cycles you are using many layers of textures with a complex node setup; in interior scenes Cycles is crawling (no irradiance, no light cache / photon map, ...), so to solve interior scenes (even using light portals) you will need more than 2000 spp, and even 1 and 2 won't help much with this (high variance, etc.). So the solution is to make texture access way faster. But this is a nightmare, as you said, for the developer: to do something like this, the CUDA kernel needs to be split similar to the OpenCL kernels, then do 2 sortings per ray bounce, one for intersections and one for BSDFs.

Embree is important here; it has kernels that improve hair intersection significantly on Xeon devices (which have wide vectors, suitable for GPU).

about @Sergey Sharybin (sergey)'s question on the scene used: it is just a product visualization scene, but the product has a few million triangles. Other renderers initialize the whole scene, including the BVH, in 6 seconds; Cycles takes 4 seconds for the scene data, then 12 seconds for the BVH creation (and I guess they don't use a Morton dynamic BVH here).

also, its build time is quite good. I know Cycles uses an old Embree kernel, but it needs an update (imagine a scene which renders in 30 minutes instead rendering in 10~20 minutes).

tell me where to discuss this stuff, as I don't know :)
in my opinion, of the papers above, 1 is the most significant, then 3, 5, 4, 2.

other than that, tell me what to do with the patch so I can finalize it.

Regarding this patch, I guess it's your job to convince the Cycles developers that this sufficiently reduces noise in important cases to be worth the added code complexity. If that is clear, then someone can do a more detailed review of the code. Personally I'm not totally convinced that this should be added, the choice to go with filter importance sampling was not by accident, but I never did a detailed comparison so maybe I was wrong.

Currently the patch is not functioning correctly, it's rendering too blurry, so it's impossible to judge the usefulness.

From a quick glance at the patch it seems that the filter importance sampling code is still active, so it may be doing double filtering? I don't understand the logic in the code well. The filter_table is an inverted CDF for importance sampling, but you don't need that for what you're trying to do. Rather I would expect uniform sampling inside a pixel, and filter weights computed from evaluating the filter function, without any CDFs or inversion.
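What Brecht describes here — uniform sampling inside the pixel and direct filter evaluation, with no CDFs or inversion — would look roughly like this (a sketch with hypothetical names; the real patch would work on Cycles pass buffers and, on the GPU, need atomics for the two `+=` lines):

```cpp
#include <cmath>
#include <vector>

// Splat one camera sample at subpixel position (sx, sy) with radiance
// L into value/weight buffers. The filter is evaluated directly at
// each neighboring pixel's offset from the sample; no CDF is needed.
template <typename Filter>
void splat_sample(std::vector<float> &value, std::vector<float> &weight,
                  int img_w, int img_h, float sx, float sy, float L,
                  float radius, Filter filter_eval)
{
    int x0 = (int)std::floor(sx - radius), x1 = (int)std::ceil(sx + radius);
    int y0 = (int)std::floor(sy - radius), y1 = (int)std::ceil(sy + radius);
    for (int y = y0; y < y1; y++) {
        for (int x = x0; x < x1; x++) {
            if (x < 0 || y < 0 || x >= img_w || y >= img_h)
                continue;
            float w = filter_eval(sx - (x + 0.5f), sy - (y + 0.5f));
            value[y * img_w + x] += w * L;   /* atomics needed on the GPU */
            weight[y * img_w + x] += w;
        }
    }
}
```

Final pixel values are then value / weight, exactly the normalization already done in the write-out code at the top of this task.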

@Brecht Van Lommel (brecht), yeah I know that; the above code is not final. The images here are from updated code without MIS and CDFs, using a different table.

this is a quick test, using a Mitchell filter of width = 1, 64 spp

Thanks, but did you verify that it converges to the same result with many AA samples? It still looks more blurry to me.

Mohamed Sakr (msakr) added a comment. Edited Jul 25 2015, 6:26 AM

not the exact same result (there is a difference because the filter contributes weights to neighboring pixels); this is considered a bias, but the difference is a smoother image.
let me explain the main difference:

consider an edge (like a hair) which is very thin, so it is seen by some pixels and not by others:

pixels: 0  1  0
        0  0  1
        0  0  1

with filter it will be:

pixels: 0.10 0.75 0.2
        0.05 0.25 0.9
        0.00 0.15 0.9

these values are just illustrative (they are totally wrong for sure); the point is that when you zoom in on the image in Photoshop, you won't see a very harsh transition.
the error gets smaller with larger images (full HD, 4K, ..)

here is a larger image (1600 x 1200), 1600 spp. I'm using 2 different Mitchell filter widths (MIS with width 2, kernel filter with width 1); 1 image with the kernel filter, 1 without.
the Hallway image uses the same concept, but only 64 spp; I did this to show how it works in general (specular artifacts are spreading). The good point: we can remove the specular component from the filter spread, so it affects only its own pixel, not the neighboring pixels.
you can check how the edges look in both examples: edges with the filter look much better, and noise is also lower in the filtered Hallway (regardless of the specular artifacts).

not the exact same result (there is a difference because the filter contributes weights to neighboring pixels); this is considered a bias, but the difference is a smoother image.

No, it should converge to exactly the same result because both methods get values from neighbouring pixels. The filter importance sampling will sample an area bigger than one pixel which is really the same thing, it just has different noise characteristics.

All this can be done unbiased, I don't think there's any inherent bias in using a pixel filter.

hmm, explain more; there is something here that I'm missing and don't understand.
I think it is biased because when the camera casts a ray at pixel (x, y), the contribution should go only to that pixel; with this filter, the contribution goes to all surrounding pixels too.

See figures 3, 4, 5 in the filter importance sampling paper: http://lgdv.cs.fau.de/get/785. I can't really explain it better than them.

We are trying to render an image with a particular pixel filter, which may be bigger than a pixel. Whether that's wrong or right is a matter of taste; it depends on what kind of camera you're trying to simulate, what kind of screen you will be displaying the image on, how soft you want things to look, etc.

But when a user says they want to use a pixel filter of some size, then we try to render that. So if the filter is bigger than a pixel we want to use samples from outside that pixel, that's not biased, it's what the user asked for.

When we talk about bias in rendering we usually talk about biased/unbiased estimators: https://en.wikipedia.org/wiki/Bias_of_an_estimator

In simple terms, we are doing unbiased rendering if we can take many renders with different seeds and average them, and then get the correct result. That works with unbiased path tracing, but for example not with irradiance caching. In this pixel filter case, both methods are unbiased if implemented correctly.
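In estimator terms, the two methods under discussion can be sketched like this (notation assumed for illustration: f is the pixel filter, L the radiance, x_i the sample positions):

```latex
% Target pixel value (filter-weighted, normalized):
I = \frac{\int f(x)\, L(x)\, \mathrm{d}x}{\int f(x)\, \mathrm{d}x}

% Splatting: x_i sampled uniformly, filter evaluated as a weight:
\hat{I}_{\mathrm{splat}} = \frac{\sum_i f(x_i)\, L(x_i)}{\sum_i f(x_i)}

% Filter importance sampling: x_i \sim p(x) = f(x) / \int f(x)\,\mathrm{d}x :
\hat{I}_{\mathrm{FIS}} = \frac{1}{N} \sum_i L(x_i)
```

Both estimators converge to the same I as the sample count grows; they differ only in variance, which is the "different noise characteristics" mentioned above. Note also that FIS needs f to be non-negative to serve as a sampling density, which connects to Brecht's earlier caution about Mitchell and sinc.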

Mohamed Sakr (msakr) added a comment. Edited Jul 25 2015, 2:32 PM

oh :D , for a long time "unbiased" in my mind meant anything that uses connections (PT, BDPT, MLT, ..) with brute force and no smoothing,
so irradiance caching, light cache, photon mapping, ... are the typical biased algorithms (they use interpolation and merging), as their results are just interpolations (SPPM is consistent though, but still biased).

so I thought that since we are averaging pixel contributions, this is a bias :)
BTW, I guess it is still a bias, as it could be considered a "merge at the camera point"

I checked the paper, thanks for the link!! I'm wondering where you check raster_x and raster_y inside kernel_camera.h's camera_sample(); from the paper it appears that samples can be generated anywhere (possibly outside the whole buffer).

I'm wondering where you check raster_x and raster_y inside kernel_camera.h's camera_sample(); from the paper it appears that samples can be generated anywhere (possibly outside the whole buffer).

Yes, samples can be generated outside the buffer. I'm not sure what you mean by "checking" raster_x and raster_y. They are computed by importance sampling the filter function and can be arbitrarily far outside the pixel, which works fine without any further checks.

ah, I see, this is the benefit of the MIS filter :) (writing to the current pixel)

so to finalize the code, I have to change the buffers a little (as described earlier, the buffer will be a little larger by a margin of pixels depending on the filter size). This will also be a benefit later, when implementing the image denoising algorithm from paper (1), as it has to access/write to neighboring pixels as well.

about the code complexity, I will try to keep it as straightforward as possible, possibly with some #define macros.
for now, the kernel behaves correctly. I will update and provide a diff soon.