While investigating T82851: Undo bug while sculpting I realized that current general undo code has several issues, some being fairly critical and not trivial to solve (as is, too involved for a mere bugfix).
Here is current view of the situation (as I understand it), and some draft proposals to fix things and hopefully reach a better general state for undo code.
Current Status and Issues
1- Some undo steps will never be undone
Some undo modes (Sculpt and Image ones afaict) have a specific way to store and process their undo steps, which results in the fact that when undoing, they actually need to process the next step, and not the one that is passed to them (see sculpt_undosys_step_decode_undo e.g.).
This is all good as long as we stay in a same undo step type, but as soon as the step is from another type (like a global memfile one e.g.), its apply function is fully unaware of this, and therefore that first sculpt undo step never gets processed (undone).
2 - Some undo modes have useless redundant and critically buggy code
Again, Sculpt and Image undo code has loops over previous/next steps that are both useless, and critically buggy as they assume said previous/next steps are of the same undotype.
I'd assume this is inherited from the time where undo stacks where separated, but in a single common undo stack this is both useless (since main undo code like BKE_undosys_step_undo_with_data_ex already ensures that all previous/next steps have been processed in proper order), and obviously broken (due to lack of undo step type checking).
3 - Unclear half-not-implemented fully-not-working support of 'absolute' undo steps
Definition:
- A absolute step can be evaluated as its own and generate a valid data state, regardless of any other steps in-between it and the current data.
- A relative step depends on the neighbor steps' results, so going back to a specific step requires to sequentially apply all intermediary steps in-between it and the current data.
Currently, general undo system (in undo_system.c) assumes all steps to be relative (see e.g. the first loop in BKE_undosys_step_undo_with_data_ex). There are comments and pieces of code that seem to hint that absolute steps were supported, or intended to be supported at some point.
In addition, while memfile steps used to be absolute, it's no longer the case with new system for them either.
4 - Apparently unused GPencil undo code
GPencil undo code is tagged in ed_undo_step_impl as needing update to be properly integrated in new general undo stack, however it seems to never be used effectively. See T84703: Modernize (or Remove) GPencil Undo System for details.
...And more?
There are probably more issues more or less related to issues mentioned above (did not check our reports related to edits/memfile undos mismatches e.g.), but I think addressing those three points would already bring things to a saner state.
Proposals
1 - Some undo steps will never be undone
Think main undo system should deal with those cases. Would do it that way:
- All step_decode callbacks should only ever process the step that is given to them.
- We need a tag for undo types that require the step-after-target-one to be processed in undo case, so that main undo system knows how to deal with them, including within a mixed type of steps situation.
- E.g. when you undo from a sculpt step to a memfile one, you actually need to process both the sculpt undo step (exception case) and the memfile one.
2 - Some undo modes have useless redundant and critically buggy code
Those extra looping pieces of code should simply be removed imho. Afaict they are dead code from earlier versions of undo system?
Tried to get them actually do something in sculpt_undosys_step_decode_undo() in debug session, and in all cases those actually never do anything, besides setting step-to-be-processed as the next one instead of the target one e.g.
3 - Unclear half-not-implemented fully-not-working support of 'absolute' undo steps
Suggestion is to simply always assume all steps are relative ones. This will simplify logic and make code easier to follow.
This does imply some extra work when undoing a lot of steps at once (through the undo history menu e.g.), however I think this situation is not that common? And one could argue that on very heavy files like production ones, doing fifty relative undo steps affecting only a few data-blocks may very well be much quicker than re-reading the whole .blend file from memory, with all the implied depsgraph rebuild, updates, etc.