Page MenuHome

Fix T74067: Crash when the UI accesses stale data
Needs ReviewPublic

Authored by Campbell Barton (campbellbarton) on Wed, May 20, 6:26 AM.

Details

Summary

The report T74067 shows how the interface code can operator on stale data when buttons activate immediately after handling an undo event.

This is an experimental patch that changes event handling so operators always finish handling events so the user interface doesn't end up operating on freed/invalid.

This has the down-side that events are more likely to build up in the queue.

To mitigate this, the flag OPTYPE_REFRESH_BYPASS has been added so operators which are likely to run on key-repeat such as text entry or frame-change can handle multiple events without a full refresh.

Note that I'm not all that satisfied with this solution, changing all operators event handling behavior to account for an extremely rare bug. Having said this, I'm not sure of a better alternative.


Open Topics:

  • Could we avoid unnecessarily updates from early exiting the event queue by detecting which operators remove data?

    For ID's this is straightforward, for non-ID data this gets difficult, unless we rely on the operators themselves to return this information.
  • We could have a different solution that avoids existing the event loop early and instead only refreshes the UI.

    I'm wary of this as it means UI drawing would have two different code-paths which create the layout, adding some potential for small differences between being constructed in the main event loop and while handling the event queue - which could lead to bugs.

Diff Detail

Repository
rB Blender
Branch
TEMP-EVENT-HANDLE-POSTPONE-AFTER-OPERATORS-v2 (branched from master)
Build Status
Buildable 8146
Build 8146: arc lint + arc unit

Event Timeline

Campbell Barton (campbellbarton) requested review of this revision.Wed, May 20, 6:26 AM
Campbell Barton (campbellbarton) created this revision.
Campbell Barton (campbellbarton) edited the summary of this revision. (Show Details)
Campbell Barton (campbellbarton) edited the summary of this revision. (Show Details)
Campbell Barton (campbellbarton) edited the summary of this revision. (Show Details)

We could try to be smarter about when to break out of the queue, and only do it when there is an active UI button or a menu is open. It would solve T74067 I think? Not sure if there are other cases.

The performance impact of this is a problem I think. I guess it's fine in most cases, but then it might lead to subtle slowdowns that we never detect. Thinking of cases like:

  • Typing numeric input for transform of a heavy mesh
  • Mouse button down + mouse move when starting to paint a stroke
  • Timer events
  • Fake mouse move events to force a refresh
  • Window events
  • Modal operators in add-ons that always run

I'm not sure if these are an actual problem, I haven't tested.

Regarding only refreshing the UI. I think it is doable to pull the UI refresh part (ED_region_layout) out of wm_draw_update(), and use the same function for both cases. to avoid code duplication. However that refresh can also be slow, so we'd probably only refresh the active region. And also the dependency graph since UI drawing can use. And maybe notifiers? It indeed gets messy to have a special code path for this.

It seems to me that we keep reinventing new solutions to the same core problem: changing context, UI or non-UI data from within the handling loop.
E.g.: UI after-funcs, wm_exit_schedule_delayed(), RNA UI context changes like rna_Window_workspace_set()/rna_Window_workspace_update(), UI context changes through operators (e.g. see ND_LAYOUTBROWSE, or ND_LAYOUTDELETE), bScreen.skip_handling (an old hack of mine...). There are probably more.
This case could too be fixed by delaying execution of the undo operator (which could mean executing handlers our of order - maybe that's fine, but it could be prevented too).


So it seems to me we could use a more general way to do delayed execution for some handlers. I.e. a version of the after-funcs at WM-handlers level. They could be executed after the regular handlers loop, in a more sandboxed manner.
Two possible approaches for this:

  • Handlers could register a callback (e.g. WM_afterfunc_register()). RNA setters could call this too.
  • Handlers could get flagged as "destructive", so they automatically execute delayed. E.g. OP-types could have a flag for this. RNA properties could have such a flag too, which the UI code could check and somehow cause a further delayed execution of its UI after-funcs (e.g. through a new handler with the destructive flag set).

I see some details to figure out, but that's the gist of it.

Note that I'm not proposing huge changes. Whatever we do should be doable without big changes.

A user pressing mouse buttons and keys in a particular order should always give the same result. If the outcome depends on how fast Blender can execute operators or redraw, the UI becomes unreliable in heavy scenes. I don't think events should ever be handled out of order.

I've actually wanted to get rid of special notifiers like ND_LAYOUTBROWSE and ND_LAYOUTDELETE. It's really hard to ensure such delayed handling is actually correct, since any data involved may be changed or deleted in the meantime. It also doesn't work with Python scripting.

We could try to be smarter about when to break out of the queue, and only do it when there is an active UI button or a menu is open. It would solve T74067 I think? Not sure if there are other cases.

It's always possible that some event activates a button before redrawing, this would need to be double checked.

The performance impact of this is a problem I think. I guess it's fine in most cases, but then it might lead to subtle slowdowns that we never detect. Thinking of cases like:

  • Typing numeric input for transform of a heavy mesh
  • Mouse button down + mouse move when starting to paint a stroke
  • Timer events
  • Fake mouse move events to force a refresh
  • Window events
  • Modal operators in add-ons that always run

I'm not sure if these are an actual problem, I haven't tested.

This patch doesn't impact modal operators, window events, timers ... etc, only operators that invoke/exec & finish.

This means a modal operator _could_ remove a data-block, which the user then access in the UI, in practice modal operators tend not to remove data-blocks.

The kind of lag you experience tend to be for holding keys where each key-stroke runs an operator.

For the text editor this could be an issue, I tested a Python template duplicated to 10k lines, on my system this works acceptably, for 20k lines the lag is noticeable if you hold a key down.

The only case I found which was obviously a problem was changing frames, as you're likely to hold the left/right key to change the frame, and lag in this case is noticeable.

We could reverse logic with a flag for operators such as frame-change and text input not require a full refresh after executing (edit, the patch has been updated to work this way).

It seems to me that we keep reinventing new solutions to the same core problem: changing context, UI or non-UI data from within the handling loop.
E.g.: UI after-funcs, wm_exit_schedule_delayed(), RNA UI context changes like rna_Window_workspace_set()/rna_Window_workspace_update(), UI context changes through operators (e.g. see ND_LAYOUTBROWSE, or ND_LAYOUTDELETE), bScreen.skip_handling (an old hack of mine...). There are probably more.

Indeed, it would be nice to consolidate solutions.

This case could too be fixed by delaying execution of the undo operator (which could mean executing handlers our of order - maybe that's fine, but it could be prevented too).

I'd be wary of out-of-order execution, if it was done it would need to be some isolated case we can be sure is supported.


So it seems to me we could use a more general way to do delayed execution for some handlers. I.e. a version of the after-funcs at WM-handlers level. They could be executed after the regular handlers loop, in a more sandboxed manner.
Two possible approaches for this:

  • Handlers could register a callback (e.g. WM_afterfunc_register()). RNA setters could call this too.

I'm not sure how this resolves T74067?

  • Handlers could get flagged as "destructive", so they automatically execute delayed. E.g. OP-types could have a flag for this. RNA properties could have such a flag too, which the UI code could check and somehow cause a further delayed execution of its UI after-funcs (e.g. through a new handler with the destructive flag set).

This would be similar to this patch, we just make operators responsible for letting Blender know when data was removed.
It can work, and in practice maybe it's the best option. It's the kind of flag that will easily get ignored and add-on developers, likely even our own operators won't set it correctly.

I was thinking we could have a dirty-state which could be set by functions that free ID's modifiers... etc.
At least we could then only do full refreshes in cases where there it might be needed, although even this difficult to do in a manageable way, and still requires manual tagging.

Campbell Barton (campbellbarton) edited the summary of this revision. (Show Details)

Add flag for some operators not to exit the queue after executing (text editing & frame change).

Another case where delayed execution is proposed: D7812.

I too am skeptical about out-of-order executing. But I do think we can avoid it and still do something like delayed execution (explained in follow up comment).

  • Handlers could register a callback (e.g. WM_afterfunc_register()). RNA setters could call this too.

I'm not sure how this resolves T74067?

The undo operator would trigger the actual undo step through such an after-function (maybe a better name is WM_sandboxed_func_register()). That way a mouse move event would never be handled after undo and before the UI was redrawn.
Maybe I understood the precise chain of events leading to this issue wrong though. I didn't investigate the cause too deep.

This would be similar to this patch, we just make operators responsible for letting Blender know when data was removed.
It can work, and in practice maybe it's the best option. It's the kind of flag that will easily get ignored and add-on developers, likely even our own operators won't set it correctly.

But is this really a common enough issue that justifies such a drastic default behavior change? I rather only apply this special handling when there are actual issues to address.

I was thinking we could have a dirty-state which could be set by functions that free ID's modifiers... etc.
At least we could then only do full refreshes in cases where there it might be needed, although even this difficult to do in a manageable way, and still requires manual tagging.

It's indeed difficult. I guess doing it in RNA would go a long way?


I think one issue with early-breaking the event queue is that wm_window_process_events() sleeps for 5ms if there's no OS, timer or XR event added. So we'd have to check if all queues are empty there.
That may be what makes the lag noticeable in fact.

Giving this some more thought, I think the following can work for sandboxed execution (it's not out-of-order, so the term "delayed" is misleading):

  1. Handlers can register callbacks for sandboxed execution (WM_sandboxed_func_register()). Could also be a handler/OP flag or mentioned dirty-state tag.
  2. Once event code recognizes that, event handling is early-exited. If needed the callback or handler can be executed in protected context then.
  3. Main loop continues execution, refreshes the UI. Drawing may be skipped if there are pending events in window queues.
  4. On next main-loop iteration, remaining events are handled.

This effectively does what's been requested earlier. It pauses event handling, refreshes the UI (calls ED_region_layout() for affected regions) and eventually resumes event handling. However with this proposal it's done by running through a regular main-loop iteration, without pulling refresh code into event handling code.