
VSE timeline manipulation API
Open, Normal, Public

Description

This is a draft, any idea or suggestion is welcome.
To simplify VSE code and create a framework friendly to new features, I would propose the following set of changes:

Troy Sobotka once suggested implementing OTIO. If it does 90% of what this should do, it may be worth trying to implement. That said, even a custom API should be doable in under 1-2K LOC.

Metadata / timeline separation

Each sequence struct contains metadata mixed with timeline data.
Metadata would be, for example, video stream length, framerate, format, ...
Timeline data would be position on the timeline, length on the timeline, framerate on the timeline, ...

  • On file load, metadata are read and timeline data are calculated once, by an _arbitrary_ algorithm.
  • Any change in metadata (e.g. replacing the file) must not immediately result in any change of timeline data.
  • Timeline data may have its own struct, which will be consistent (as much as possible) across all sequence types.
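The separation above could be sketched as two records: a read-only metadata struct refreshed only when the file changes, and a mutable timeline struct computed once on load. This is a minimal Python sketch under stated assumptions; `SourceMetadata`, `TimelineData` and `initial_timeline_data` are hypothetical names, not existing Blender API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceMetadata:
    """Read-only facts about the source file; re-read only when the file changes."""
    stream_length: int   # length of the video stream, in source frames
    frame_rate: float    # native framerate of the file
    fmt: str             # container/codec format

@dataclass
class TimelineData:
    """Editable placement of the strip on the timeline; shared by all strip types."""
    position: float      # start position on the timeline, in timeline frames
    length: float        # length on the timeline, in timeline frames
    playback_rate: float # speed factor relative to the source framerate

def initial_timeline_data(meta: SourceMetadata, scene_fps: float) -> TimelineData:
    """Compute timeline data once, on file load. Later metadata changes
    (e.g. replacing the file) must not silently rewrite these values."""
    return TimelineData(
        position=0.0,
        length=meta.stream_length * scene_fps / meta.frame_rate,
        playback_rate=1.0,
    )
```

Keeping `SourceMetadata` frozen makes the "metadata must never be changed by timeline operations" rule explicit in the type system.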

Timeline API

  • This API should provide atomic functions for moving strips around the timeline. These are the only functions that may be, or contain, sequence-type-specific code.
  • Other functions built on top of these atomic functions can also be provided.
  • Functions should be as universal as possible.
  • A developer should be able to manipulate a strip at a higher level, without knowing anything about the strip type.
    • (Move strip by offset, move with handles, add n still frames, ...)
  • A developer should also be able to access timeline data directly, for features that aren't covered by API functions.
  • In some cases it may be beneficial to use units other than numbers of frames. It would be nice to create unitless code, but this is really optional.
  • Partial frames and custom playback speeds must be supported. This is the main reason why this change is needed. Support in the sequencer itself is a separate task.
  • To support custom playback rates, sequence metadata are needed when recalculating offsets. Metadata must never be changed unless the user changes the file or overrides some property. Metadata must always be referenced, never assumed or stored as timeline data.
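The atomic-versus-composite split above could look roughly like this. A hedged Python sketch only: `StripPlacement` and the operation names are invented for illustration and do not correspond to the actual Sequence struct fields or offsets.

```python
from dataclasses import dataclass

@dataclass
class StripPlacement:
    """Hypothetical timeline data for one strip (illustrative names)."""
    position: float        # start of the strip on the timeline, in frames
    content_offset: float  # how far into the source content the strip begins
    length: float          # visible length on the timeline, in frames

# --- atomic operations: the only place strip-type-specific code may live ---

def translate(strip: StripPlacement, delta: float) -> None:
    """Move the whole strip along the timeline; content is untouched."""
    strip.position += delta

def slip(strip: StripPlacement, delta: float) -> None:
    """Shift the content under fixed handles."""
    strip.content_offset += delta

def trim_left(strip: StripPlacement, delta: float) -> None:
    """Move the left handle; the right edge stays where it is."""
    strip.position += delta
    strip.content_offset += delta
    strip.length -= delta

# --- higher-level operation built only from atomics ---

def move_with_handles(strip: StripPlacement, delta: float) -> None:
    """Move the strip while keeping the content anchored in timeline space:
    translate, then slip the content back by the same amount."""
    translate(strip, delta)
    slip(strip, -delta)
```

The point of the sketch: the higher-level function never touches strip fields directly, so only the atomic layer ever needs to know about sequence types.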

Partial frames

Multiple sequences can have different frame/sample rates. In order to be able to do precise cuts, we need to support partial frames.

For the sequencer this means being able to set the playhead between two frame boundaries, and displaying this position sensibly, both visually and numerically.

For the timeline API this would mean being able to store positions and offsets with greater resolution than the framerate (it may be beneficial to use different units), and to report to the sequencer where the next (nearest) partial frame is when a partial-frame position is requested.
An open question is whether we allow strips to start at partial frames, or only to end at them. Personally, I would not allow starting at a partial frame.
The next open question would be maximum resolution. It may be beneficial to allow resolution up to the sample rate of a sound strip, but only for sound editing. This would allow us to edit "pops". Or perhaps to cut sound at a "zero crossing" to prevent "pops", though in that case it may be better to create an operator that adds quick fade-ins and fade-outs. In audio editing it is best to cut at a "zero crossing" and add a 20-100 sample long fade.
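The "report the nearest partial frame" idea could be a small helper along these lines. This sketch assumes positions are expressed in (possibly fractional) scene frames; the function name and unit choice are illustrative assumptions, not a proposed API.

```python
import math

def next_partial_frame(playhead: float, scene_fps: float, strip_fps: float,
                       strip_start: float = 0.0) -> float:
    """Return the next strip-frame boundary at or after `playhead`,
    expressed in (possibly fractional) scene frames.

    Example: with a 24 fps scene and a 30 fps strip, strip-frame
    boundaries fall every 24/30 = 0.8 scene frames."""
    # Convert the playhead position into strip-frame units.
    t = (playhead - strip_start) * strip_fps / scene_fps
    # Index of the next boundary (epsilon so an exact boundary maps to itself).
    n = math.ceil(t - 1e-9)
    # Convert that boundary back into scene-frame units.
    return strip_start + n * scene_fps / strip_fps
```

A cut operator would snap to this value instead of rounding to whole scene frames, which is what makes frame-accurate cuts in mixed-rate timelines possible.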

I would recommend that @Joerg Mueller (nexyon) implement an option for this quick fade, if it isn't implemented yet.
Another question for @Joerg Mueller (nexyon): can you support such precise cuts, if we define exact behavior of the timeline?

Playback speed animation

This is more a TODO for a very adventurous developer :)

In the sequencer, visualization of such events should be "pre-animated", e.g. when animating sound pitch, rather than showing changes in the waveform in real time, the waveform should be drawn pre-stretched locally to correspond to playback.
The Speed effect currently doesn't support animation of playback speed, but it should. This should be shown in the timeline by drawing a curve on the strip, by "frame markers" spreading apart/together, or both.

For the timeline API this means basically the same - calculation of the correct length and position of individual frames.
Both cases will require a "simulated run" of F-curves. This corresponds to evaluating a partial integral numerically; if this is not in the F-curve API, it may be added.
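The "simulated run" could be a plain numerical integration (trapezoid rule) over the animated speed curve. A hedged sketch, with `speed_at` standing in for F-curve evaluation; nothing here is existing F-curve API.

```python
from typing import Callable

def integrate_speed(speed_at: Callable[[float], float],
                    start: float, end: float,
                    samples_per_frame: int = 1) -> float:
    """Map a timeline frame range [start, end] to a length in source frames
    by integrating the animated playback speed with the trapezoid rule.

    `speed_at(t)` stands in for evaluating the speed F-curve at timeline
    frame t (including any modifiers stacked on top of it)."""
    n = max(1, int((end - start) * samples_per_frame))
    h = (end - start) / n
    total = 0.0
    for i in range(n):
        t0 = start + i * h
        t1 = start + (i + 1) * h
        total += 0.5 * (speed_at(t0) + speed_at(t1)) * h
    return total
```

Drawing "frame markers" then reduces to finding the timeline positions where this running integral crosses each whole source frame, which the same sampling loop can report incrementally.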

Links to relevant tasks / discussions:

T44808: User can not hear sound when drilled into a meta strip in the VSE
T53615: Import EDL bugs

Details

Type
Design

Event Timeline

Richard Antalik (ISS) triaged this task as Normal priority.

I'm missing a problem statement, an explanation of what this is intended to solve. Which advantages would this provide to end users? Or would it mainly be to improve the internal code and Python API?

It's also not clear how OpenTimelineIO relates, as I understand it that would primarily be useful as an interchange format and the sequencer would still store data in .blend files?

It is a TODO - something like T56950, but as a single task.
I wanted to write this down to set some vision, so developers can work on this problem and have some kind of "checklist" of what should be done.

I probably could do this on devtalk, but wanted to try some Phabricator features.

It's fine to put technical design docs like this here.

It just helps other people reading understand things, if there is a clear problem statement and intended effect for end users.

custom playback speeds must be supported. This is the main reason why this change is needed. Support in sequencer itself is separate task.

Being unable to run through footage faster than normal speed is one of the big bottlenecks with video editing in Blender. You use that all the time as a video editor, to save time. To the point that we currently use a hack to skip every third or every other frame in our add-on to save video editing time (we edit all our videos with Blender here).

For audio as well, being able to cut between frames would allow for more precise cuts. These are the two main end-user issues I see the ability to work with sub-frames solving.

I’m with @Brecht Van Lommel (brecht) on the subject; random concepts make for random design. Focus on the needs of the open movies to guide the nature of the design.

Regarding OTIO, it would be prudent for Blender to maintain a schema internally, which could interact with other modules. For example, the ability to reconstruct shot sequences in the compositor for various effects sequences in separate Blender projects would require this. Leaning on OTIO for consideration of what details might be in such a schema is probably a reasonable entry point. This would also make it feasible then, to serialize the internal schema to OTIO for taking things into and out of conforming from within Blender.

There is still the most glaring showstopper with regard to open movies, and it is front and centre for Spring: the offline / online handling of the new caching system needs to take pixel management into account so that projects like Spring can conform correctly. Currently the VSE is quite useless for this, with the entire system being somewhat hacked nonlinear if you look at the code. Properly sorting this out as a high-up-the-food-chain issue would open the potential for Spring to be conformed and graded under Blender. This in turn means having that cache support the offline / online approach at the buffer level.

Tricky stuff!

Ok, let me add my 2 - or more like 1000 - cents here. TL;DR: I list 7 major problems with Blender in relation to audio.

I've long been of the opinion that the VSE needs a complete redesign and rewrite from scratch. The messy code that is all over the place and the weird handling of sequence starts/lengths/ends, with the various offsets that I don't fully understand up to today, are major reasons for that. I just think that refactoring in this case would cost more time than starting from scratch. I could be wrong though. It certainly seems that this proposal is trying to go in this direction, partly addressing these major problems. So in this sense I have to disagree with @Brecht Van Lommel (brecht) and @Troy Sobotka (sobotka): the VSE code is broken to the extent that I don't think its development should be user or open-movie-project focused - it first needs to be fixed for developers so that they can work properly with it.

With this intro, I'd like to use this place to detail some of the audio-related issues with the VSE, which will also answer the questions put forward to me and address issues mentioned in the user feature requests in some of the links provided by @Peter Fog (tintwotin). If there's a better place to put this "rant", please let me know. The following list will detail some design decisions that I had to make when implementing the audio system, along with their implications - stuff that can be done and stuff that gets really hard because of these design decisions. I'm aware that Blender is not an audio application but a graphics application. It makes sense that it's not designed for audio, but I still hope that a redesign of the VSE can lead to a design with better support for audio and the issues arising around it.

Let's start with animation: Blender allows you to _animate everything_, which is a really nice concept. There are extensive possibilities to do animations: with F-curves and then modifiers on top of those, for example. Video runs at 24 to 240 fps, so evaluating the animation system that often is not that big of a deal. However, with audio the sample rate is usually 48 kHz and above. Calling the animation system evaluation functions for many different properties at this rate is just too slow for real-time audio processing. This is the reason why I had to introduce an animation caching system in the audio library.

The cache of course has its disadvantages. Properties that can be animated in the audio system are volume, panning, pitch, 3D location and orientation (of audio sources and the "camera"/listener), and the cache is updated on every frame change, which can be too late during playback. Thus, it can easily run out of sync with what the animation system knows - slightly moving a single handle of an F-curve has consequences for multiple frames, but the cache is only updated for the current frame, since always updating the whole timeline is too slow. To be able to do the latter, the user is provided with an "Update Animation Cache" button, which is not only an ugly hack but also horrible UI design, since users don't want to and shouldn't have to care about this at all.

Another disadvantage of the cache is that, since it's storing values, it has to sample the animation data at some sampling rate and later reconstruct it with linear interpolation. This sampling rate is currently the frame rate (fps) of the scene. I would have preferred to do it based on an absolute time value, but when the frame rate is changed in Blender all animation stays frame-based and is thus moved to different time points. Another detail: during audio mixing the animation cache is evaluated once per mixing buffer; you can set the size of the buffer for playback in the Blender user settings.
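The trade-off described (sample the animation once per frame at build time, then reconstruct values at audio rate by linear interpolation) could look roughly like this toy sketch. It is not the actual Audaspace code; `evaluate` stands in for an animation-system evaluation call, and all names are illustrative.

```python
class AnimationCache:
    """Toy sketch of a per-property animation cache sampled at scene fps."""

    def __init__(self, evaluate, frame_start: int, frame_end: int, fps: float):
        # Build time: one (slow) animation-system evaluation per frame.
        self.fps = fps
        self.frame_start = frame_start
        self.samples = [evaluate(f) for f in range(frame_start, frame_end + 1)]

    def value_at(self, time_sec: float) -> float:
        """Playback time: cheap linear interpolation, callable at audio rate.

        Note the limitation discussed above: this only knows the values
        captured at build time; an F-curve edit elsewhere on the timeline
        is invisible until the cache is rebuilt."""
        pos = time_sec * self.fps - self.frame_start
        i = int(pos)
        i = max(0, min(i, len(self.samples) - 2))       # clamp to valid segment
        frac = min(max(pos - i, 0.0), 1.0)              # clamp interpolation factor
        return self.samples[i] * (1.0 - frac) + self.samples[i + 1] * frac
```

The "Update Animation Cache" button then corresponds to rebuilding `samples` for the whole frame range, which is exactly the expensive step the design tries to avoid doing automatically.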

A related problem is pitch animation. Animating the pitch equals the animation of the playback speed of the audio. The audio system supports arbitrary pitch changes just fine - it has a very high quality resampler that easily deals with this since that is also required for the doppler effect. The problem with pitch animation arises with seeking. To do proper seeking of pitch animated audio, you would basically need to integrate the playback speed from the start of the audio file. Furthermore, to be really precise this needs to be done with exactly the same sampling that the playback will later do. Since this is a huge effort, it's currently simply not done. Seeking within a pitch animated sequence simply assumes the current pitch is constant and this will of course end up in the wrong position. Users can only hear the correct audio if they start playback at the beginning of the strip. Similar problems by the way also arise if you try to render the waveform of a pitch animated sequence.

Talking about being precise naturally leads to the next problem: sample exact cuts. During playback it is not enough if the animation system would simply tell the audio system: "Hey, this strip has to start playing back now." This works for rendering since the code runs synchronously with the rendering. But for audio you have to run asynchronously or you'll get either a lot of stuttering, clicking and pops, or wrong timing that is different every time you play back. The consequence of this was that I implemented a duplicate of the sequencing inside the audio system which needs to be kept up to date with what VSE knows such as strip starts, offsets and lengths. With this data structure the playback of a strip can then start precisely at the position it has to. To answer your questions @Richard Antalik (ISS): the values are stored as floats and therefore support sub-frame precision. This could also be used to cut to "zero cross", which is not implemented yet though.
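A toy illustration of sub-frame cut positions and the "zero cross" idea mentioned above, assuming audio is available as a plain list of float samples. This is not the actual audio system API; both helper names are invented for this example.

```python
def frame_to_sample(frame_pos: float, fps: float, sample_rate: int) -> int:
    """Convert a (possibly fractional) timeline frame position to the
    nearest audio sample index, enabling sample-exact cuts."""
    return round(frame_pos / fps * sample_rate)

def zero_cross_near(samples: list, index: int, window: int = 100) -> int:
    """Search within `window` samples of `index` for the closest sign
    change, a click-free place to cut."""
    best = None
    lo = max(1, index - window)
    hi = min(len(samples), index + window)
    for i in range(lo, hi):
        crosses = (samples[i - 1] <= 0.0 <= samples[i]
                   or samples[i] <= 0.0 <= samples[i - 1])
        if crosses and (best is None or abs(i - index) < abs(best - index)):
            best = i
    # Fall back to the requested position if no crossing is in range.
    return best if best is not None else index
```

A cut operator could compute `frame_to_sample(...)` for the requested partial-frame position and then optionally snap it with `zero_cross_near(...)`, or apply the short fade discussed above instead.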

Actually, there is another reason for the duplication of the sequencing system within the audio system: scene strips and simultaneous playback and mixdown support. The audio system actually manages the sequencing data separated into two types of data structures. One type is responsible for the actual data that is also available in the VSE, such as strip starts, offsets and lengths. The other stores data that is required for playback, such as handles to audio files and the current playback position within these files, together with a link to the other data structures. The latter can then exist multiple times in order to support playback at the same time as mixdown and scene strips. I actually wonder how this is done for video files, since those have similar problems - is there just one video file handle that is always seeked to the position where the current process/strip is located? If you don't understand this problem, think of this example: you have a simple scene with just one audio or video strip. In another scene you add a scene strip of this scene twice, overlapping and with a small offset. Now during playback, you have to read this one audio/video strip for both scene strips at different time points. That works badly if you have to seek within the file constantly.

A consequence of this separation into meta-information and playback-information is that dynamic processing is not really supported in this design. Consequently it is difficult to implement something like audio meters/VU meters that dynamically measure the playback audio. Basically the whole API of the audio system is quite high level and based on the separation of meta-information and playback-information. You can for example put together your signal processing graph (meta-information) and fire it off for playback (playback-information). You can fire it multiple times but once started you can't change the playback-information anymore. This also makes dynamically adding/removing effects impossible.

Speaking of effects, this is also something that's still missing from the VSE and would be nice to have: effects and mixing/blending for audio strips. This is not really a limitation of the audio system. As I just mentioned in the last paragraph, the audio system supports putting together a signal processing graph. The reason is more that I basically wanted to touch as little code of the VSE as possible, and adding the handling of audio effects to the VSE is pretty much the opposite of that.

This brings me to my last point, which is the integration into the rest of Blender. The VSE supports the output of the renderer and compositor as strips, but apart from that it's quite disconnected from the remaining functionality of Blender. For example, for video I imagine that quite a lot of stuff could be done with either the compositor or the VSE, but those two don't really work together. They are pretty disconnected, and effects that are supported by both probably have a lot of duplicated but still different code. It would be much nicer if the different parts and functionalities worked better together, i.e. the video editor, compositor and VSE. The audio system was originally designed to support all audio needs of Blender. This originally meant the VSE and the game engine. Then 3D audio objects were added as a third major pillar. Now that the game engine has been removed, two remain. The 3D audio objects work with the same sequencing system that the VSE works with; they are part of the same data structures. But this is not reflected in the user interface at all. On one hand, the timeline editing of the 3D audio objects is very cumbersome. It would be much nicer if you could edit their playback like strips in the VSE. On the other hand, the VSE strips would support 3D positioning, but again there is no UI for that. In the data structures it's the same.

I think I've covered the major issues. I have some ideas on how these problems could be solved, but the text is long enough already and I would really like to know what you think about these problems and how they could be solved without influencing you with my ideas.

I've long been of the opinion that the VSE needs a complete redesign and rewrite from scratch.

Bingo. 110% agreement. A design specification as per the various touch points is the first part here, with your knowledge on the audio side being huge.

since always updating the whole timeline is too slow

To be able to do the latter, the user is provided with an "Update Animation Cache" button which is not only an ugly hack but is also horrible UI design since users don't want to and shouldn't have to care about this at all.

This is precisely the same issue as between offline / online states. Perhaps we can roll audio into this offline / online state and have the background rendering cache audio as well? Seems like a perfect fit, in truth.

To do proper seeking of pitch animated audio, you would basically need to integrate the playback speed from the start of the audio file.

Pardon my ignorance, but is there an encoding that allows for background rendering of audio while also supporting arbitrary seeking? That is, part of the colossal problem with supporting (or entirely not supporting) codecs is that they are awful for work, which is why most nonlinear editing subsystems take a limited number of ingestion codecs to an intermediate “in system” variant. With Blender, this essentially means hitting the design needs, which would be a half float internal. Is there such an encoding for audio?

is there just one video file handle that is always seeked to the position where the current process/strip is located?

Last I looked, it relied on the garbage FFMPEG handle system. Having kicked it around a bit, the wiser approach would be a convert and store on load, then jettison the FFMPEG componentry.

Cut the garbage codecs down to a manageable subset of maybe two for ingestion, and assert they work properly within the internal design. I can’t count the number of hours I have logged helping folks out with broken ideas of codecs in Blender, but it is likely above a hundred.

For example, for video I imagine that quite a lot of stuff could be done with either the compositor or the VSE but those two don't really work together.

Nor should they. The idea of the VSE as an isolated component application though, is not helpful for Blender. For example, if we specifically look at the Spring project, a Strip / Shot View would be a more useful design paradigm within Blender; a time view of the sequence in question, with various assets linked. That is, closer to Hiero or now how Resolve has integrated a Shot / Strip View.

Doing effects in an NLE is just plain stupid, and doesn’t work towards the goal of projects such as Spring, where a Hiero / Resolve strip view would function extremely usefully for:

  • Logging (Audio)
  • Editorial (Audio)
  • Review (see Shotgun) (Audio)
  • Visual Effects View (after editorial, for conforming assets)
  • Grading (critical for Spring here)
  • Final mastering / conforming / audio muxing (Audio)

Each of those phases is utterly discrete, with unique needs, and the VSE currently fails miserably at all of them. Hence the redesign idea is 110% what is required, to give the next design a solid chance of getting Blender to the point where it can fully deliver an Open Movie as a baseline target.

I see no good reason for a rewrite from scratch, that almost always goes nowhere. And there is no one available to do it anyway, as far as I know.

Design and implementation can be fundamentally altered in incremental steps, which has a much higher chance of actually benefiting users. If the code ends up being completely rewritten at the end of many steps, that's fine, but don't plan based on that.

This design task was created by @Richard Antalik (ISS) with practical ideas to tackle specific problems, let's not go into too many different topics here.

Ok @Brecht Van Lommel (brecht). Since I'm not really an NLE guy, but rather a real-time guy, I'm also lacking some knowledge about that, @Troy Sobotka (sobotka). If you have any resources on what a proper NLE would be (doesn't have to be audio specific), that would help. Also, it might be interesting to talk to the actual Spring team in terms of what they would like or how it should work. Where/how should we continue this discussion?

I wrote this as a reference for contributors who would like to work on this.

The reason was that @Olly Funkster (Funkster) submitted a patch to introduce support for using inputs with different FPS rates.
Such a patch would introduce various workarounds that would have to be tested, and that is wasted effort.

In terms of a specification for a proper NLE, I say: a user can, without unnecessary steps, add source material, do edits with *expected* results, and preview (parts of) the timeline :)

Whatever we agree on, all subsystems should ideally use an identical (copied) database and libraries to calculate things - be consistent, in two words.

Discussion can take place, but it's better for it to be archived in a visible place for who knows how long :)

@Joerg Mueller (nexyon), if you want to make proposals for redesigning other parts of the sequencer, you can create other tasks here or perhaps start a topic on devtalk if it's more for general discussion without concrete plans.

As far as I know the main issue for the Spring team is poor playback and rendering performance. Being able to do color grading efficiently would be good too.