
Add support for Zstd compression for .blend files
Needs Review · Public

Authored by Lukas Stockner (lukasstockner97) on Sep 15 2019, 5:10 AM.

Details

Summary

Compressing blendfiles can help save a lot of disk space, but the slowdown while loading and saving is a major annoyance.
Currently Blender uses Zlib (aka gzip aka Deflate) for compression, but there are now several more modern algorithms that outperform it in every way.

In this patch, I decided on Zstandard (aka Zstd) for several reasons:

  • It is widely supported, both in other programs and libraries as well as in general-purpose compression utilities on Unix.
  • It is extremely flexible - spanning several orders of magnitude of compression speed depending on the level setting.
  • It is pretty much on the Pareto frontier for all of its configurations (meaning that no other algorithm is both faster and more efficient).

One downside of course is that older versions of Blender will not be able to read these files, but one can always just re-save them without compression or run the file through unzstd.

ToDos:

  • Versioning when loading older files: Currently, the Compression Method setting will be set to None when loading a gzipped file from an unpatched Blender (the reverse works).
  • Compression Level setting: I don't know if we want to support this, but I can see uses (e.g. turning up the compression before sending the file to someone). As a workaround, right now you can just run the uncompressed blendfile through the standalone zstd at any setting and it will load (see the magic-number sketch after this list).
  • MAYBE: Other algorithms? Something like LZ4 might even be interesting for in-memory undos.
  • Most of the build_files changes are untested - it builds on my Arch machine; that's all I can say so far.
  • CMake scripts for manually/statically building libzstd are missing.
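
For context on why that works: a loader can tell the on-disk variants apart by their leading bytes (an uncompressed .blend starts with "BLENDER", Gzip with 0x1F 0x8B, and every Zstd frame with the magic number 0xFD2FB528, stored little-endian). Below is a rough sketch of such a check - the function and enum names are made up for illustration and are not the actual readfile API:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the actual readfile API: classify a .blend file
 * by its leading bytes. */
typedef enum BlendFileKind {
  BLENDFILE_PLAIN,   /* starts with "BLENDER" */
  BLENDFILE_GZIP,    /* Gzip magic 0x1F 0x8B */
  BLENDFILE_ZSTD,    /* Zstd frame magic 0xFD2FB528, little-endian on disk */
  BLENDFILE_UNKNOWN,
} BlendFileKind;

static BlendFileKind blendfile_kind(const char *path)
{
  unsigned char header[7] = {0};
  FILE *fp = fopen(path, "rb");
  if (fp == NULL) {
    return BLENDFILE_UNKNOWN;
  }
  size_t len = fread(header, 1, sizeof(header), fp);
  fclose(fp);

  if (len == 7 && memcmp(header, "BLENDER", 7) == 0) {
    return BLENDFILE_PLAIN;
  }
  if (len >= 2 && header[0] == 0x1F && header[1] == 0x8B) {
    return BLENDFILE_GZIP;
  }
  if (len >= 4 && header[0] == 0x28 && header[1] == 0xB5 &&
      header[2] == 0x2F && header[3] == 0xFD) {
    return BLENDFILE_ZSTD;
  }
  return BLENDFILE_UNKNOWN;
}
```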

Benchmarks

Some benchmarks on a random massive file I had lying around:

| Method | Size on Disk | Time to load | Time to save |
| --- | --- | --- | --- |
| None | 863 MB | 1.449 sec | 0.333 sec |
| Zlib | 476 MB | 3.426 sec | 14.46 sec |
| Zstd | 455 MB | 1.505 sec | 1.630 sec |

All I/O is to/from ramdisk. The time to load also includes a few linked files iirc, so just look at the differences instead of the absolute times.

To validate that the implementation isn't bottlenecked somewhere, I ran the standalone zstd on the uncompressed file and it took 1.365 sec, which almost exactly matches the difference in the measurements above.

Diff Detail

Repository
rB Blender
Branch
zstd (branched from master)
Build Status
Buildable 10100
Build 10100: arc lint + arc unit

Event Timeline

Lukas Stockner (lukasstockner97) edited the summary of this revision.
Ray molenkamp (LazyDodo) requested changes to this revision. Sep 15 2019, 5:26 AM

How does this compare to D4402? I'll leave the merits of lz4 vs zstd to the other reviewers; however, from a platform perspective the Windows Explorer thumbnail DLL will have to be updated before this lands. (But I'd hold off on any work on that until it's somewhat clear what direction we're planning to go with this.)

This revision now requires changes to proceed. Sep 15 2019, 5:26 AM

Oh, well, I didn't know about that patch. I'll retest tomorrow with LZ4 included.

Regarding Thumbnails: Good point, same goes for some Python stuff I'd expect.

The patch looks good; hopefully it's not wasted effort. The LZ4 patch was listed here: T63728: Data, Assets & I/O Module.

A few considerations:

  • My intention was that LZ4 could be enabled by default, because saving can be faster when it's used (depending on the HDD speed). I've read that ZSTD at low compression levels approaches LZ4's speed, so the same advantage may hold for ZSTD.
  • Currently there is an optimization in file reading that delays reading the data from linked libraries (added because Spring files were spiking memory on load). This isn't used for ZLIB files, since seeking is _very_ slow. It would be interesting to check how LZ4/ZSTD seek performance compares, if it's even practical to enable seek for either of the formats.

    If we can't use seeking, this is a downside to enabling compression by default, at least for large projects that use linking. So it would be good to check if either compression can work well in this case.

    An alternative which allows seeking is to compress the content of BHead instead of the entire file.
  • I'd prefer not to have too many compression formats, for users having none/fast/best options is probably enough.

Comparison with LZ4 here (average of three runs):

| Algorithm | Saving | Loading | Size |
| --- | --- | --- | --- |
| Zstd -20 | 0.62 sec | 1.12 sec | 498 MB |
| Zstd -10 | 0.65 sec | 1.06 sec | 488 MB |
| Zstd -5 | 0.69 sec | 1.07 sec | 483 MB |
| Zstd -3 | 0.75 sec | 1.08 sec | 479 MB |
| Zstd -1 | 0.83 sec | 1.13 sec | 475 MB |
| Zstd 0 | 1.77 sec | 1.18 sec | 455 MB |
| Zstd 1 | 1.20 sec | 1.20 sec | 470 MB |
| Zstd 2 | 1.33 sec | 1.24 sec | 464 MB |
| Zstd 3 (default) | 1.76 sec | 1.19 sec | 455 MB |
| Zstd 5 | 7.68 sec | 1.19 sec | 452 MB |
| Zstd 10 | 21.09 sec | 1.21 sec | 443 MB |
| LZ4 | 0.81 sec | 0.97 sec | 491 MB |

Loading times aren't comparable to the original benchmark since I used userspace time instead of wall time.

Regarding seeking: I don't think either format supports it. However, we could eventually support compressing only the contents of each block as you suggested - that could even support selective compression (e.g. not recompressing packed images or not compressing the thumbnail to make that part easier). However, since that would need to store per-block compression flags, a modification to the BHead struct might be needed (maybe steal a byte from the SDNA index?)
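
To illustrate the per-block idea (this is not part of the patch): a hypothetical sketch where a flag bit is borrowed from the SDNA index field to mark a block's payload as compressed. The BHead layout below is only an approximation, the flag is invented, and a real implementation would also have to store the original uncompressed size somewhere.

```c
#include <stdlib.h>
#include <zstd.h>

/* Approximation of Blender's BHead, for illustration only. */
typedef struct BHead {
  int code, len;
  const void *old;
  int SDNAnr, nr;
} BHead;

/* Hypothetical: reserve a high bit of SDNAnr as "payload is compressed". */
#define BHEAD_FLAG_COMPRESSED (1 << 30)

/* Compress one block's payload; fall back to the raw data if compression
 * does not help (e.g. already-packed images). Returns the compressed buffer
 * or NULL if the caller should write the uncompressed payload. */
static void *block_compress(BHead *bhead, const void *data, int level)
{
  size_t bound = ZSTD_compressBound((size_t)bhead->len);
  void *buf = malloc(bound);
  size_t csize = ZSTD_compress(buf, bound, data, (size_t)bhead->len, level);

  if (ZSTD_isError(csize) || csize >= (size_t)bhead->len) {
    free(buf);
    return NULL;
  }
  bhead->SDNAnr |= BHEAD_FLAG_COMPRESSED;
  /* A real implementation would also keep the original size for reading. */
  bhead->len = (int)csize;
  return buf;
}
```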

Update on the per-block compression: It works, but the compression ratio suffers from the small blocks - the file from above is now 544MB on level 3. In theory dictionaries could help here, but I'm not sure how to implement that practically.

We could do some sort of chunking scheme that provides seekability on a granularity of a few MB, but that's not really practical either...

If it's already clear that Zstd is the format of choice, it might make sense to prioritize the reading part.

As far as I can see, just Zstd loading support on its own already has some benefits: It gives some additional compatibility (files from future versions with writing support could be loaded) and it allows advanced users to manually compress their files.

The open issues seem to be UI/options and thumbnails (or support in other .blend loaders in general, e.g. blendfile.py). Am I missing any here?

As for the seeking part: It doesn't seem like modern compression formats tend to support this. In the future, it might make sense to implement smarter approaches (e.g. chunking the file during writing and compressing chunks independently, storing big blobs such as packed images in their own chunks, accumulating smaller BHeads until a threshold is reached and then writing them as a chunk, or even trying to be smart and grouping related BHeads into chunks). However, that would require significant and breaking changes to the file format that go way beyond the current compression wrapper around the basic format, so I think it's best to leave it for the future (especially since the current compression also has this issue, so it's not a regression).

I looked into this some more and we might be able to make seeking work.

The key is the Zstd file format: It stores data as a series of frames, each of which is independently compressed. By default Zstd handles the splitting of data into frames by itself, but as far as I can see Blender could be smart about it and use one of the chunking methods described above to divide the stream of BHeads into frames. In addition, we could even store metadata such as a table of per-frame offsets in meta-frames in the Zstd format. Naive decompressors would skip that and just concatenate the frames, which is fine and results in a valid .blend, but a smart decompressor (such as Blender's file reading logic) could skip unneeded frames.

This way, we do not need to modify the underlying .blend format at all; one could still use standalone zstd tools to compress/decompress .blends (important for third-party tools), and we still get seeking support.

I'll have a look at implementing this.
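
To make that concrete, here is a minimal sketch (using the plain libzstd one-shot API) of writing each batched chunk as an independent frame and appending a seek table inside a skippable frame. Names are illustrative and error handling is omitted; the skippable-frame layout shown here is simplified - the official seekable format prescribes a specific magic variant and footer.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* One entry per compressed frame, to be collected into a seek table. */
typedef struct FrameEntry {
  uint32_t compressed_size;
  uint32_t uncompressed_size;
} FrameEntry;

/* Compress one batched chunk into its own, independent Zstd frame. */
static FrameEntry write_chunk_as_frame(FILE *fp, const void *chunk, size_t size, int level)
{
  size_t bound = ZSTD_compressBound(size);
  void *buf = malloc(bound);
  size_t csize = ZSTD_compress(buf, bound, chunk, size, level);

  fwrite(buf, 1, csize, fp);
  free(buf);

  FrameEntry entry = {(uint32_t)csize, (uint32_t)size};
  return entry;
}

/* Append the seek table inside a skippable frame: a magic number in the
 * range 0x184D2A50..0x184D2A5F, a 32-bit little-endian payload size, then
 * the payload. (Assumes a little-endian host for brevity.) */
static void write_seek_table(FILE *fp, const FrameEntry *entries, uint32_t num_frames)
{
  const uint32_t magic = 0x184D2A50u;
  const uint32_t payload_size = num_frames * (uint32_t)sizeof(FrameEntry);

  fwrite(&magic, sizeof(magic), 1, fp);
  fwrite(&payload_size, sizeof(payload_size), 1, fp);
  fwrite(entries, sizeof(FrameEntry), num_frames, fp);
}
```

Skippable frames use the magic numbers 0x184D2A50 through 0x184D2A5F; a decoder that doesn't understand the payload simply skips it, which is what keeps the file loadable by standalone zstd tools.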

Some more thoughts on how to implement this.

As mentioned, a Zstd file is a sequence of frames. Each frame is either a compressed block of data or a "skippable frame" containing arbitrary data that generic decompressors ignore.

As far as I can see, the writing code just writes a series of blocks, each with a BHead and data. It already has logic for collecting data into chunks (for the undo system), so it's easy to use this for Zstd as well and to write one frame per chunk.
On the reading side, Blender reads all BHeads, but optionally skips reading the data until it is needed.

To implement support for properly seeking data in the compressed file, we can use something like the proposed seekable Zstd format: The idea there is to append a skippable frame to the file which includes a table of frames and their offsets in the compressed and uncompressed stream. When we want to read a specific part of the file, we use this table to look up which frame it is in, then decompress the frame and grab the data.
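
As a sketch of that lookup (illustrative names, simplified error handling): walk the seek table to find the frame that covers a given uncompressed offset, decompress just that frame, and let the caller read from it. The FrameEntry struct mirrors the table written on the saving side, and the frames are assumed to start at file offset 0.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

typedef struct FrameEntry {
  uint32_t compressed_size;
  uint32_t uncompressed_size;
} FrameEntry;

/* Decompress the single frame that contains `offset` in the uncompressed
 * stream. Returns the frame's data and reports its size and start offset,
 * so the caller can read at (offset - *r_frame_start). */
static void *frame_read_at_offset(FILE *fp, const FrameEntry *table, uint32_t num_frames,
                                  uint64_t offset, size_t *r_frame_size, uint64_t *r_frame_start)
{
  uint64_t c_pos = 0, u_pos = 0; /* compressed / uncompressed positions */

  for (uint32_t i = 0; i < num_frames; i++) {
    if (offset < u_pos + table[i].uncompressed_size) {
      void *compressed = malloc(table[i].compressed_size);
      void *plain = malloc(table[i].uncompressed_size);

      fseek(fp, (long)c_pos, SEEK_SET);
      fread(compressed, 1, table[i].compressed_size, fp);
      size_t res = ZSTD_decompress(plain, table[i].uncompressed_size,
                                   compressed, table[i].compressed_size);
      free(compressed);

      if (ZSTD_isError(res)) {
        free(plain);
        return NULL;
      }
      *r_frame_size = table[i].uncompressed_size;
      *r_frame_start = u_pos;
      return plain;
    }
    c_pos += table[i].compressed_size;
    u_pos += table[i].uncompressed_size;
  }
  return NULL; /* offset lies beyond the end of the data */
}
```

A real implementation would precompute cumulative offsets and binary-search them instead of scanning the table linearly.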

There are two problems with this:

  • First of all, we of course want to avoid decompressing frames multiple times. Therefore, it would make sense to cache them - that way, the frame is decompressed when we first need it, and any future accesses to data inside the same frame can be served from the cache (a minimal cache sketch follows after this list). However, this increases memory use. For the special case of frames that only contain one block, we can free them from the cache once the data is read, but for frames containing multiple smaller blocks we need to keep them around.
  • In addition, since we read all BHeads, we would touch every single frame anyway. Therefore, I'd suggest keeping an additional copy of all BHeads in another skippable frame. The impact on file size shouldn't be bad since they compress fairly well. The alternative would be to read through the file once with caching disabled and then go through it again to get the needed data. Of course, this slows down loading times.
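
A minimal single-entry cache along the lines of the first point could look like this (purely illustrative; a real implementation could keep the last N frames instead of just one):

```c
#include <stdint.h>
#include <stdlib.h>

/* Keep the most recently decompressed frame around so that successive reads
 * from the same frame do not trigger repeated decompression. */
typedef struct FrameCache {
  uint32_t frame_index; /* which frame is currently cached */
  void *data;           /* uncompressed contents of that frame */
  size_t size;
  int valid;
} FrameCache;

static const void *frame_cache_get(FrameCache *cache, uint32_t frame_index,
                                   void *(*decompress_frame)(uint32_t index, size_t *r_size))
{
  if (cache->valid && cache->frame_index == frame_index) {
    return cache->data; /* cache hit: no decompression needed */
  }
  /* Cache miss: drop the old frame and decompress the requested one. */
  free(cache->data);
  cache->data = decompress_frame(frame_index, &cache->size);
  cache->frame_index = frame_index;
  cache->valid = (cache->data != NULL);
  return cache->data;
}
```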

So, it's possible, but there are some issues:

  • Either we store some redundant data, or we need to decompress everything at least once even if we only want a tiny part of the file.
  • Either we have the memory overhead of the cache or we have to decompress things multiple times (a tradeoff would be to keep the last N frames in cache. Also, from what I understand this is mostly a concern when loading many linked files, so maybe dropping the cache after each file is enough to make this a non-issue?)

As mentioned, I'll look into implementing a version that supports either option for both cases (with or without extra BHead frame, with or without frame cache). I'm not sure how the access patterns look in practice, so being able to test it and see the impact should help.
In the meantime, does anyone see issues with this so far?

Updated version, now with seeking support.

I ended up trying a lot of different approaches for caching, prefetching and additional out-of-band information in the file, but in the end the version that worked best was
the simple one - reusing the write batching code in writefile.c in order to write independent compressed frames, then including seek table information in the file according
to the official format.

This relies on file accesses being mostly in order - if the reads just skip all over the place, it will end up re-decompressing frames a lot. However, in my testing it
generally decompresses each frame about twice - once for the initial BHead load and once for the deferred data, implying really consistent ordering in the deferred loads.
That makes sense, though, considering that writefile.c generally writes out data in a very similar order.

The problem with the approach of storing just a list of BHeads was that libraries refer to ID datablocks by name, so we have to read all those as well. Of course I could
also store a copy of those, but at some point it starts to get messy and to restrict future changes to the reading code.

This version performs surprisingly well - I tested it on the .blend files from the Code Quest USB stick (Daily Dweebs and the Eevee demos), and the results were pretty good. Peak memory usage is only a few MB above that of uncompressed files, and loading times are sometimes even faster than for uncompressed files (e.g. 4-5 sec for Zstd vs. 11 sec for uncompressed with the Dweebs files). I have no idea how that's happening - as far as I can tell it's not a disk bottleneck, since the files are in the system page cache and the gap gets even bigger if I drop them from the cache before testing (also, the files are on an NVMe SSD, so that really shouldn't matter). Maybe it's system call overhead because we are making so many tiny reads?

For this version, I left out the changes to the compression option; the checkbox now just writes a Zstd file instead. I still think we should include more options (Uncompressed, a few Zstd levels and Gzip for backwards compatibility); the question is how to make it user-friendly - maybe just show preset options in the UI, but allow specifying the level through the operator arguments?

Here are the performance tables for reference (note that the last one was done in 2.90 while the other three were pre-rebase):

| Save Time (s) | None | Zlib | ZStandard 3 | ZStandard -1 |
| --- | --- | --- | --- | --- |
| Forest | 0.031 | 1.884 | 0.262 | 0.102 |
| Mr. Elephant | 0.052 | 1.711 | 0.430 | 0.152 |
| Spaceship | 0.016 | 0.538 | 0.194 | 0.073 |
| Temple | 0.104 | 7.846 | 0.677 | 0.248 |
| Wanderer | 0.171 | 7.021 | 1.252 | 0.639 |
| Wasp Bot | 0.047 | 1.543 | 0.430 | 0.196 |

| File Size (MB) | None | Zlib | ZStandard 3 | ZStandard -1 |
| --- | --- | --- | --- | --- |
| Forest | 105 | 71 | 65 | 72 |
| Mr. Elephant | 112 | 66 | 63 | 68 |
| Spaceship | 37 | 18 | 18 | 21 |
| Temple | 383 | 350 | 353 | 355 |
| Wanderer | 596 | 299 | 292 | 311 |
| Wasp Bot | 116 | 58 | 59 | 65 |
| Dweebs | 1231 | 423 | 327 | 415 |

| Load Time (s) | None | Zlib | ZStandard 3 | ZStandard -1 |
| --- | --- | --- | --- | --- |
| Forest | 0.150 | 0.436 | 0.155 | 0.0604 |
| Mr. Elephant | 0.169 | 0.413 | 0.172 | 0.1102 |
| Spaceship | 0.036 | 0.154 | 0.067 | 0.0429 |
| Temple | 0.247 | 1.563 | 0.281 | 0.1732 |
| Wanderer | 1.057 | 1.699 | 0.681 | 0.4878 |
| Wasp Bot | 0.079 | 0.404 | 0.163 | 0.1055 |
| Dweebs | 11.350 | 6.483 | 4.904 | 4.1679 |

| Peak mem (MB) | None | Zlib | ZStandard 3 | ZStandard -1 |
| --- | --- | --- | --- | --- |
| Forest | 112 | 216 | 114 | 114 |
| Mr. Elephant | 122 | 230 | 124 | 124 |
| Spaceship | 45 | 81 | 46 | 46 |
| Temple | 392 | 773 | 394 | 394 |
| Wanderer | 633 | 1205 | 635 | 634 |
| Wasp Bot | 123 | 238 | 125 | 125 |
| Dweebs | 1813 | 2701 | 1817 | 1816 |

One more thing: Since we compress data in independent frames now, we can move the actual compression into background threads!

There's a bit of additional complexity since the frames have to be written in the correct order, but that is totally doable.
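
Roughly, the scheme could look like the sketch below (plain pthreads rather than Blender's task scheduler, and all chunks are compressed before anything is written, instead of overlapping compression with writing as the actual patch would):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* One chunk of .blend data to compress, plus the slot its result lands in. */
typedef struct CompressJob {
  const void *src;
  size_t src_size;
  void *dst;
  size_t dst_size;
} CompressJob;

typedef struct JobQueue {
  CompressJob *jobs;
  size_t num_jobs;
  atomic_size_t next; /* index of the next unclaimed job */
  int level;
} JobQueue;

/* Workers grab jobs in any order; results stay in their per-index slots. */
static void *compress_worker(void *arg)
{
  JobQueue *queue = arg;
  for (;;) {
    size_t i = atomic_fetch_add(&queue->next, 1);
    if (i >= queue->num_jobs) {
      return NULL;
    }
    CompressJob *job = &queue->jobs[i];
    size_t bound = ZSTD_compressBound(job->src_size);
    job->dst = malloc(bound);
    job->dst_size = ZSTD_compress(job->dst, bound, job->src, job->src_size, queue->level);
  }
}

static void compress_and_write_ordered(FILE *fp, CompressJob *jobs, size_t num_jobs,
                                       int level, int num_threads)
{
  JobQueue queue;
  queue.jobs = jobs;
  queue.num_jobs = num_jobs;
  queue.level = level;
  atomic_init(&queue.next, 0);

  pthread_t *threads = malloc(sizeof(pthread_t) * (size_t)num_threads);
  for (int t = 0; t < num_threads; t++) {
    pthread_create(&threads[t], NULL, compress_worker, &queue);
  }
  for (int t = 0; t < num_threads; t++) {
    pthread_join(threads[t], NULL);
  }

  /* The frames must appear in the file in their original order. */
  for (size_t i = 0; i < num_jobs; i++) {
    fwrite(jobs[i].dst, 1, jobs[i].dst_size, fp);
    free(jobs[i].dst);
  }
  free(threads);
}
```

A streaming implementation would instead hand finished frames to the writer as soon as all earlier frames have been written, keeping memory use bounded.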

This provides a considerable speedup - for my test files, level 3 is now about as fast as level -1 used to be.

In theory, we could do something similar for reading - from an initial glance, frame accesses are sequential about 75-80% of the time,
so we might speed things up by scheduling decompression of the next frame in a background thread and hoping that we'll end up using it.

Of course, having a single background thread only provides a speedup up to the minimum of decompression time and other reading logic.
Since decompression time is likely to be quite a bit larger, I'd expect this improvement to be minor. For real speedups we'd need
multiple background threads, but that requires more aggressive prefetching, which reduces hit rate, so the extra performance might
be wasted on reading data we don't want (yet). Therefore, I've skipped this one for now; we could always do it later.

Rebased to latest master, would be great to get some feedback and/or review on this.