Page MenuHome

Buildbot: Migrate package archive format for Linux from tar.bz2 to tar.xz
ClosedPublic

Authored by Jens (JRottm) on Oct 27 2019, 4:09 PM.

Details

Summary

xz compresses 25% better than bz2, reducing download times and server load. The numbers:

blender-2.80-linux-glibc217-x86_64.tar.bz2 (release): 134 886 174 bytes
with xz: 96 181 604 bytes (-28.7%)
with xz -9: 93 871 548 bytes (-30.4%)

blender-2.81-7c1fbe24ca33-linux-glibc217-x86_64.tar.bz2 (beta): 173 600 363 bytes
with xz: 133 100 664 bytes (-23.3%)
with xz -9: 129 534 124 bytes (-25.4%)

xz also decompresses more than twice as fast as bz2, however compression needs four times as long (on my 7-year-old laptop 3-4 minutes instead of <1).

Also xz has become more common than bz2, e.g. Debian/Ubuntu deb packages have been xz-compressed for years, so the dpkg package manager as well as systemd and grub all depend on liblzma being present, whereas bz2 is becoming more and more optional.

Current Linux archives also include the UID/GID of whatever user account happens to be used for building by the blender.org infrastructure. If someone then installs these archives as root e.g. to /usr/local/... and doesn't pay full attention the files remain owned by a regular user, which is a serious security issue. This patch fixes that by setting the UID/GID to 0.

Diff Detail

Repository
rB Blender

Event Timeline

Jens (JRottm) created this revision.Oct 27 2019, 4:09 PM
Jens (JRottm) retitled this revision from Migrate package archive format for Linux from tar.bz2 to tar.xz. to Buildbot: Migrate package archive format for Linux from tar.bz2 to tar.xz.
Jens (JRottm) edited the summary of this revision. (Show Details)

Prev. diff was against 2.80, sorry, now master. Dropped my comments from diff summary, now repeated here:

Someone already suggested xz a year ago on devtalk.blender.org, but was almost completely ignored. So I made this diff, hoping it will help attract some attention.

Note that I'm new to blender, I *really* hope I'm doing this the right way. I found 2 places in the source where a blender package is built, one in build_files/buildbot/slave_pack.py, the other in build_files/cmake/packaging.cmake. They seem to be independent of each other? I think the blender.org infrastructure uses the /buildbot/ one, don't know when/if the /cmake/ one gets called, but I changed both.

Someone with more insight will need to have a look. Don't rely on this being commit-ready, it is more of a feature/change request with a diff tacked on to it so that people see what I'm talking about.

Do reviewers add themselves or do I add them? There's a warning sign, so I guess I do that...

Have you tried recompressing Blender release tarballs to see how much space would be saved in practice?

Yes, the "25%" figure in the summary resulted from doing just that.

In that case, please update the description and make it explicit what you're basing your claims on. For example, "xz compresses 25% better than bz2" is a very generic claim and doesn't mean much. Something like "On average xz compresses 25% better than bz2 when recompressing all the .tar.bz2 files in the 2.77 - 2.80 release files" is more specific, and as a result carries more weight and shows that you're not just copying some data from some website.

Generally looks fine, suggested some edits.

We could use this for source archives too, see: ./build_files/utils/build_tgz.sh although it could be done as a separate commit.

build_files/buildbot/slave_pack.py
70–75

Brief comment for why this is needed would be good.

build_files/package_spec/build_archive.py
55

The tar I'm using v1.32, has support for xz.

archive_cmd = ['tar', 'cJf', package_archive, package_dir]

To include compression level an environment variable needs to be set:

diff --git a/build_files/package_spec/build_archive.py b/build_files/package_spec/build_archive.py
index 5ca2f319d87..d576bc8df53 100755
--- a/build_files/package_spec/build_archive.py
+++ b/build_files/package_spec/build_archive.py
@@ -49,15 +49,18 @@ try:
     if not os.path.exists(output_dir):
         os.mkdir(output_dir)
 
+    archive_env = None
+
     if extension == 'zip':
         archive_cmd = ['zip', '-9', '-r', package_archive, package_dir]
-    elif extension == 'tar.bz2':
-        archive_cmd = ['tar', 'cjf', package_archive, package_dir]
+    elif extension == 'tar.xz':
+        archive_cmd = ['tar', 'cJf', package_archive, package_dir]
+        archive_env = {'XZ_OPT': '-9'}
     else:
         sys.stderr.write('Unknown archive extension: ' + extension)
         sys.exit(-1)
 
-    subprocess.call(archive_cmd)
+    subprocess.call(archive_cmd, env=archive_env)
 except Exception as ex:
     sys.stderr.write('Failed to create package archive: ' + str(ex) + '\n')
     sys.exit(1)

How commonly xz decompressors are in modern desktop distros?
Using xz for sources should be all fine, but for nightly builds we need to stick to something which is available "out of the box". If xz is commonly available it is not a problem.

Jens (JRottm) edited the summary of this revision. (Show Details)Oct 29 2019, 11:41 AM

@Sergey Sharybin (sergey):
xz has become more common than bz2, deb packages are xz-compressed, see updated diff summary

@Campbell Barton (campbellbarton):

Brief comment for why this is needed

Agreed.

could use this for source archives

Yes, I thought so, too (gz->xz: 52->36 MB, -31%). However build_tgz.sh is only called by "make tgz" and I can't see _either_ being called by buildbot. I don't know how the blender.org website gets its compressed source, but it doesn't seem like it's through this code?

compression level ... environment variable

Don't you think your method is actually a lot more complex than what I did? (Note that you still need the --owner=0 --group=0 in any case.)

Think this is fairly common now, the latest CentOS has it for e.g. expect all other mainstream distros too.

compression level ... environment variable

Don't you think your method is actually a lot more complex than what I did? (Note that you still need the --owner=0 --group=0 in any case.)

Neither seem all that complicated, however since we've been using it's built-in bz2 support, it may avoid unnecessary problems not to change this when we don't need to (adding --owner=0 --group=0 seems fine).

Don't you think your method is actually a lot more complex than what I did?

I think Campbell's approach is indeed slightly more complex (it adds environment variable handling), but it does follow a more common approach of using tar's J argument to produce XZ-compressed files.

(Note that you still need the --owner=0 --group=0 in any case.)

That's a different thing, and could have been presented in a different patch. Those options have nothing to do with XZ vs. BZip2 compression, so adding them in the same commit may be confusing further down the line. Note that I don't voice any opinion here about having those options or not, just that it's something that deserves the proper attention and motivation, rather than squeezing it into this patch.

build_files/package_spec/build_archive.py
55

archive_env = {'XZ_OPT': '-9'} should likely be replaced with archive_env = {**os.environ, 'XZ_OPT': '-9'} to get the correct semantics. When a not-None environment is passed, it completely replaces the parent environment.

Jens (JRottm) marked an inline comment as done.EditedOct 29 2019, 6:49 PM

@Campbell Barton (campbellbarton):

since we've been using it's built-in bz2 support, it may avoid unnecessary problems not to change this when we don't need to

Very sorry, "avoid ... not to ... don't" - with 3 negations I'm unsure if I understand you correctly. Are you saying my way is ok, too, or are you saying do prefer the env var way?

You said something about "change" and "unnecessary problems". Yes, going from '-J' to '-I xz' is a change. Adding an environment to the subprocess when there hasn't been one before is also a change. I tested both, both work.

I checked tar's changelog: xz support and '-J' were introduced 03/2009, whereas '-I...' is already supported since 12/2008. I'll change it to the full option '--use-compress-program', which is supported since 03/1993 (and should even work on FreeBSD, which '-I' doesn't).

Note that both '-J' and '-I xz' invoke the same external process, tar does not contain any compression code nor link any compressor library, it always calls the external command line tool. (Check with "ldd $(which tar)", there is no liblzma.)

Finally, xz having an env var XZ_OPT to allow passion options is a rare thing. Googled the manpages of compress, gzip, bzip, bzip2, 7z, lzma, lz4 and zstd, none of them mention an env var like this. Only lzop has something comparable, so if the compression algorithm gets updated again, chances are the env trick won't work any more.

Replaced -I with more compatible --use-compress-program.

Jens (JRottm) added a comment.EditedOct 29 2019, 7:30 PM

more common approach ... J argument

Yes, -J is common, but XZ_OPT very uncommon, I'd never known of it until today and no other compressor except lzop supports an env var like this (see my previous comment), so this likely will no longer work with whatever succeeds xz. OTOH --use-compress-program is supported since 1993 and even works when tar doesn't even have its own option for the compressor in question.

{**os.environ ... None ... completely replaces the parent environment

Supports my complexity point. Also, what if parent env already has a XZ_OPT var, will it still work or will you get the var twice? (Sorry, first time I do anything with Python at all.)

--owner=0 --group=0 ... could have been presented in a different patch

Yes, very valid point. I was unsure about that one myself, but eventually decided to put both in 1 patch, because both affect the same 1 line of code, and I thought 2 patches might overcomplicate matters.

You wrote "could have", which is not a "please do". If you _do_ want me to split the patches then I will.

build_files/package_spec/build_archive.py
55

This was intentional, I didn't think it was useful/important to pass any other environment variables to tar, OTOH, it's using them at the moment.

Updates patch:

diff --git a/build_files/package_spec/build_archive.py b/build_files/package_spec/build_archive.py
index 5ca2f319d87..d576bc8df53 100755
--- a/build_files/package_spec/build_archive.py
+++ b/build_files/package_spec/build_archive.py
@@ -49,15 +49,18 @@ try:
     if not os.path.exists(output_dir):
         os.mkdir(output_dir)
 
+    archive_env = os.environ.copy()
+
     if extension == 'zip':
         archive_cmd = ['zip', '-9', '-r', package_archive, package_dir]
-    elif extension == 'tar.bz2':
-        archive_cmd = ['tar', 'cjf', package_archive, package_dir]
+    elif extension == 'tar.xz':
+        archive_cmd = ['tar', 'cJf', package_archive, package_dir]
+        archive_env['XZ_OPT'] = '-9'
     else:
         sys.stderr.write('Unknown archive extension: ' + extension)
         sys.exit(-1)
 
-    subprocess.call(archive_cmd)
+    subprocess.call(archive_cmd, env=archive_env)
 except Exception as ex:
     sys.stderr.write('Failed to create package archive: ' + str(ex) + '\n')
     sys.exit(1)

Think this patch is close to being ready.

As for XZ_OPT / --use-compress-program, I wasn't sure if the invocation method increased risk of failure (broken pipe, argument quoting, errors not being properly forwarded - typical external process issues).

Seems it's mostly personal preference so no strong opinion.

build_files/buildbot/slave_pack.py
80

pass preset=9 here for high compression.

Added preset=9, thanks campbellbarton.

Jens (JRottm) marked an inline comment as done.Oct 31 2019, 5:07 PM

--use-compress-program ... increased risk of failure ... typical external process issues

I do understand this, however in this case you get the exact same external process either way, because tar doesn't contain nor link any compression code, it always calls the external tool in a pipe, see:

$ tar -cJ blender >/dev/null & sleep 1; ps -Hf
UID        PID  PPID  C STIME TTY          TIME CMD
jens      4653  4651  0 15:22 pts/1    00:00:00 bash -rcfile .bashrc
jens      5881  4653  0 16:14 pts/1    00:00:00   tar -cJ blender
jens      5883  5881  0 16:14 pts/1    00:00:00     /bin/sh -c xz
jens      5884  5883 99 16:14 pts/1    00:00:00       xz
jens      5885  4653  0 16:14 pts/1    00:00:00   ps -Hf

$ tar -cI xz blender >/dev/null & sleep 1; ps -Hf
UID        PID  PPID  C STIME TTY          TIME CMD
jens      4653  4651  0 15:22 pts/1    00:00:00 bash -rcfile .bashrc
jens      5886  4653  0 16:15 pts/1    00:00:00   tar -cI xz blender
jens      5888  5886  0 16:15 pts/1    00:00:00     /bin/sh -c xz
jens      5889  5888 99 16:15 pts/1    00:00:00       xz
jens      5890  4653  0 16:15 pts/1    00:00:00   ps -Hf

Same xz process whether you do -J or -I xz, buildbot has always been calling this external process.

So you actually don't avoid any problems with XZ_OPTS, but you introduce a lot of subtleties regarding the environment, the various revisions of the XZ_OPTS code in the past days are testament of that. And you change behaviour for MacOS's zip case, something I didn't want to touch.
Plus, as I already said, almost no other compressor has an env var for options, so if someone e.g. decides to switch to zstd in 5 years they'd have no choice but to switch to --use-compress-program anyway. It's the most compatible way.

Anything else I need to do to move forward? I think I've answered all questions?

If there is a way which avoids use of environment variable i'd choose that way.

I've added a server side change to support .xz archives on buildbot (mainly made it so older builds removal will work properly).

@Campbell Barton (campbellbarton), think you're most familiar with this change now. Mind taking care of moving forward here? I'm stuck in pile of other changes to the buildbot currently.

Campbell Barton (campbellbarton) accepted this revision.EditedNov 8 2019, 9:32 AM

While I don't have the build-bot setup - as far as I'm concerned this patch can be applied as-is.

This revision is now accepted and ready to land.Nov 8 2019, 9:32 AM

ping?

Is there still anything I'm expected to do? Any way I can assist?