Automated performance testing
Confirmed, Normal, Public, To Do

Description

There are multiple performance-related projects this year (T73359, T68908, T73360, Cycles). For this we need tools to measure performance for individual changes, and to automatically track performance in master over time.

Existing Infrastructure

I suggest building on Open Data, but taking functionality and ideas from the Cycles benchmarking scripts (which are a bit hacky).

Local Testing

The first step is for developers to be able to run tests locally.

  • Create a Python API and utilities for implementing tests given a Blender executable, returning JSON data in the Open Data format (a minimal sketch follows this list)
  • Implement a script to easily run tests locally and present the resulting data
  • Implement tests for:
    • Cycles rendering
    • Animation playback
    • Object and mesh editing operators
  • Put benchmark .blend files in lib/benchmarks
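
A minimal sketch of what such an API could look like; the Test class and the JSON structure here are assumptions for illustration, not the actual Open Data schema:

import abc
import json


class Test(abc.ABC):
    """A single benchmark run against a given Blender executable."""

    name: str = "unnamed"

    @abc.abstractmethod
    def run(self, blender_executable: str) -> float:
        """Execute the test and return the measured time in seconds."""


def run_tests(tests, blender_executable):
    """Run all tests and collect the results as Open Data-style JSON."""
    results = []
    for test in tests:
        elapsed = test.run(blender_executable)
        results.append({"test": test.name, "time": elapsed})
    return json.dumps({"data": results}, indent=2)
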
Online Testing

Once that infrastructure is in place, we can automate this further:

  • Set up buildbot to let developers run performance tests on their changes
    • By pushing a branch to a repository (or some other simple mechanism)
    • For a code review
    • Generate graph comparing performance before and after
  • Set up buildbot to run performance tests every night on master
    • Generate graph with performance over time
    • Have a way to mark certain commits or benchmark changes as compatibility breaking, so it's clear when numbers are not comparable
    • Make part of Open Data

Open Questions

  • Where would the code for running this live? Open data repository? tests/ folder in Blender repository?
  • How to easily compare multiple revisions? Would the script be responsible for building Blender too (like the Cycles benchmarking scripts)?
  • Who implements this?

Event Timeline

Brecht Van Lommel (brecht) changed the task status from Needs Triage to Confirmed. Fri, Mar 13, 4:07 PM
Brecht Van Lommel (brecht) created this task.

We don't immediately have to build out complicated infrastructure for this. But I thought it would be a good time to create this task now that we are starting work on performance projects.

The first step can just be gathering test files in lib/benchmarks.

Looking at rust-lang, they have a mechanism that stores the test results on the machine it was tested on when a test was changed, was run for the first time, or was forced by a flag.

The test would fail when the test input was the same as in the previous run (hash-based?) and the time was not within the expected bandwidth. For the animation tests it would help if Blender was able to run in the foreground.
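
A rough sketch of that mechanism; the file layout, the hash choice and the 10% tolerance are just placeholders:

import hashlib
import json
import pathlib


def check_against_baseline(test_name, blend_path, measured_time,
                           baseline_dir=pathlib.Path("baselines"),
                           tolerance=0.10, force=False):
    """Store a baseline on the first run (or when the input changed or the
    run is forced), otherwise fail when the time is outside the bandwidth."""
    input_hash = hashlib.sha1(pathlib.Path(blend_path).read_bytes()).hexdigest()
    baseline_file = baseline_dir / (test_name + ".json")

    if force or not baseline_file.exists():
        baseline_dir.mkdir(parents=True, exist_ok=True)
        baseline_file.write_text(json.dumps({"hash": input_hash, "time": measured_time}))
        return True

    baseline = json.loads(baseline_file.read_text())
    if baseline["hash"] != input_hash:
        # The test input changed: store a new baseline instead of comparing.
        baseline_file.write_text(json.dumps({"hash": input_hash, "time": measured_time}))
        return True

    # Same input: pass only if the time stays within the allowed bandwidth.
    return abs(measured_time - baseline["time"]) <= tolerance * baseline["time"]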

Some experience from creating the Cycles benchmarking scripts:

  • Tests should run with a dedicated Blender build that has all the proper build flags. Running performance tests with the build used for development means having to switch options too often and stops you from working while the tests run. Building should be handled by the test script.
  • It should be easy to run more tests with a different device, or different benchmark files, add extra revisions to bisect an issue, re-run a failed test, etc. Tests should be queued to run by another script, rather than manually having to manage when a test runs.
  • For Cycles I run each test 3 times, interleaved, and display the variance in graphs to detect tests with unpredictable performance (see the sketch after this list). Disable ASLR and Turbo Boost to get more predictable performance on the CPU. For Cycles renders, test times are usually within 0.1% between runs.
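
A minimal sketch of the interleaved runs and variance reporting; test.run() here stands in for whatever actually executes a single benchmark:

import statistics


def run_interleaved(tests, repeats=3):
    """Run each test `repeats` times, interleaved (A B C, A B C, ...) rather
    than back to back, and report the spread of the measurements."""
    times = {test.name: [] for test in tests}
    for _ in range(repeats):
        for test in tests:
            times[test.name].append(test.run())

    for name, samples in times.items():
        mean = statistics.mean(samples)
        spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
        print(f"{name}: {mean:.4f}s mean, {spread / mean * 100.0:.2f}% stdev")
    return times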

Looking at rust-lang, they have a mechanism that stores the test results on the machine it was tested on when a test was changed, was run for the first time, or was forced by a flag.
The test would fail when the test input was the same as in the previous run (hash-based?) and the time was not within the expected bandwidth.

I think that's more difficult in our case. We will have more complicated tests that you probably wouldn't run locally unless you were specifically working on performance? At least I'm not imagining these to be part of our ctests.

For the animation tests it would help if Blender was able to run in the foreground.

If we have dedicated machines for performance testing, they should have OpenGL to run such tests. For Blender itself, it's possible to run tests in the foreground; WITH_OPENGL_DRAW_TESTS does this, for example.

Dalai Felinto (dfelinto) changed the subtype of this task from "Report" to "To Do". Fri, Mar 13, 5:15 PM

If I understood correctly, a performance test could look like the following:

1) Build Blender
2) For each test case
        i. run init script (e.g. go to edit mode, get the correct context, etc.)
        ii. start clock
        iii. <run relevant script> 
        iv. stop clock
3) export results in Open Data .json format
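
Roughly like this, as a sketch of the timing wrapper that would run inside Blender; the init and payload callables are placeholders:

import json
import time


def run_timed_test(name, init, payload, output_path):
    """Run init outside the timed region, time only the payload, and write
    the result as JSON (a stand-in for the real Open Data format)."""
    init()  # e.g. go to edit mode, get the correct context

    start = time.perf_counter()            # start clock
    payload()                              # <run relevant script>
    elapsed = time.perf_counter() - start  # stop clock

    with open(output_path, "w") as f:
        json.dump({"test": name, "time": elapsed}, f, indent=2)
    return elapsed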

Animation:
We probably need a relative time measurement here, e.g. the actual playback time relative to the expected real-time duration (number of frames / fps).
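
For example, something like this could give a relative number, assuming the test runs inside Blender and that stepping frames with frame_set is an acceptable stand-in for real viewport playback:

import time
import bpy


def playback_ratio(scene=None):
    """Return measured time divided by the expected real-time duration
    (number of frames / fps); 1.0 means playback keeps up with real time."""
    scene = scene or bpy.context.scene
    frame_count = scene.frame_end - scene.frame_start + 1
    expected = frame_count / scene.render.fps

    start = time.perf_counter()
    for frame in range(scene.frame_start, scene.frame_end + 1):
        scene.frame_set(frame)  # evaluates the scene for this frame
    measured = time.perf_counter() - start

    return measured / expected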

Object/Mesh Operators:
For me this one is straightforward as long as no human input is needed. We could use a similar approach to the modifier regression testing, but use large meshes with large selections and restrict the time measurement to only the actual operator (see the sketch below). Is this also what you were thinking of? Or should we take a similar approach to the Cycles tests and have a complex scene with large objects and meshes, apply many modifiers and operators, and see how long the whole thing takes?
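
Something like this is what I have in mind, keeping the setup outside the measured region; the grid size and the subdivide operator are arbitrary examples:

import time
import bpy


def time_operator():
    # Setup: a dense grid with everything selected (not measured).
    bpy.ops.mesh.primitive_grid_add(x_subdivisions=512, y_subdivisions=512)
    bpy.ops.object.mode_set(mode='EDIT')
    bpy.ops.mesh.select_all(action='SELECT')

    # Measure only the operator itself.
    start = time.perf_counter()
    bpy.ops.mesh.subdivide()
    elapsed = time.perf_counter() - start

    bpy.ops.object.mode_set(mode='OBJECT')
    return elapsed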

Cycles:
Not sure what should be done there, the Cycles benchmark seems to do exactly what you want already?

Yes, that's all correct. Some of the tests might be artificial cases, but real-world scenes are definitely what I'm thinking of. We already have a system for Cycles; the purpose would just be to make all the performance tests use a single system that developers can run on their computers and that we can run every night on the buildbot as well. Beyond that, maybe include some in the Blender Benchmark.

Looking at Open Data, it's difficult to re-use a lot. There are about 500 lines of Python code, and in particular we can use the device detection from that. But part of it is also implemented in Go: things like running Blender with particular command-line options and parsing the Blender output to find the render time (a rough sketch of that part in Python is below). That I think we should have in Python; each type of test needs to work a bit differently there, and the abstraction should be higher.
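
A rough sketch of that part in Python; the regular expression is a guess at the "Time: ..." line Blender prints and would need checking against the actual output:

import re
import subprocess


def render_time(blender_executable, blend_file):
    """Render one frame in background mode and return the reported render
    time in seconds, or None if it could not be parsed from the output."""
    output = subprocess.run(
        [blender_executable, "--background", blend_file, "--render-frame", "1"],
        capture_output=True, text=True, check=True).stdout

    # This pattern targets the final "Time: MM:SS.ss" summary line.
    match = re.search(r"Time: (\d+):(\d+\.\d+)", output)
    if not match:
        return None
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)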

I think perhaps the best way forward would be to include this code in the Blender repository, in a way that there is a simple API that the Blender Benchmark can use and provide a nice UI for, but that can also be used by developers to test locally on their machines.

I prototyped something last weekend, based on my Cycles benchmarking code but cleaner. It's incomplete of course, no support for GPU devices, no graphs, no proper JSON data format, test implementations don't really measure the right thing, etc.
https://developer.blender.org/diffusion/B/browse/performance-test/tests/performance/

Example output from that:

$ ./tests/performance/benchmark
usage: benchmark <command> [<args>]

Commands:
  init                   Set up git worktree and build in ../benchmark

  list                   List available tests
  devices                List available devices

  run                    Execute benchmarks for current revision
  add                    Queue current revision to be benchmarked
  remove                 Remove current revision
  clear                  Remove all queued and completed benchmarks

  status                 List queued and completed tests

  server                 Run as server, executing queued revisions

Arguments for run, add, remove and status:
  --test <pattern>       Pattern to match test name, may include wildcards
  --device <device>      Use only specified device
  --revision <revision>  Use specified instead of current revision

$ ./tests/performance/benchmark list
cycles_wdas_cloud    CPU
undo_translation     CPU

$ ./tests/performance/benchmark run --test undo*
f9d8640              undo_translation     CPU        [done]     1.1524s
$ git checkout other-branch
$ ./tests/performance/benchmark run --test undo*
87c825e              undo_translation     CPU        [done]     1.1555s

$ ./tests/performance/benchmark status
f9d8640              undo_translation     CPU        [done]     1.1524s
87c825e              undo_translation     CPU        [done]     1.1555s

That actually looks nice. I just saw that the code is in the branch performance-test. I would very much like to contribute to it. What I would like to do next is:

  • Put a layer of abstraction between environment.Test and the actual test case, e.g. AnimationTest, such that adding a new test can be done by just adding a new spec, e.g. a list of parameters for a modifier or a .blend file path for animation or rendering (see the sketch after this list)
  • Implement an interface class for the following areas:
    • Animation
    • Modifiers
    • Operators (object and edit mode)
    • Compositor
    • Cycles
    • Custom script maybe?
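
The spec idea from the first point could look something like this; the class names, field names and file paths are illustrative, not what is in the branch:

class AnimationTest:
    def __init__(self, name, blend_file):
        self.name = name
        self.blend_file = blend_file


class OperatorTest:
    def __init__(self, name, blend_file, operator, parameters=None):
        self.name = name
        self.blend_file = blend_file
        self.operator = operator
        self.parameters = parameters or {}


# Adding a test means adding an entry here, not writing a new class.
TESTS = [
    AnimationTest("spring_playback", "animation/spring.blend"),
    OperatorTest("subdivide_dense_grid", "meshes/grid.blend",
                 "mesh.subdivide", {"number_cuts": 2}),
    OperatorTest("bevel_dense_grid", "meshes/grid.blend",
                 "mesh.bevel", {"offset": 0.1}),
]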

I still have to look at the Open Data benchmark more closely to understand the format and how a GUI would interact with the benchmark, and to see how much I can re-use for the input/output as well.

I was expecting AnimationTest itself to be that abstraction layer; multiple instances with different .blend files can already be generated. If there is more abstraction needed for particular types of tests that's fine, I'm just not sure what the concrete cases would be. "Custom script" I don't understand; that's what is intended to be possible already.

I wonder if we wouldn't be duplicating the regression test code too much; maybe we can share code. I think creating modifier regression tests should be simplified, so that the input is only a .blend file with a bunch of test mesh objects, and tests are defined fully by a line of Python code. Creating collections and expected objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.
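
For example, a one-line test definition could look like this; modifier_test() is a hypothetical helper, not part of the existing framework:

def modifier_test(object_name, modifier_type, **settings):
    """Describe one modifier regression/performance test as plain data."""
    return {"object": object_name, "modifier": modifier_type, "settings": settings}


MODIFIER_TESTS = [
    modifier_test("testCube", 'SUBSURF', levels=2),
    modifier_test("testMonkey", 'DECIMATE', ratio=0.5),
    modifier_test("testGrid", 'SOLIDIFY', thickness=0.1),
]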

Mainly I was thinking of actual production files and not so much synthetic tests, so I haven't thought about that design much.

Creating collections and expected objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.

Actually I was discussing just that with @Himanshi Kalra (calra). But we might need a few more adjustments to make the framework usable as a performance test.

If there is more abstraction needed for particular types of tests that's fine, I'm just not sure what the concrete cases would be

I was thinking about simplifying adding more tests. It might be simple for animation or Cycles, but operators need operator-specific parameters/selections, and I want to avoid creating a new class for each new (type of) operator.

I wonder if we wouldn't be duplicating the regression test code too much; maybe we can share code

Sure, I was thinking about creating the interface only. The implementation can use the code from regression tests.

Yes, I have added it as my first deliverable in GSoC. For regression testing it would be better if the user has a choice of doing it in Blender as well as with just a line of Python.