The solution for T50981 does NOT include starvation detection. In other words, if a job has tasks that have failed on every available worker (so that all workers are blacklisted for this job and task type), nothing detects that this has happened. As a result, the job stays stuck in 'active' status with no chance of ever finishing.
We should have the Manager regularly inspect queued tasks, to see if there is still at least one worker that is not blacklisted and able to execute them. If not, we can mark those tasks as 'failed' to reflect the actual failure on each worker.
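
A minimal sketch of what such a periodic check could look like, written in Python purely for illustration. The `Task` and `Worker` types, the `(worker_id, job_id, task_type)` blacklist layout, and the function names are all assumptions made for this example, not the Manager's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task record; field names are assumptions.
    task_id: str
    job_id: str
    task_type: str
    status: str = "queued"


@dataclass
class Worker:
    # Hypothetical worker record; field names are assumptions.
    worker_id: str
    supported_task_types: set


def find_starved_tasks(tasks, workers, blacklist):
    """Return the queued tasks that no worker can still execute.

    `blacklist` is assumed to be a set of (worker_id, job_id, task_type)
    tuples, mirroring the "per job & task type" blacklisting described above.
    """
    starved = []
    for task in tasks:
        if task.status != "queued":
            continue
        # A task is still runnable if at least one worker supports its
        # task type and is not blacklisted for this job and task type.
        runnable = any(
            task.task_type in worker.supported_task_types
            and (worker.worker_id, task.job_id, task.task_type) not in blacklist
            for worker in workers
        )
        if not runnable:
            starved.append(task)
    return starved


def fail_starved_tasks(tasks, workers, blacklist):
    """Mark every starved task as 'failed' and return the affected tasks."""
    starved = find_starved_tasks(tasks, workers, blacklist)
    for task in starved:
        task.status = "failed"
    return starved
```

The Manager would call something like `fail_starved_tasks()` on a timer. Note that in this sketch an empty worker list makes every queued task count as starved; whether that is desirable, or whether the check should only run while at least one worker is known, is left open here.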