Death, Taxes, and Render Farm Errors
Version: Deadline 8.0
Errors occur during rendering. It's just an accepted fact for anyone who's used or maintained a render farm. They can occur for a variety of reasons, including:
- Missing assets.
- Missing software or plugins, or mismatched versions.
- Unlicensed software or plugins.
- Lack of system resources.
- Lost network connectivity.
While we all dream of a render farm that's free of errors, we know they're ultimately unavoidable. What's important is how Deadline is able to detect render errors and handle them to keep your farm running as smoothly as possible, and that's what this blog entry will focus on. In addition, it will take a deeper dive into how you can customize the way Deadline responds to render errors.
VIEWING RENDER ERRORS
Before we dive into details about error detection, it's important to know how to view render errors in the Deadline Monitor. A quick way to tell if errors are occurring for your job is to look at the job's color in the Jobs Panel. If you see an active job turning a greenish-brown color, you'll know that errors are occurring, and you can view the Error column in the job list to see how many have occurred.
Note that if the jobs with errors are not changing color, it's likely due to the current settings in your Monitor Options. Specifically, if the Change Color Of Jobs/Tasks That Accumulate Errors settings under Job List and Task List are disabled, the jobs/tasks won't change color.
To view the errors, simply right-click on the job and select View Job Reports. This will show you all the reports for the job (logs, errors, and requeues). The error reports are red, and you can click on one to see the full report.
In this case, we can see that the render failed because the V-Ray renderer for Maya couldn't find a license. Note that while Deadline tries to include the exact cause of the error at the top of the report, sometimes it's necessary to scroll down and view the actual render log to put the error message in better context.
Additional information about viewing render errors can be found in the Monitoring Jobs section of the documentation.
Now that you know how to view the errors, let's take a look at how Deadline detects them.
Error detection is done exclusively by the various render plugins that are included with Deadline. While there are some standard checks in place across all the plugins, not all rendering applications report render errors in the same way. As a result, many render plugins have application-specific checks, which allow for more robust error detection and handling.
Because there are many ways that Deadline can detect errors while rendering a job, we're only going to focus on a few common ones. We're not going to dive too deep with these though, because then this would turn into a blog entry about plugin development, and we've already done one of those.
Almost every one of Deadline's render plugins requires you to specify the path(s) to the executable that will be used for rendering. This can be done from the Deadline Monitor while in Super User mode by selecting Tools > Configure Plugins. Then select the plugin you want to configure from the list on the left. For this example, we're going to choose Nuke.
You'll notice that you have the ability to specify multiple paths for each executable. When there are multiple paths specified, Deadline will go through them in order and use the first executable that exists on the render node. While performing this check, if none of the executables are found on the render node, Deadline will report an error that looks like this:
Error: Nuke 9.0 render executable could not be found in the semicolon separated list "C:\\Program Files\\Nuke9.0v1\\Nuke9.0.exe;/usr/local/Nuke9.0v1/Nuke9.0;/Applications/Nuke9.0v1/Nuke9.0v1.app/Contents/MacOS/Nuke9.0v1". The path to the render executable can be configured from the Plugin Configuration in the Deadline Monitor.
So when you see this error, you know that you either need to add a valid path for the executable in the plugin configuration, or you need to install the software on this render node.
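To make the "first path that exists wins" behavior concrete, here is a minimal Python sketch of that kind of search. It is not Deadline's actual implementation: the function name find_render_executable and its exists parameter are hypothetical, made up purely for illustration.

```python
import os


def find_render_executable(path_list, exists=os.path.isfile):
    """Return the first existing path from a semicolon-separated list, or None.

    Hypothetical helper for illustration only; not part of Deadline's API.
    """
    for candidate in path_list.split(";"):
        candidate = candidate.strip()
        if candidate and exists(candidate):
            return candidate
    return None


# The same kind of list you would enter in the plugin configuration.
paths = r"C:\Program Files\Nuke9.0v1\Nuke9.0.exe;/usr/local/Nuke9.0v1/Nuke9.0"

executable = find_render_executable(paths)
if executable is None:
    print("Error: render executable could not be found in the list:", paths)
else:
    print("Rendering with", executable)
```

Because the paths are checked in order, a single list can cover Windows, Linux, and Mac OS X render nodes at the same time.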
When a render plugin is wrapping a command line render, it will always check the exit code of the rendering application when it exits. Typically, command line applications will return an exit code of zero if the render succeeded, so by default, Deadline will treat any non-zero exit code as an error. When this happens, you will likely see something like this show up in a job error report:
Error: Renderer returned non-zero error code 208. Check the log for more information.
Note that the exit code itself is not always enough to determine the actual cause of the error, so when you see this error, it is recommended to read the rest of the render log in the error report to try and find the actual cause.
While this check is pretty reliable, there are rare cases where an application will return a non-zero exit code when the render succeeds. If you are seeing this behavior, contact Thinkbox Support so we can determine if we need to update a specific render plugin to ignore the exit code.
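As a rough illustration of the exit code check, the Python sketch below wraps a command line process and treats any non-zero exit code as an error. The names run_render, RenderError, and the ignore_exit_code flag are assumptions for this example and are not how Deadline's plugins are actually written. A small Python one-liner stands in for the renderer.

```python
import subprocess
import sys


class RenderError(Exception):
    """Raised when the wrapped renderer reports a failure (hypothetical)."""


def run_render(command, ignore_exit_code=False):
    """Run a command line render and fail on a non-zero exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0 and not ignore_exit_code:
        raise RenderError(
            "Renderer returned non-zero error code %d. "
            "Check the log for more information." % result.returncode
        )
    return result.stdout


# Stand-ins for a real renderer: one succeeds, one exits with code 208.
good_render = [sys.executable, "-c", "print('frame done')"]
bad_render = [sys.executable, "-c", "import sys; sys.exit(208)"]

print(run_render(good_render).strip())
try:
    run_render(bad_render)
except RenderError as error:
    print("Error:", error)
```

The ignore_exit_code flag mirrors the rare case described above, where an application returns a non-zero exit code even though the render succeeded.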
Some render plugins don't simply wrap a command line render. Instead, they launch the rendering application in a "listening" mode and feed it commands to control the rendering process. While more complex than command line plugins, they offer more flexibility, including the ability to keep the scene file for a job in memory between tasks to reduce rendering overhead.
In these cases though, checking the exit code of the application isn't an option. Instead, Deadline actively monitors the process and will automatically detect if the application exits unexpectedly. When this happens, you'll see an error like this in the error report:
Error: Monitored managed process "Fusion" has exited or been terminated.
This is an example of where the error message itself doesn't tell the full story. But if you scroll down the error report and view the render log, it should hopefully explain why the application crashed.
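The idea of actively monitoring a process can be sketched in a few lines of Python. This is only an assumption about the general technique (polling the process until it goes away), not Deadline's real monitoring code, and watch_managed_process is a hypothetical name.

```python
import subprocess
import sys
import time


def watch_managed_process(process, name, poll_interval=0.05, timeout=10.0):
    """Poll a process; return an error message if it exits, or None on timeout.

    Hypothetical helper for illustration only.
    """
    stop_at = time.monotonic() + timeout
    while time.monotonic() < stop_at:
        if process.poll() is not None:
            return (
                'Monitored managed process "%s" has exited or been terminated.'
                % name
            )
        time.sleep(poll_interval)
    return None


# Stand-in for an application in "listening" mode that crashes right away.
crashing_app = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
print(watch_managed_process(crashing_app, "Fusion"))
```

Note that the message only says the process went away. The reason, if the application printed one, is in the render log.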
There are errors that can occur during a render that won't cause the application to return a non-zero exit code or crash. For example, an asset file could be missing, and the rendering application doesn't treat this as a fatal error. Instead, it prints out a warning to the render log and continues rendering as if nothing is wrong. The problem with errors like this is that if they go undetected, a job could spend hours rendering incorrect results.
This is a case where application-specific error detection becomes useful. Render plugins can define "Stdout Handlers", which use regular expressions to match specific text that the rendering application writes to the render log. When the text is matched, it triggers a response that can fail the render with a useful error message. For example, Cinema 4D prints out the following when it detects a missing asset:
Rendering failed: Asset missing
To handle this, Deadline's Cinema 4D render plugin has a Stdout Handler defined that looks for this text and fails the render when it's detected. While the error is pretty clear, it would be good to know which asset is missing. Thankfully, if you scroll down the error report and view the full log, you'll find the info you're looking for:
2015-09-07 15:29:28:0: STDOUT: Loading Project: R:\server\scenes\scene.c4d
2015-09-07 15:29:28:0: STDOUT: Rendering frame 1072 at <Mon Sep 07 15:29:28 2015>
2015-09-07 15:29:28:0: STDOUT: Rendering Phase: Setup
2015-09-07 15:29:28:0: STDOUT: Progress: 0%
2015-09-07 15:29:32:0: STDOUT: Texture Error: input_movie.mov (Mat)
2015-09-07 15:29:32:0: STDOUT: Rendering failed: Asset missing
If you discover a fatal error message being printed to the render log of one of your jobs that isn't being detected by Deadline, contact Thinkbox Support so we can determine if we need to add a new Stdout Handler to a specific render plugin. The render log is very useful in this case, because it shows us the message we need to handle.
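To show the general idea behind a Stdout Handler, here is a small standalone Python sketch that pairs regular expressions with callbacks and runs each log line through them. It does not use Deadline's plugin API. STDOUT_HANDLERS, process_stdout_line, fail_render, and RenderFailed are all hypothetical names, and the pattern is just one plausible way to match the Cinema 4D message above.

```python
import re


class RenderFailed(Exception):
    """Raised by a handler to fail the render (hypothetical)."""


def fail_render(match):
    # Use the matched text as the error message for the report.
    raise RenderFailed(match.group(0))


# Each handler pairs a regular expression with the callback it triggers.
STDOUT_HANDLERS = [
    (re.compile(r"Rendering failed: .*"), fail_render),
]


def process_stdout_line(line, handlers=STDOUT_HANDLERS):
    """Run one line of render output through every handler."""
    for pattern, callback in handlers:
        match = pattern.search(line)
        if match:
            callback(match)


log_lines = [
    "Rendering Phase: Setup",
    "Progress: 0%",
    "Texture Error: input_movie.mov (Mat)",
    "Rendering failed: Asset missing",
]

for line in log_lines:
    try:
        process_stdout_line(line)
    except RenderFailed as error:
        print("Error:", error)
        break
```

In this sketch the "Texture Error" line passes through untouched, which is why the full log is still needed to find out which asset was missing.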
Some applications will actually display a blocking popup message when an error occurs. These are bad, because they will stall the render until someone manually closes the popup dialog. To avoid this problem, Deadline has built-in popup detection (on Windows render nodes only), and will automatically fail a render whenever a popup is detected. When this happens, you will see an error message that looks something like this:
An exception occurred: Error: Dialog popup detected: Title "Rhinoceros 4.0 Startup Template Error", Message "You must select a file name first"
If you scroll down the render log in the error report, you'll even see a dump of the popup contents. For example:
0: INFO: Detected popup dialog "Rhinoceros 4.0 Startup Template Error".
0: INFO: ---- dump of dialog ----
0: INFO: Button: OK
0: INFO: Static:
0: INFO: Static: You must select a file name first
0: INFO: ---- end dump of dialog ----
Note that there are cases where a popup isn't actually an error. In these cases, if Deadline could handle the popup by "pressing" a specific button, it would allow the render to continue. This is where Deadline's Popup Handlers come into play. Like Stdout Handlers, these are defined on a per application basis in the render plugin, and they tell Deadline to "press" a button on the popup to allow the render to continue.
If you have a job that is failing to render because of a popup, but you think the popup could be handled by Deadline, contact Thinkbox Support so we can determine if we need to add a new Popup Handler to a specific render plugin. The dump of the dialog is very useful in this case, because it shows us the available buttons to press.
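Conceptually, a Popup Handler is a lookup from a dialog to the button that should be "pressed". The Python sketch below illustrates that idea only. It is not Deadline's plugin API, and the "Hypothetical Update Available" dialog, the POPUP_HANDLERS table, and the handle_popup function are all invented for this example.

```python
import re

# Each handler pairs a dialog title pattern with the button to "press".
# The dialog title below is made up purely for illustration.
POPUP_HANDLERS = [
    (re.compile(r"Hypothetical Update Available"), "Skip"),
]


def handle_popup(title, buttons, handlers=POPUP_HANDLERS):
    """Return the button to press for a known popup, or None if unhandled."""
    for pattern, button in handlers:
        if pattern.search(title) and button in buttons:
            return button
    return None


# A harmless popup that has a handler: the render can continue.
print(handle_popup("Hypothetical Update Available", ["Install", "Skip"]))

# The Rhino popup from the example above has no handler, so the render fails.
button = handle_popup("Rhinoceros 4.0 Startup Template Error", ["OK"])
if button is None:
    print("Error: Dialog popup detected and no handler is defined for it.")
```

This is also why the dump of the dialog matters: a handler can only press a button that actually exists on the popup.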
So far, we've covered how to view render errors, and how Deadline detects them. Now let's focus on how Deadline responds to errors, which can be configured in the Repository Options in the Deadline Monitor. To view the settings, make sure you're in Super User Mode, and then select Tools > Configure Repository Options. Then select Job Settings in the list on the left, and select the Failure Detection tab.
Deadline comes with default Failure Detection settings. There are two settings I want to talk about first, since they are enabled by default. We can then cover the rest of the settings in more detail.
The first is the Mark a job as failed after it has generated this many errors setting under Job Failure Detection. By default, this is set to 100. This means that if a job generates 100 errors, it will be marked as failed, and will no longer be picked up by render nodes until the job is manually resumed. This ensures that a single problematic job won't be attempted indefinitely, preventing render nodes from moving on to other jobs.
The second is the Pick next available task for a job if the job's previous task generated an error setting under Worker Failure Detection. This setting is enabled by default, which means if a render node reports an error for Task 0 of a job, it will then try to render Task 1 instead of trying Task 0 again. This ensures a single problematic task won't prevent the rest of the job from rendering. In addition, if the render node renders Task 1 successfully, it will go back to attempt Task 0 again in case the original error was just a random glitch.
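The task-picking behavior can be sketched as a small Python function. This is only a guess at the ordering logic based on the description above, not Deadline's actual scheduler, and pick_next_task with its parameters is a hypothetical name. It assumes tasks are identified by sortable integer IDs.

```python
def pick_next_task(pending, last_task=None, last_failed=False):
    """Choose which pending task a render node should try next.

    Hypothetical sketch: after an error, prefer a different task; otherwise
    take the lowest pending task, which returns to any task skipped earlier.
    """
    if not pending:
        return None
    ordered = sorted(pending)
    if last_failed and last_task is not None:
        later = [task for task in ordered if task > last_task]
        earlier = [task for task in ordered if task < last_task]
        if later or earlier:
            return (later + earlier)[0]
    return ordered[0]


# Task 0 just generated an error, so the node moves on to Task 1.
print(pick_next_task([0, 1, 2], last_task=0, last_failed=True))

# Task 1 rendered successfully, so the node goes back to Task 0.
print(pick_next_task([0, 2], last_task=1, last_failed=False))
```

If the failed task is the only one left, the sketch simply returns it again, since there is nothing else in the job to try.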
OTHER FAILURE DETECTION SETTINGS
Now let's cover all the Failure Detection settings in detail.
Job Failure Detection:
- Send a warning to the job's user after it has generated this many errors: If enabled, a warning will be sent to the job's user (and anyone else on the job's notification list) once it has accumulated this many errors. This value should be less than the number of errors required to fail the job (typically, half is a good amount), which ensures the job's user will have time to review the errors before the job fails.
- Mark a job as failed after it has generated this many errors: We covered this above.
- Mark a task as failed after it has generated this many errors: If enabled, an individual task for a job will be marked as failed when that task accumulates this many errors. This can be used to ensure a problematic task won't cause the rest of the job to fail.
- Automatically delete corrupted jobs from the Repository: If enabled, a job that is found to be corrupted will be automatically removed from the render farm. To be honest, this probably belongs in the Cleanup tab, not the Failure Detection tab, but that's an internal discussion for us to have.
- Maximum Number of Job Error Reports Allowed: This is the maximum number of error reports each job can generate, and this is capped for performance concerns. Otherwise, an unchecked job could potentially generate tens or hundreds of thousands of error reports, which would take a while to view in the Deadline Monitor. Once a job generates this many errors, it will fail and cannot be resumed until some of its error reports are deleted or this value is increased.
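To see how the warning and failure thresholds play off each other, here is a minimal Python sketch of an error counter. It is an illustration under assumptions, not Deadline's code: JobErrorTracker is a hypothetical class, the fail limit of 100 comes from the default mentioned above, and the warning limit of 50 just follows the "half is a good amount" suggestion.

```python
class JobErrorTracker:
    """Counts errors for one job and reports which thresholds were crossed."""

    def __init__(self, fail_limit=100, warn_limit=50):
        self.fail_limit = fail_limit
        self.warn_limit = warn_limit
        self.errors = 0
        self.warned = False
        self.failed = False

    def record_error(self):
        """Record one error and return the actions it triggers."""
        actions = []
        self.errors += 1
        if not self.warned and self.errors >= self.warn_limit:
            self.warned = True
            actions.append("warn")
        if not self.failed and self.errors >= self.fail_limit:
            self.failed = True
            actions.append("fail")
        return actions


tracker = JobErrorTracker()
for _ in range(100):
    for action in tracker.record_error():
        print("After %d errors: %s" % (tracker.errors, action))
```

With these values, the job's user hears about the problem 50 errors before the job is marked as failed.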
Worker Failure Detection:
- Pick next available task for a job if the job's previous task generated an error: We covered this above as well. The only thing I'll add is that this setting is ignored by Sequential jobs, since those jobs require that each task be rendered in sequential order.
- Send a warning after a Worker has generated this many errors for a job in a row: If enabled, an email notification will be sent if a Deadline Worker generates this many errors in a single session. The email is sent to the email addresses specified for the Worker Error Warning setting in the Email Notification options.
- Mark a Worker as bad after it has generated this many errors for a job in a row: If a Deadline Worker generates this many errors for the same job, it will be marked as "bad" for that job. When this happens, the Worker will no longer try to render tasks for that job. This can be used to ensure a problematic render node doesn't cause jobs to fail if it's the only node generating errors. Note that Workers in this "bad" list can be viewed and removed in the Job Properties.
- Frequency at which a worker will attempt a job that it has been marked bad for: This is used with the previous setting above. It is the odds that a Deadline Worker will attempt a task from a job it has been marked as "bad" for if no other jobs are available. This can be useful if you want your render nodes to make occasional attempts at these jobs if there are no other jobs in the queue. If you don't want to make these occasional attempts, simply set this value to 0%.
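The last two settings can be pictured as a per-job "bad" list plus a dice roll. The Python sketch below is a simplified illustration under assumptions, not how Deadline implements it. BadWorkerTracker is a hypothetical class, the limit of 5 errors is an arbitrary example value, and retry_odds is the frequency expressed as a fraction between 0 and 1.

```python
import random


class BadWorkerTracker:
    """Tracks consecutive errors per (worker, job) pair (hypothetical)."""

    def __init__(self, bad_limit=5, retry_odds=0.0):
        self.bad_limit = bad_limit
        self.retry_odds = retry_odds
        self.consecutive_errors = {}
        self.bad = set()

    def record_result(self, worker, job, succeeded):
        key = (worker, job)
        if succeeded:
            # A success breaks the run of errors "in a row".
            self.consecutive_errors[key] = 0
            return
        self.consecutive_errors[key] = self.consecutive_errors.get(key, 0) + 1
        if self.consecutive_errors[key] >= self.bad_limit:
            self.bad.add(key)

    def may_attempt(self, worker, job, other_jobs_available, rng=random.random):
        """Decide whether a worker may pick up a task from this job."""
        if (worker, job) not in self.bad:
            return True
        if other_jobs_available:
            return False
        # Only roll the dice when nothing else is in the queue.
        return rng() < self.retry_odds


tracker = BadWorkerTracker(bad_limit=3, retry_odds=0.25)
for _ in range(3):
    tracker.record_result("node01", "job_a", succeeded=False)
print(("node01", "job_a") in tracker.bad)
print(tracker.may_attempt("node01", "job_a", other_jobs_available=True))
```

With retry_odds set to 0.0, the dice roll can never succeed, which matches setting the frequency to 0%.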
FAILURE DETECTION OVERRIDES
The Failure Detection settings in the Repository Options are global settings that apply to all jobs. However, some of these settings can be overridden on a per job basis. To see the settings you can override, right-click on a job in the Job List in the Monitor and select Modify Job Properties. Then select Failure Detection from the list on the left.
Here are the available settings that can be overridden on a per job basis:
- Override Job Error Limit: This overrides the Mark a job as failed after it has generated this many errors setting in the Repository Options.
- Override Task Error Limit: This overrides the Mark a task as failed after it has generated this many errors setting in the Repository Options.
- Send Warning Notification For Job Errors: This partially overrides the Send a warning after a Worker has generated this many errors for a job in a row setting in the Repository Options. It can be used to override whether or not a notification is sent, but not the number of errors required for the notification.
- Ignore Bad Worker Error Limit: This partially overrides the Mark a Worker as bad after it has generated this many errors for a job in a row setting in the Repository Options. It can be used to prevent Deadline Workers from being added to the job's bad list.
Finally, you can view and remove Deadline Workers from the job's "bad" list here.
Overriding these settings can be useful, especially if you're running test jobs or debugging problems. For example, if you're trying to solve a known error and you're using a job to troubleshoot, you don't want that job failing over and over again. In this case, you can override the job and task limits and set their values to 0. It also might be preferable to disable the warning notification for job errors, and ignore the bad Worker detection.
That basically covers everything you need to know about Deadline's error handling capabilities. We can never eliminate render errors completely, but at least Deadline makes it easy to live with them.
As mentioned earlier, if you discover an error that Deadline isn't detecting properly, please contact Thinkbox Support so we can investigate further!