Dispatching: Job Traversal and Command Launching


Introduction

The following discussion of Alfred dispatching concepts is intended as a guide to understanding the basic job traversal and command launching mechanism at the heart of Alfred, with specific attention to the user-selectable dispatching schemes. For details on creating hierarchical Alfred job scripts, see the scripting documentation.

The dispatcher is one of several Alfred components; it is the piece that actually launches and tracks the commands that make up a job. It also enforces launch sequences by keeping track of dependencies. Finally, the dispatcher is responsible for tracking and reporting errors.

 

The Idea of a Hierarchical Job

The diagram above illustrates the structure of a simple rendering job. The shot consists of several frames, and any actions to be taken at the shot level must wait for the actions at the frame level to complete first. In the simplest Alfred job, the frame-level action might be to invoke the "render" command on an appropriate RIB file, and there might be no action at all at the shot level or perhaps it just cleans up the RIB files.

Clearly, if all we wanted to do was render six RIB files, we could just launch the commands by hand or write a very simple, linear shell script to do so.

Alfred derives almost all of its complexity, and power, from addressing the following observations:

In the example above, if there were six rendering servers available on the network, we could compute six frames at once rather than in sequence. If there were only two remote renderers available, we could still do better than linear time by processing the frames in pairs.
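As a rough illustration of the arithmetic behind this observation, the sketch below counts how many rounds of rendering are needed for a given number of renderers. It assumes every frame takes the same time to render; the function names are hypothetical and are not part of Alfred.

```python
import math


def rounds_needed(num_frames: int, num_renderers: int) -> int:
    """Rounds of rendering required if every frame takes equally long."""
    if num_renderers < 1:
        raise ValueError("need at least one renderer")
    return math.ceil(num_frames / num_renderers)


def batches(frames, num_renderers):
    """Group frames into the sets that would be rendered together."""
    return [frames[i:i + num_renderers]
            for i in range(0, len(frames), num_renderers)]


frames = [1, 2, 3, 4, 5, 6]
print(rounds_needed(len(frames), 6))  # six renderers: one round
print(rounds_needed(len(frames), 2))  # two renderers: frames go in pairs
print(batches(frames, 2))
```

With two renderers the six frames are handled in three pairs, which is half the time of strictly linear processing under the equal-duration assumption.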

Terminology


Depth-First Traversal and Evaluation

[Diagrams: Task Search Order (left); Evaluation Order (Sequential) (right)]

Alfred searches its job tree in depth first order for commands which can be launched. The diagram on the left shows a simple hierarchy with the nodes numbered in depth-first traversal order; this is the sequence in which Task nodes are searched when looking for commands to launch.

The diagram on the right shows the evaluation order required to enforce the desired hierarchy of dependencies. The nodes are numbered assuming that only one can run at a time; this is strict Task Sequential execution (see below).

If tasks are allowed to execute in parallel, then the dependency constraints allow the order of evaluation indicated by the letters: all of the A nodes can execute at the same time, then the B nodes, and finally C, the root node. Parallel execution is the default, given enough available resources.
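The three orderings described here can be sketched in a few lines. The tree below is an assumption reconstructed from the text (a root C, two B tasks, and two A leaves under each B), the node names are hypothetical, and the search order is assumed to be a pre-order walk. This is not Alfred's implementation, only a way to see how the orders relate.

```python
# Hypothetical job tree: each key lists the subtasks it depends on.
TREE = {"C": ["B1", "B2"], "B1": ["A1", "A2"], "B2": ["A3", "A4"]}


def children(node):
    return TREE.get(node, [])


def search_order(node):
    """Pre-order walk: the order in which nodes are visited while searching."""
    order = [node]
    for child in children(node):
        order.extend(search_order(child))
    return order


def evaluation_order(node):
    """Post-order walk: every subtask is evaluated before its parent."""
    order = []
    for child in children(node):
        order.extend(evaluation_order(child))
    order.append(node)
    return order


def height(node):
    """Leaves have height 0; a parent is one above its tallest subtask."""
    kids = children(node)
    return 0 if not kids else 1 + max(height(k) for k in kids)


def parallel_waves(root):
    """Group nodes that could run at the same time, given unlimited resources."""
    waves = {}
    for node in evaluation_order(root):
        waves.setdefault(height(node), []).append(node)
    return [waves[h] for h in sorted(waves)]


print(search_order("C"))
print(evaluation_order("C"))
print(parallel_waves("C"))
```

The post-order walk corresponds to the strictly sequential numbering, and the waves correspond to the A, B, C lettering.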

Note: strictly speaking, Alfred only launches one command at a time; parallel execution occurs because it doesn't wait for the first task to finish before looking for another one to launch.

Alfred always searches for launchable tasks in depth-first order, and it launches the first available leaf node it can find. For example, assuming that all the nodes above represent approximately equal rendering tasks, and that we have enough processors to do two renderings simultaneously, then the execution order might be: 1A and 2A together, then 3B and 4A together, then 5A, then 6B, then 7C.

Actual launching order is very job-specific since it depends on how long each task really takes, the actual task dependencies and the number of processors that become available during the course of the job.
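The two-processor example above can be reproduced with a small simulation. This is a simplified sketch, not Alfred's scheduler: it assumes all tasks take exactly the same time (so work proceeds in lockstep rounds), a fixed number of processor slots, and a tree reconstructed from the node labels in the text. The function names are hypothetical.

```python
# Tree assumed from the labels in the text: 7C depends on 3B and 6B, and so on.
TREE = {"7C": ["3B", "6B"], "3B": ["1A", "2A"], "6B": ["4A", "5A"]}


def preorder(node):
    """Visit nodes in depth-first search order."""
    yield node
    for child in TREE.get(node, []):
        yield from preorder(child)


def simulate(root, slots):
    """Return the batches of tasks launched together in each round."""
    if slots < 1:
        raise ValueError("need at least one slot")
    done = set()
    rounds = []
    while root not in done:
        batch = []
        for node in preorder(root):
            if node in done:
                continue
            # A task is ready once all of its subtasks have completed.
            if all(child in done for child in TREE.get(node, [])):
                batch.append(node)
                if len(batch) == slots:
                    break
        rounds.append(batch)
        done.update(batch)
    return rounds


print(simulate("7C", 2))
print(simulate("7C", 1))
```

With real, unequal task durations the rounds would not line up this neatly, which is why the actual launching order varies from job to job.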

 


Dispatching Schemes

It is often the case that there are several jobs on the local dispatching queue at once. Alfred provides a choice of several dispatching schemes which determine the order in which tasks from different jobs are processed. The current dispatching mode is set as a per-user preference from the Session->Preferences menu. In general, Spill-Over mode will be the most efficient, and it is the default. At some sites, system administrators may want to enforce a uniform dispatching scheme across all users. See the notes on LOCKing user preferences in the alfred.ini file.

Process Launch and Tracking

During job traversal, Alfred determines which Tasks are ready to be processed. A task becomes ready to execute when all of its dependencies have been met, which means that all of its subtasks have completed successfully.

Each non-trivial Task contains Cmd descriptions which describe the actual applications to be launched. When a Cmd is found in a Ready Task, it may not be executed immediately, especially if it requires a scarce resource such as a remote rendering server. The dispatcher contacts the maitre-d to check out resources and acquire the final launch clearance.

When the required checks are cleared, the actual command launch and subsequent process tracking involves these steps:


Command Errors and Blocking

As just noted, when a launched command exits with a non-zero exit status, it is considered to have had an error. Referring again to the numbered task diagram above, if the command launched by task "3B" encounters an error, the task will stop executing and indicate the error; any diagnostic output generated by the command will also be made available.
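The exit-status convention can be illustrated with Python's standard subprocess module. This is a generic sketch of the convention only; the helper name `run_command` is hypothetical and says nothing about how Alfred itself launches processes.

```python
import subprocess
import sys


def run_command(argv):
    """Run a command to completion; return (ok, exit_status, captured_output)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    # A non-zero exit status is treated as an error.
    ok = result.returncode == 0
    return ok, result.returncode, result.stdout + result.stderr


# A stand-in command that prints a diagnostic and exits with status 3.
failing = [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"]
ok, status, output = run_command(failing)
print(ok, status, output.strip())
```

Capturing the output alongside the status mirrors the idea that a failed command's diagnostics remain available for inspection.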

Note that the entire job might not stop when an error occurs. If there are other ready tasks which do not depend on the stopped error task, then they can continue as usual. There is a toggle-switch on the Session->Preferences panel which controls whether this kind of parallel processing continues after an error occurs.

Eventually, all that will remain will be the task with the error and the parent tasks which depend on it. At this point the job won't be able to proceed any further, and it will enter the Error-Wait state. If the dispatching scheme is Spill-Over or Parallel then other jobs on the local queue will begin to run; if the scheme is Sequential then the entire job queue will also wait.
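One way to picture the Error-Wait condition is as a comparison between the tasks still outstanding and the chain of ancestors above the failed task. The sketch below assumes the tree from the numbered diagram as reconstructed from the text; the names `PARENT`, `blocked_by`, and `in_error_wait` are hypothetical.

```python
# Child -> parent links for the assumed tree (7C is the root).
PARENT = {"1A": "3B", "2A": "3B", "3B": "7C",
          "4A": "6B", "5A": "6B", "6B": "7C"}


def blocked_by(failed):
    """The failed task plus every ancestor that depends on it."""
    blocked = [failed]
    while blocked[-1] in PARENT:
        blocked.append(PARENT[blocked[-1]])
    return blocked


def in_error_wait(failed, done):
    """True when nothing is left to run except the blocked chain."""
    all_tasks = set(PARENT) | set(PARENT.values())
    remaining = all_tasks - set(done)
    return remaining == set(blocked_by(failed))


print(blocked_by("3B"))
print(in_error_wait("3B", {"1A", "2A"}))
print(in_error_wait("3B", {"1A", "2A", "4A", "5A", "6B"}))
```

While the unrelated branch under 6B still has work to do, the job keeps running; once that branch is finished, only 3B and 7C remain and the job can go no further.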

Tasks with errors can either be manually restarted or skipped. If automatic retries have been enabled on the Session->Preferences control panel, then the dispatcher will attempt to re-launch the command a fixed number of times.
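A fixed-count retry policy of the kind described might look roughly like the following. This is a sketch under stated assumptions: the `max_retries` parameter and the function name are hypothetical, and the source does not say how many retries Alfred performs or how they are spaced.

```python
import subprocess
import sys


def launch_with_retries(argv, max_retries):
    """Run argv; on a non-zero exit, re-launch up to max_retries more times.

    Returns (final_exit_status, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        status = subprocess.run(argv).returncode
        if status == 0 or attempts > max_retries:
            return status, attempts


# A stand-in command that always fails with exit status 1.
always_fails = [sys.executable, "-c", "raise SystemExit(1)"]
print(launch_with_retries(always_fails, max_retries=2))
```

If the command still fails after the last retry, the non-zero status is returned so that the task can be left for manual restart or skipping.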


Deleting Jobs, and the Dispatcher Lifetime

Jobs can be deleted simply by clicking on their entry in the Job Queue window and pressing the Backspace or Delete keys, or selecting Delete Job from the menu.

When currently active jobs are deleted, all of their active tasks are terminated and clean-up commands along DAG paths containing active tasks are executed. Also, by default, a dialog appears asking to confirm that you really want to delete an active job; this dialog can be disabled if desired.

There is also an automatic job deletion preference setting which will automatically delete done jobs or maintain a short list of the most recently completed jobs while deleting older ones.
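The retention behavior described above can be sketched as a simple filter. The job records, field names, and the `keep_recent` parameter are all hypothetical; the actual preference setting and its storage format are not described here.

```python
def prune_done_jobs(jobs, keep_recent):
    """Drop done jobs except the keep_recent most recently finished ones.

    jobs: list of dicts with 'name', 'state', and 'finished' (a timestamp,
    or None for jobs that are not done). Jobs that are not done are kept.
    """
    done = sorted((j for j in jobs if j["state"] == "done"),
                  key=lambda j: j["finished"], reverse=True)
    keep = {j["name"] for j in done[:keep_recent]}
    return [j for j in jobs if j["state"] != "done" or j["name"] in keep]


queue = [
    {"name": "shot01", "state": "done", "finished": 100},
    {"name": "shot02", "state": "active", "finished": None},
    {"name": "shot03", "state": "done", "finished": 300},
    {"name": "shot04", "state": "done", "finished": 200},
]
print([j["name"] for j in prune_done_jobs(queue, keep_recent=2)])
print([j["name"] for j in prune_done_jobs(queue, keep_recent=0)])
```

Setting `keep_recent` to zero corresponds to deleting every done job, while a small positive value keeps a short list of the latest completions.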

When a job is deleted, all output logs and status information are also deleted, so it is sometimes useful to retain jobs until you've had a chance to look for unusual output. Jobs can be browsed from either the Job Detail Window or the web interface, both of which rely on a running dispatcher to supply status information.

By default, the dispatcher doesn't exit until all jobs have been deleted from the Job Queue. This conservative approach ensures that job status information is available for done and active jobs until a person chooses to delete the jobs. The dispatcher will therefore continue to run, even when all the user interfaces have been closed. Alternatively, there is a preference setting which specifies that the dispatcher should delete all done jobs when the last UI closes, so that it can shut down as well.

 

Pixar Animation Studios
(510) 752-3000 (voice)   (510) 752-3151 (fax)
Copyright © 1996- Pixar. All rights reserved.
RenderMan® is a registered trademark of Pixar.