Alfred: The Automaton's Automaton

The Big Picture

Alfred is a work distribution system that manages a hierarchy of parallel client applications connected to remote servers. The Alfred components are general-purpose, but they are especially well suited to managing network-distributed rendering in the context of the RenderMan Artist Tools.

In operation, Alfred traverses a job script called a worklist, which defines a tree-structured hierarchy of dependent tasks. Each task node in the tree contains commands that are executed on the local client host, possibly using associated remote servers. Multiple jobs may be queued up and performed in succession or in parallel. These terms are defined in more detail elsewhere.
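The traversal order can be illustrated with a short Python sketch. This is not Alfred's implementation or its worklist syntax; the Task class, the execution_waves function, and the task titles are hypothetical names chosen for illustration. The sketch assumes that a task may start only after all of its subtasks have finished, and that tasks with no dependency on each other may run at the same time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    """A node in a hypothetical worklist tree."""
    title: str
    subtasks: list[Task] = field(default_factory=list)


def execution_waves(root: Task) -> list[list[str]]:
    """Group task titles into waves; tasks in one wave could run in parallel.

    A task's wave is one more than the deepest wave among its subtasks,
    so every subtask finishes before its parent starts.
    """
    waves: dict[int, list[str]] = {}

    def visit(task: Task) -> int:
        level = 1 + max((visit(child) for child in task.subtasks), default=-1)
        waves.setdefault(level, []).append(task.title)
        return level

    visit(root)
    return [waves[i] for i in sorted(waves)]


# Hypothetical job: three maps must exist before the frame is rendered.
job = Task("frame 1", [
    Task("shadow map"),
    Task("environment map"),
    Task("texture conversion"),
])
print(execution_waves(job))
# [['shadow map', 'environment map', 'texture conversion'], ['frame 1']]
```

The three map tasks land in the first wave because none depends on another; the frame waits for all of them.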

Alfred functions as a system of interconnected components:

monitor: The monitor is the user interface. It displays, and provides user control over, the current state of the job queue for a particular dispatcher.
dispatcher: The dispatcher is the job queue manager, doing the actual work of reading the job scripts and launching the individual commands. The dispatcher executes tasks in parallel when the worklist provides for it and sufficient resources are available. Typically there are several dispatchers on the network, each managing a user's job queue and local clienting load. Dispatchers negotiate with the maitre-d to acquire remote servers.
maitre-d: The maitre-d is a centralized arbitrator for server allocation requests from all the dispatchers on the network. It operates from a master schedule that lists the available network services, such as alfservers, and describes which users have permission to use them.
nimby: Alfred's nimby mode (“nimby” is an acronym for “Not In My Back Yard”) is a small desktop utility that communicates with the maitre-d and blocks remote work from being dispatched to systems with interactive users. It does allow remote work when the system is idle, such as during screen-saver periods.

Motivation

Alfred was developed to handle the distributed computing requirements of the Pixar RenderMan Artist Tools. In particular, MTOR uses Alfred to manage the rendering of complex scenes. This often involves a complex hierarchy of dependent operations, such as generating texture, shadow, and environment maps, as well as the final frame rendering. Typically these rendering jobs will also benefit from parallel processing, especially for multi-frame sequences. One of Alfred's most important features is its ability to detect parallelism opportunities and exploit them to take advantage of "render farms" and other network resources. Furthermore, Alfred coordinates access to these resources among many users and enforces locally defined access policies and priorities.

One of the primary design goals was to address the problem of coordinating shared access to remote rendering servers for a group of artists in a typical production environment. Their typical design-preview-redesign cycles led to a system that emphasizes independent, point-of-origin clienting, rather than a centralized job processing queue. Local clienting also has important practical benefits when interoperating with a system such as Maya®, which relies on per-user directory structures, permissions, and, frequently, local files. Local dispatching also helps to keep jobs under their owner's control, and to keep the sometimes significant clienting load from throttling a central queuing machine. Local control also allows preview images to be rendered directly to the local framebuffer, even when utilizing one or more remote servers (via netrender).

The distributed dispatchers coordinate their use of shared servers through a single check-out request broker, called the maitre-d. This daemon keeps track of which servers are available and also enforces user access and priority policies. Dispatchers can also function autonomously, with no maitre-d and minimal set-up requirements, although the mechanism for resolving server competition is less efficient.

Another important design goal was to support the on-demand RIB generation capability of MTOR. This includes delaying RIB requests until remote servers are available, making multiple RIB requests of a generator that is launched once at the beginning of the job, cleaning up input files associated with completed frames, and dynamically adding subtrees to the running job script.

The resulting “distributed dispatching” system is therefore appropriate for the typical studio environment, which has a mixture of interactive and batched jobs competing for the same resources. There is a default emphasis on “fairness” and maximum server utilization. The scheduling system also provides straightforward mechanisms for favoring specific groups of users or servers. Sites that are currently using a centralized spooling system, which maintains a single job queue shared by all users, will probably find that this autonomous dispatching/clienting model takes some getting used to, but that it can also support the useful aspects of centralized administration and status monitoring.


Important Concepts

Here are some points to keep in mind when working with Alfred:

Worklists are typically written by applications.

There is one dispatcher per user, per spooling host.

There may be multiple monitors per dispatcher.

Dispatchers persist in the background.

The Master Schedule File is an important resource.

The maitre_d provides centralized arbitration of server requests.

The dispatcher saves job checkpoint files.

The worklist concept is very general.

Worklist commands may be messages to live apps.

The worklist language is straightforward.

An Alfred-compliant app is easy to create.


Worklists are typically written by applications.

Alfred is most useful when it manages the execution of a long or complex sequence of dependent events. Typically the input scripts are generated by an application program that is creating such sequences; these applications then invoke Alfred directly. There is also a special case for handling a set of existing RIB files, for which you can simply issue the command:

      alfred  file1.rib  file2.rib ...

This will automatically create a simple job which netrenders each file, making use of parallel servers when possible. There is also a mechanism for automatically creating processing hierarchies based on a file name structuring convention (see the man page notes).


There is one dispatcher per user, per spooling host.

That is, each user who spools a job owns a dispatcher that executes that job. There is only one dispatcher per user (per clienting host): any additional spooled jobs will be queued by the original dispatcher and processed in the order received. The first invocation of Alfred will become a dispatcher. If it is still running when subsequent Alfreds are started, they simply add their jobs to the queue of the original dispatcher. The user interface shows the list of queued jobs and provides a way to reorder it. This design has some useful consequences:

Correct permissions. The dispatcher is running as the user and consequently has their permissions; hence, file access permissions for resources such as texture files and models don't need to be any different for Alfred than for the application that is launching it.
Queue control by users. Users maintain control over their own queue of jobs throughout the execution process and can reorder or delete them at any time. Priorities on servers are determined by a centralized maitre_d process or shared schedule file.
Decentralized parallel clienting. There is no centralized processing bottleneck since each dispatcher does its own clienting. This can be important if intensive operations such as model loading and traversal, texture and RIB generation, final frame compositing, compression, or final archiving are performed by the client host.
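The "first invocation becomes the dispatcher, later ones add to its queue" behavior can be modeled in a few lines of Python. This is only a conceptual sketch: the Dispatcher class, the spool function, the in-memory registry, and the user, host, and job names are all hypothetical, and the real system coordinates separate processes rather than objects in one program.

```python
from collections import deque


class Dispatcher:
    """Hypothetical stand-in for one user's dispatcher on one host."""

    def __init__(self, user, host):
        self.user = user
        self.host = host
        self.queue = deque()  # jobs are processed in the order received


# Hypothetical registry: at most one dispatcher per (user, host) pair.
_dispatchers = {}


def spool(user, host, job):
    """Queue a job, creating the dispatcher only on the first invocation."""
    key = (user, host)
    dispatcher = _dispatchers.get(key)
    if dispatcher is None:
        dispatcher = Dispatcher(user, host)
        _dispatchers[key] = dispatcher
    dispatcher.queue.append(job)
    return dispatcher


first = spool("ann", "ws01", "shot_010")
second = spool("ann", "ws01", "shot_020")  # reuses ann's dispatcher
other = spool("bob", "ws01", "lighting_test")  # a different user gets their own
print(list(first.queue))
```

Because each queue belongs to exactly one user, reordering or deleting jobs never touches anyone else's work.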


There may be multiple monitors per dispatcher.

Several users on different workstations may monitor the progress of a particular dispatcher. Monitors and dispatchers communicate over TCP/IP (Internet) sockets, and a dispatcher broadcasts status updates to any connected monitors. Each monitor generates its UI graphics locally (using Tk), so the dispatcher sends only compact status messages to each monitor. An authentication scheme (similar to rlogin) is used to determine whether a monitor is view-only or if it may modify the dispatching queue. There is also support for using web browsers (HTTP clients) to get Alfred job status.


Dispatchers persist in the background.

While there are jobs to execute, the dispatcher will continue to run. Monitors may disconnect and reconnect at any time. When a job is spooled to Alfred, a dispatcher process is forked (if one doesn't exist), and it persists until all queued jobs are done and removed from the queue.


The Master Schedule File is an important resource.

The schedule provides Alfred with the names and types of all remote servers, such as alfserver hosts. It also specifies which users have access to each server, as well as any time-of-day access restrictions. This is the primary resource management mechanism controlling Alfred. For example, a project coordinator might allocate certain alfservers to separate scenes of a production, giving more hosts to higher priority work.
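To make the kind of information held in a schedule concrete, here is a small Python model. It does not reproduce the actual schedule file format; the ScheduleEntry fields, the available_servers function, and the host and user names are assumptions made for illustration, including the choice to express time-of-day restrictions as a single hour window.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleEntry:
    """One hypothetical schedule row: a server, who may use it, and when."""
    server: str
    service: str            # type of server, e.g. "alfserver"
    users: frozenset[str]   # users permitted to check out this server
    start_hour: int = 0     # access window start, inclusive
    end_hour: int = 24      # access window end, exclusive

    def allows(self, user: str, hour: int) -> bool:
        if user not in self.users:
            return False
        if self.start_hour <= self.end_hour:
            return self.start_hour <= hour < self.end_hour
        # The window wraps past midnight, e.g. 20:00 to 06:00.
        return hour >= self.start_hour or hour < self.end_hour


def available_servers(schedule: list[ScheduleEntry], user: str, hour: int) -> list[str]:
    """Names of the servers this user may request at the given hour."""
    return [entry.server for entry in schedule if entry.allows(user, hour)]


schedule = [
    ScheduleEntry("farm01", "alfserver", frozenset({"ann", "bob"})),
    # A desktop machine that only accepts remote work overnight.
    ScheduleEntry("desk07", "alfserver", frozenset({"ann"}), start_hour=20, end_hour=6),
]
print(available_servers(schedule, "ann", 23))  # ['farm01', 'desk07']
print(available_servers(schedule, "ann", 12))  # ['farm01']
```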


The maitre_d: centralized arbitration of server requests.

By default, each running Alfred (invoked as a dispatcher) will choose its remote servers from the master schedule file. It is also possible to centralize this server allocation function on a network, so that multiple dispatchers make requests of a single broker. You do this by appropriately configuring the Alfred initialization file and starting a single Alfred in maitre_d mode.

The maitre-d is only needed in environments where multiple dispatchers are competing for the same remote servers (e.g., when several users have permission to access the same alfserver at the same time). However, even without the maitre_d, independent “chaos mode” Alfreds will still function. They will just race each other for resources, which adds a little overhead, and the job priority scheme won't function: every dispatcher would appear to have the same priority. See the scheduling reference for more information.
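The difference a central broker makes can be sketched as follows. This is a toy model rather than the maitre_d protocol: the Broker class, its method names, and the convention that a larger number means a higher priority are all assumptions. It shows one broker handing each free server to the highest-priority waiting dispatcher, which is the ordering that independent dispatchers racing each other cannot provide.

```python
class Broker:
    """Hypothetical central check-out broker for a pool of servers."""

    def __init__(self, servers):
        self.free = list(servers)
        self.waiting = []    # entries are (-priority, arrival order, dispatcher)
        self._arrivals = 0

    def request(self, dispatcher, priority):
        """Register a dispatcher's request; larger priority is served first."""
        self.waiting.append((-priority, self._arrivals, dispatcher))
        self._arrivals += 1

    def allocate(self):
        """Grant free servers to waiting dispatchers, highest priority first."""
        grants = []
        self.waiting.sort()  # ties fall back to arrival order
        while self.free and self.waiting:
            _, _, dispatcher = self.waiting.pop(0)
            grants.append((dispatcher, self.free.pop(0)))
        return grants

    def check_in(self, server):
        """Return a server to the free pool when a dispatcher is done with it."""
        self.free.append(server)


broker = Broker(["farm01"])
broker.request("ann", priority=1)
broker.request("bob", priority=5)
first_round = broker.allocate()   # bob outranks ann, so bob gets farm01
broker.check_in("farm01")
second_round = broker.allocate()  # ann is served once the server is free again
print(first_round, second_round)
```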


The dispatcher saves job checkpoint files.

As job tasks complete, the dispatcher writes a checkpoint version of the originally spooled worklist into the spooling directory. These files can be used to restart a partially completed job from the state of the last checkpoint (if, for example, the dispatcher is killed or its host crashes). Every time a dispatcher starts it checks for unfinished jobs.
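The checkpoint-and-restart idea can be sketched in Python. Alfred writes a checkpoint version of the worklist itself; this sketch substitutes a small JSON file as a stand-in, and the function names, file name, and task titles are hypothetical. It assumes a restart should simply skip the tasks recorded as done.

```python
import json
import os
import tempfile


def save_checkpoint(path, tasks, done):
    """Record which tasks have completed, replacing any earlier checkpoint."""
    state = {"tasks": tasks, "done": sorted(done)}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # Rename into place so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def remaining_tasks(path):
    """On restart, return the tasks that were not finished at the last checkpoint."""
    with open(path) as f:
        state = json.load(f)
    done = set(state["done"])
    return [task for task in state["tasks"] if task not in done]


# Hypothetical spool directory and job.
spool_dir = tempfile.mkdtemp()
checkpoint = os.path.join(spool_dir, "job1.checkpoint.json")
tasks = ["shadow map", "frame 1", "frame 2"]

save_checkpoint(checkpoint, tasks, {"shadow map"})
print(remaining_tasks(checkpoint))  # ['frame 1', 'frame 2']
```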


The worklist concept is very general.

Locally, the worklist is a script. This means that RIB files can be generated just before launching the netrender that actually needs them, and completed RIB files can be removed afterwards. Globally, the worklist is like a makefile or other dependency-based scheme. The dispatcher's major function is to launch programs in a prescribed sequence, taking advantage of parallelism where possible. The actual programs to be launched are named in the worklist and can be essentially arbitrary executables. Alfred does have some built-in dispatching heuristics that have special knowledge of netrender and alfservers, but these are optimizations for rendering and don't affect other types of commands. Alfred also assumes that an executable exits with a non-zero return code if (and only if) an error occurs; not every program follows this convention, so a failing command that exits with zero will be treated as successful.
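The return-code convention, and its blind spot, can be demonstrated with Python's standard subprocess module. This is a generic sketch, not Alfred code: run_command is a hypothetical helper, and the child commands are small Python one-liners standing in for real executables.

```python
import subprocess
import sys


def run_command(argv):
    """Run one command and report success using only its exit status."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode == 0


# Exits with status 0: treated as success.
ok = run_command([sys.executable, "-c", "raise SystemExit(0)"])

# Exits with a non-zero status: treated as an error.
bad = run_command([sys.executable, "-c", "raise SystemExit(3)"])

# Reports a problem on stderr but still exits with 0. The exit status alone
# cannot reveal this, so the command is treated as successful.
silent = run_command(
    [sys.executable, "-c", "import sys; sys.stderr.write('render failed')"]
)

print(ok, bad, silent)  # True False True
```

Programs that misreport their status in this way usually need a small wrapper that inspects their output and exits with an appropriate code.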


Worklist commands may be messages to live apps.

Alfred supports the concept of live, or persistent, programs that act on messages sent to them over the course of the job.


The worklist language is straightforward.

Scripts consist of (nested) calls to a small number of operators (which define the programs to execute), and an ordering scheme (typically a hierarchy of parallel dependencies). See script syntax for more information.


An Alfred-compliant app is easy to create.

To create and spool Alfred jobs, an application creates a worklist script file and then launches "alfred filename", which automatically starts a dispatcher and monitor (if they aren't running already) and spools the job.

Applications that are launched from within Alfred scripts are typically invoked as they would be from the shell command line. In addition to simple one-time commands that process their argument list, it is also possible to use stdio to send command messages to launched apps. See the compliant application notes for more details.
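Sending command messages over stdio to an application that stays running can be sketched with Python's subprocess module. This illustrates the general technique only, not Alfred's message protocol: the child program, the one-message-per-line format, and the "done" acknowledgement are all assumptions.

```python
import subprocess
import sys

# Stand-in for a persistent application: it reads one command per line on
# stdin and acknowledges each one on stdout until stdin is closed.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('done ' + line.strip(), flush=True)\n"
)

app = subprocess.Popen(
    [sys.executable, "-c", CHILD_SOURCE],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)


def send(message):
    """Send one command line to the running app and wait for its reply."""
    app.stdin.write(message + "\n")
    app.stdin.flush()
    return app.stdout.readline().strip()


# The app is launched once and then handles several requests.
replies = [send("frame 1"), send("frame 2")]

app.stdin.close()       # end of input tells the app to exit
exit_code = app.wait()
app.stdout.close()
print(replies, exit_code)
```

Launching the application once and reusing it avoids paying its start-up cost for every request, which matters when start-up involves loading a large scene.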


Various system resource definitions and behavior-tuning parameters are defined in simple configuration scripts, which can be overridden on a per-host and per-user basis. Temporary files are created in a spool directory, and they persist until the job is removed. See the manual page for the location of these files.

We are sometimes asked about the meaning of the name "Alfred". While the origin is shrouded in mystery, it has been said that Alfred is your conscientious task master, an automaton's automaton, making sure that things get done in an orderly and timely fashion, freeing you to attend philanthropic dinner parties and wear rubber suits.
