Alfred is a work distribution system that manages a hierarchy of parallel
client applications connected to remote servers. The Alfred components are general purpose, but they are especially well suited to managing network-distributed rendering in the context of the
RenderMan Artist Tools.
In operation, Alfred traverses a job script called a worklist,
which defines a tree-structured hierarchy of dependent tasks. Each
task node in the tree contains commands that are executed on the
local client host, possibly using associated remote servers.
Multiple jobs may be queued up and performed in succession or in parallel.
These terms are defined in more detail elsewhere.
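As a rough illustration of the traversal described above, the following Python sketch models a worklist as a tree in which a task becomes ready only after all of its subtasks have finished. The names Task and ready_tasks are hypothetical and are not part of Alfred; the sketch says nothing about the real worklist syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical stand-in for a worklist task node (not Alfred's own type)."""
    title: str
    subtasks: List["Task"] = field(default_factory=list)
    done: bool = False


def ready_tasks(task):
    """Return the unfinished tasks whose subtasks have all completed.

    Every task in the returned list is independent of the others,
    so all of them could be launched in parallel.
    """
    if task.done:
        return []
    pending = [t for sub in task.subtasks for t in ready_tasks(sub)]
    # An empty list means every subtask is done, so this task is ready itself.
    return pending if pending else [task]


shadow = Task("shadow map")
texture = Task("texture map")
frame = Task("frame 1", [shadow, texture])
job = Task("job", [frame])

print([t.title for t in ready_tasks(job)])  # ['shadow map', 'texture map']
shadow.done = texture.done = True
print([t.title for t in ready_tasks(job)])  # ['frame 1']
```

The two map tasks are reported together because neither depends on the other; the frame only becomes ready once both are marked done.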
Alfred functions as a system of interconnected components:
- monitor
  The monitor is the user interface. It displays, and provides user control
  over, the current state of the job queue for a particular dispatcher.
- dispatcher
  The dispatcher is the job queue manager, doing the actual work of reading
  the job scripts and launching the individual commands. The dispatcher executes
  tasks in parallel when the worklist provides for it and sufficient resources
  are available. Typically there are several dispatchers on the network,
  each managing a user's job queue and local clienting load. Dispatchers
  negotiate with the maitre-d to acquire remote servers.
- maitre-d
  The maitre-d is a centralized arbitrator for server allocation requests
  from all the dispatchers on the network. It operates from a master schedule
  that lists the available network services, such as alfservers,
  and describes which users have permission to use them.
- nimby
  Alfred in nimby mode (nimby is an acronym for Not In My Back Yard) is a small
  desktop utility that communicates with the maitre-d and blocks remote work from
  being dispatched to systems with interactive users. It does allow remote work
  when the system is idle, such as during screen-saver periods.
Alfred was developed to handle the distributed computing
requirements of the Pixar RenderMan Artist Tools. In particular, MTOR uses Alfred to manage the rendering of
complex scenes. This often involves a complex hierarchy of dependent
operations, such as generating texture, shadow, and environment maps, as
well as the final frame rendering. Typically these rendering jobs will
also benefit from parallel processing, especially for multi-frame
sequences. One of Alfred's most important features is its ability to
detect parallelism opportunities and exploit them to take advantage
of "render farms" and other network resources. Furthermore,
Alfred coordinates access to these resources among many users and enforces
locally defined access policies and priorities.
One of the primary design goals was to address the problem
of coordinating shared access to remote rendering servers for a group of
artists in a typical production environment.
Their typical design-preview-redesign
cycles led to a system that emphasizes independent, point-of-origin clienting,
rather than a centralized job processing queue. Local clienting also has
important practical benefits when interoperating with a system such as
Maya®, which relies on per-user directory structures, permissions,
and, frequently, local files. Local dispatching also helps to keep jobs
under their owner's control, and to keep the sometimes significant clienting
load from throttling a central queuing machine. Local control also allows
preview images to be rendered directly to the local framebuffer, even when
utilizing one or more remote servers (via netrender).
The distributed dispatchers coordinate their use of shared
servers through a single check-out request broker, called the maitre-d.
This daemon keeps track of which servers are available and also enforces
user access and priority policies. Dispatchers can also function autonomously,
with no maitre-d and minimal set-up requirements, although the mechanism
for resolving server competition is less efficient.
Another important design goal was to support the on-demand
RIB generation capability of MTOR. This includes delaying RIB requests
until remote servers are available, making multiple RIB requests of a generator
that is launched once at the beginning of the job, cleaning up input files
associated with completed frames, and dynamically adding subtrees to the
running job script.
The resulting distributed dispatching system is therefore
appropriate for the typical studio environment, which has a mixture of interactive
and batched jobs competing for the same resources. There is a default emphasis
on fairness and maximum server utilization. The scheduling system also
provides straightforward mechanisms for favoring specific groups of users
or servers. Sites that are currently using a centralized spooling system,
which maintains a single job queue shared by all users, will probably find
that this autonomous dispatching/clienting model takes some getting used
to, but that it can also support the useful aspects of centralized administration
and status monitoring.
Here are some points to keep in mind when working with Alfred:
- Worklists are typically written by applications.
- There is one dispatcher per user, per spooling host.
- There may be multiple monitors per dispatcher.
- Dispatchers persist in the background.
- The Master Schedule File is an important resource.
- The maitre_d provides centralized arbitration of server requests.
- The dispatcher saves job checkpoint files.
- The worklist concept is very general.
- Worklist commands may be messages to live apps.
- The worklist language is straightforward.
- An Alfred-compliant app is easy to create.
Worklists are typically written by applications.
Alfred is most useful when it manages the execution of a long or complex
sequence of dependent events. Typically the input scripts are generated
by an application program that is creating such sequences; these applications
then invoke Alfred directly. There is also a special case for handling a set
of existing RIB files; you can simply issue the command:
alfred file1.rib file2.rib ...
This will automatically create a simple job which netrenders each
file, making use of parallel servers when possible. There is also a mechanism
for automatically creating processing hierarchies based on a file name
structuring convention (see the man page notes).
There is one dispatcher per user, per spooling host.
That is, each user who spools a job owns a dispatcher that executes that
job. There is only one dispatcher per user (per clienting host): any additional
spooled jobs will be queued by the original dispatcher and processed in
the order received. The first invocation of Alfred will become a dispatcher.
If it is still running when subsequent Alfreds are started, they simply
add their jobs to the queue of the original dispatcher. The user interface
shows the list of queued jobs and provides a way to reorder it. This design
has some useful consequences:
- Correct permissions. The dispatcher is running as the user and
  consequently has their permissions; hence, file access permissions for resources
  such as texture files and models don't need to be any different for Alfred
  than for the application that is launching it.
- Queue control by users. Users maintain control over their own queue
  of jobs throughout the execution process and can reorder or delete them
  at any time. Priorities on servers are determined by a centralized maitre_d
  process or shared schedule file.
- Decentralized parallel clienting. There is no centralized processing
  bottleneck since each dispatcher does its own clienting. This can be important
  if intensive operations such as model loading and traversal, texture and
  RIB generation, final frame compositing, compression, or final archiving
  are performed by the client host.
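The one-dispatcher-per-user rule can be pictured with a small Python sketch. This is only an in-memory model under assumed names (Dispatcher, spool and the registry dictionary are all hypothetical); the real dispatcher is a separate process.

```python
class Dispatcher:
    """Hypothetical model of a per-user, per-host job queue."""

    def __init__(self, user, host):
        self.user = user
        self.host = host
        self.queue = []  # jobs in the order they were received


# Registry keyed by (user, host): at most one dispatcher per pair.
_dispatchers = {}


def spool(user, host, job):
    """Queue a job, creating the dispatcher only on the first invocation."""
    key = (user, host)
    if key not in _dispatchers:
        _dispatchers[key] = Dispatcher(user, host)
    dispatcher = _dispatchers[key]
    dispatcher.queue.append(job)
    return dispatcher


first = spool("ann", "ws1", "shot_010")
second = spool("ann", "ws1", "shot_020")
other = spool("bob", "ws1", "shot_030")

print(first is second)  # True: the later job joins the existing queue
print(first.queue)      # ['shot_010', 'shot_020']
print(other is first)   # False: a different user gets a separate dispatcher
```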
There may be multiple monitors per dispatcher.
Several users on different workstations may monitor the progress of a particular
dispatcher. Monitors and dispatchers communicate over TCP/IP (Internet)
sockets, and a dispatcher broadcasts status updates to any connected monitors.
Each monitor generates its UI graphics locally (using Tk), so the dispatcher
sends only compact status messages to each monitor. An authentication scheme
(similar to rlogin) is used to determine whether a monitor is view-only
or if it may modify the dispatching queue. There is also support for using
web browsers (HTTP clients) to get Alfred job status.
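The relationship between one dispatcher and several monitors resembles a simple publish/subscribe arrangement. The Python sketch below is an in-process model only: the class names, the message fields and the can_modify flag are assumptions made for illustration, and no sockets or real authentication are involved.

```python
class Monitor:
    """Hypothetical monitor that records the status messages it receives."""

    def __init__(self, name, can_modify=False):
        self.name = name
        self.can_modify = can_modify  # False models a view-only monitor
        self.received = []

    def update(self, message):
        self.received.append(message)


class Dispatcher:
    """Hypothetical dispatcher that broadcasts compact status updates."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.monitors = []

    def connect(self, monitor):
        self.monitors.append(monitor)

    def broadcast(self, job, state):
        message = {"job": job, "state": state}  # small message, no graphics
        for monitor in self.monitors:
            monitor.update(message)

    def move_to_front(self, monitor, job):
        if not monitor.can_modify:
            raise PermissionError(monitor.name + " is view-only")
        self.jobs.remove(job)
        self.jobs.insert(0, job)
        self.broadcast(job, "reordered")


dispatcher = Dispatcher(["job_a", "job_b"])
owner = Monitor("owner", can_modify=True)
viewer = Monitor("viewer")
dispatcher.connect(owner)
dispatcher.connect(viewer)
dispatcher.move_to_front(owner, "job_b")
print(dispatcher.jobs)   # ['job_b', 'job_a']
print(viewer.received)   # [{'job': 'job_b', 'state': 'reordered'}]
```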
Dispatchers persist in the background.
While there are jobs to execute, the dispatcher will continue to run. Monitors
may disconnect and reconnect at any time. When a job is spooled to
Alfred, a dispatcher process is forked (if one doesn't exist), and it persists
until all queued jobs are done and removed from the queue.
The Master Schedule File is an important resource.
The schedule provides Alfred with the names and types of all remote servers,
such as alfserver hosts. It also specifies which users have access to each
server, as well as any time-of-day access restrictions. This is the primary
resource management mechanism controlling Alfred. For example, a project
coordinator might allocate certain alfservers to separate scenes of a production,
giving more hosts to higher priority work.
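To make the kind of information held in the schedule concrete, here is a minimal Python sketch of a schedule entry and an access check. It does not reproduce the actual Master Schedule File format; ServerEntry, may_use, the host name and the hour range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerEntry:
    """Hypothetical schedule entry: one remote server and its access rules."""
    host: str
    service: str                 # the type of server, e.g. "alfserver"
    users: frozenset             # users permitted to use this server
    hours: range = range(0, 24)  # hours of the day when access is allowed


def may_use(entry, user, hour):
    """True if the user may use the server at the given hour (0-23)."""
    return user in entry.users and hour in entry.hours


# An assumed example: a server reserved for one user during the evening.
night_only = ServerEntry("farm01", "alfserver", frozenset({"ann"}), range(19, 24))

print(may_use(night_only, "ann", 20))  # True
print(may_use(night_only, "ann", 9))   # False: outside the allowed hours
print(may_use(night_only, "bob", 20))  # False: user not listed
```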
The maitre_d: centralized arbitration of server requests.
By default, each running Alfred (invoked as a dispatcher) will choose its
remote servers from the master schedule file. It is also possible to centralize
this server allocation function on a network, so that multiple dispatchers
make requests of a single broker. You do this by appropriately configuring
the Alfred initialization file and starting a single Alfred in maitre_d
mode.
The maitre-d is only needed in environments where multiple dispatchers
are competing for the same remote servers (e.g., when several users have
permission to access the same alfserver at the same time). However, even
without the maitre_d, independent "chaos mode" Alfreds will still function.
They will simply race each other for resources, which adds a little overhead,
and the job priority scheme won't function: every dispatcher will appear
to have the same priority. See the scheduling reference for more information.
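The benefit of a single broker can be sketched in a few lines of Python: because all requests pass through one place, the free servers can be handed out in priority order rather than to whichever dispatcher happens to ask first. The Broker class and its methods are hypothetical and do not describe the maitre-d's actual protocol.

```python
import heapq


class Broker:
    """Hypothetical central arbitrator for server allocation requests."""

    def __init__(self, servers):
        self.free = list(servers)
        self.waiting = []  # heap of (-priority, arrival order, dispatcher)
        self.arrivals = 0

    def request(self, dispatcher, priority):
        # Negate the priority so the highest priority is popped first;
        # the arrival counter keeps equal priorities in first-come order.
        heapq.heappush(self.waiting, (-priority, self.arrivals, dispatcher))
        self.arrivals += 1

    def allocate(self):
        """Grant free servers to waiting dispatchers, best priority first."""
        grants = []
        while self.free and self.waiting:
            _, _, dispatcher = heapq.heappop(self.waiting)
            grants.append((dispatcher, self.free.pop(0)))
        return grants


broker = Broker(["server1"])
broker.request("low_priority_dispatcher", priority=1)
broker.request("high_priority_dispatcher", priority=5)
print(broker.allocate())  # [('high_priority_dispatcher', 'server1')]
```

Without such a broker there is no shared view of the waiting requests, which is why priorities cannot be honored when dispatchers run autonomously.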
The dispatcher saves job checkpoint files.
As job tasks complete, the dispatcher writes a checkpoint version of the
originally spooled worklist into the spooling directory. These files can
be used to restart a partially completed job from the state of the last
checkpoint (if, for example, the dispatcher is killed or its host crashes).
Every time a dispatcher starts it checks for unfinished jobs.
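The checkpoint idea can be illustrated with a short Python sketch. Alfred's checkpoint is a version of the spooled worklist; the JSON file used here, along with the function names and task names, is purely an assumption chosen to keep the example self-contained.

```python
import json
import os
import tempfile


def save_checkpoint(path, task_states):
    """Write task states to a temporary file, then rename it into place,
    so a crash mid-write never leaves a truncated checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(task_states, f)
    os.replace(tmp_path, path)


def unfinished_tasks(path):
    """Return the tasks still to be run, or an empty list if no checkpoint."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        task_states = json.load(f)
    return [name for name, done in task_states.items() if not done]


spool_dir = tempfile.mkdtemp()
checkpoint = os.path.join(spool_dir, "job1.checkpoint")

# Record progress as tasks complete; a restarted dispatcher reads it back.
save_checkpoint(checkpoint, {"shadow map": True, "frame 1": False})
print(unfinished_tasks(checkpoint))  # ['frame 1']
```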
The worklist concept is very general.
Locally the worklist is a script. This means that RIB files can be generated
just before launching the netrender that actually needs them and completed
RIB files can be removed afterwards. Globally the worklist is like a makefile
or other dependency-based scheme. The dispatcher's major function is to
launch programs in a prescribed sequence taking advantage of parallelism
where possible. The actual programs to be launched are named in the worklist
and can essentially be arbitrary executables. Alfred does have some built-in
dispatching heuristics that have special knowledge of netrender and alfservers,
but these are optimizations for rendering and don't affect other types
of commands. Alfred also assumes that executables exit with a non-zero return
code if (and only if) an error occurs; not every program follows this convention.
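The return-code convention is easy to demonstrate. The Python sketch below runs a sequence of commands and stops at the first one that exits with a non-zero status; run_in_order is a hypothetical helper, and the commands are trivial Python one-liners standing in for real programs.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in turn; return the first one that fails, else None.

    A command counts as failed only if its exit status is non-zero, so a
    program that reports errors some other way would go unnoticed.
    """
    for argv in commands:
        result = subprocess.run(argv)
        if result.returncode != 0:
            return argv
    return None


succeeds = [sys.executable, "-c", "pass"]
fails = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_in_order([succeeds, succeeds]) is None)       # True
print(run_in_order([succeeds, fails, succeeds]) == fails)  # True
```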
Worklist commands may be messages to live apps.
Alfred supports the concept of live, or persistent, programs that act on
messages sent to them over the course of the job.
The worklist language is straightforward.
Scripts consist of (nested) calls to a small number of operators (which
define the programs to execute), and an ordering scheme (typically a hierarchy
of parallel dependencies). See script syntax
for more information.
An Alfred-compliant app is easy to create.
To create and spool Alfred jobs, an application creates a worklist script
file and then launches "alfred filename", which automatically starts a
dispatcher and monitor (if they aren't running already) and spools the
job.
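From the application's side, the two steps above might look like the following Python sketch. The function name spool_command and the ".alf" file suffix are assumptions; the script text is treated as opaque, and the command is only constructed here, not executed.

```python
import os
import tempfile


def spool_command(script_text, directory=None):
    """Write the worklist script to a file and return the command to launch.

    The returned list has the form ["alfred", filename], matching the
    "alfred filename" invocation described above.
    """
    fd, path = tempfile.mkstemp(suffix=".alf", dir=directory, text=True)
    with os.fdopen(fd, "w") as f:
        f.write(script_text)
    return ["alfred", path]


argv = spool_command("(worklist script text generated by the application)")
print(argv[0])  # alfred
```

On a host where Alfred is installed, the application would then pass the returned list to something like subprocess.run.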
Applications that are launched from within Alfred scripts
are typically invoked as they would be from the shell command line. In
addition to simple one-time commands that process their argument list,
it is also possible to use stdio to send command messages to launched apps.
See the compliant application notes for
more details.
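The stdio mechanism can be sketched with a persistent child process that reads one message per line and answers each one. Everything about the message format here is invented for illustration (the "done" reply, the frame messages, the helper names); the child is a small Python program standing in for a launched app.

```python
import subprocess
import sys

# Stand-in for a persistent app: reply to each line received on stdin.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('done ' + line.strip(), flush=True)\n"
)


def start_live_app():
    """Launch the child once; it stays alive until its stdin is closed."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send(app, message):
    """Send one message line and wait for the one-line reply."""
    app.stdin.write(message + "\n")
    app.stdin.flush()
    return app.stdout.readline().strip()


app = start_live_app()
print(send(app, "frame 1"))  # done frame 1
print(send(app, "frame 2"))  # done frame 2
app.stdin.close()            # end of input lets the child exit
app.stdout.close()
app.wait()
```

The point of the arrangement is that the program is started once and then serves many requests over the course of the job, instead of being relaunched for each one.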
Various system resource definitions and behavior-tuning parameters are
defined in simple configuration scripts, which can be overridden on a per-host
and per-user basis. Temporary files are created in a spool directory, and
they persist until the job is removed. See the manual page for the location
of these files.
We are sometimes asked about the meaning of the name "Alfred". While
the origin is shrouded in mystery, it has been said that Alfred is your
conscientious task master, an automaton's automaton, making sure that things
get done in an orderly and timely fashion, freeing you to attend philanthropic
dinner parties and wear rubber suits.