Alfred Version 4.0 - Overview of Major Features

What is Alfred

Alfred is a Scriptable Work Distribution System
Provides Fully Integrated Network Rendering for MTOR
Automatically Handles Parallel Rendering using Pixar's RenderMan^TM
Easily Scriptable for Custom Applications
User scripts specify what to launch, and keywords which describe server requirements; the script structure also describes a hierarchy of task dependencies.
Alfred chooses where and when to run the task, based on the best available remote servers, task dependencies, and job priority.
A fully integrated user interface allows users to easily track job progress, detect errors, browse output logs, reprioritize and delete jobs.

Alfred 4.0 - Major Enhancements

Enhanced Server Assignment Algorithm
- Server Metrics: Live network health data can be used to make server assignment decisions.
- Preemption: The maitre-d, Alfred's server assignment daemon, can now selectively interrupt long-running low-priority jobs in favor of high-priority interactive jobs.
- Site Configurable Assignment DSO: The server assignment logic is now provided as a separate Assigner DSO/DLL. This module can be modified locally using the provided source code.
Stand-alone executable: Alfred no longer requires a bundled Tcl/Tk distribution.
Many of the modifications in the 4.0 release have been "behind the scenes" and are focused on performance, scalability, and robustness.
Alfserver
- Alfred's new companion remote execution server.
- Superset of nrmserver - the RenderMan Network Rendering Server.
- Also provides the useful properties of rsh and ralf for Alfred jobs, plus better job termination and signaling capabilities.
- Allows Alfred jobs to remotely launch renderers and other applications on Windows without additional software.
- Monitors site-defined performance metrics and reports them to the Alfred maitre-d, where they can be used to make dispatching decisions.

Server Assignment DSO

Alfred 4.0 now allows sites to modify the default assignment policy used by the maitre-d, or to create entirely new ones. The heart of the assignment operation is implemented as a dynamically shared object, which can be loaded from disk at run-time. The source code for the default assigner is distributed to customers as part of the RAT "devkit", which allows them to create site-specific assigner DSOs, assuming that they have build tools for the SGI. The code is in the file:

   $RATTREE/devkit/examples/alfassigner/alfAssignFairness.cpp

The directory also contains an example Makefile and the interface specification in the form of "AlfAssigner.h". The features of this interface will grow in subsequent releases as we learn more about the needs of our developers.

Server Metrics

Metrics are measurements of certain run-time system characteristics such as currently available disk space or CPU usage. They are measured by alfserver and reported to the maitre-d, among others.

Alfserver automatically collects several frequently requested metrics. Sites may also define custom metrics and site-specific mechanisms for collecting them.

When the maitre-d receives a server check-out request from a job, it can use the current metrics to determine if a given server has sufficient resources to carry out the request. The resource requirements for a given command are specified in the Alfred job script. They are stated as a small logical expression which returns either zero or non-zero. A non-zero result is taken to mean that the system status is sufficient to bind the given server to the requesting command.

Here's an example ...


 ##AlfredToDo 3.0
 Job -title {metrics test} -subtasks {
   Task {one} -cmds {
      RemoteCmd {sleep 5 %s} -service {pixarNRM} \
         -metrics {[Mget loadavg] < 1.25 && [Mget disk] > 1000}
   }
 }

The "-service" option should still be used to pre-filter the list of candidate slots based on statically defined slot keywords from the schedule file. The new "-metrics" option then specifies a bind-time expression, similar to a ping, that is evaluated against each remaining candidate.

The built-in Mget command does a lookup of the current value of the named metric, "loadavg" and "disk" in the example above.
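
The lookup-and-test behaviour described above can be pictured with a small Python sketch. This is only an illustration, not Alfred's implementation: the names `mget` and `server_qualifies` and the sample values are hypothetical, and the real expression is written in the job script and evaluated by the maitre-d.

```python
# Hypothetical snapshot of the metrics most recently reported by one host.
latest = {"loadavg": 0.40, "memory": 512, "disk": 2400, "hostname": "render01"}

def mget(metrics, name):
    """Look up the current value of a named metric (stand-in for Mget)."""
    return metrics[name]

def server_qualifies(metrics):
    """Python equivalent of: [Mget loadavg] < 1.25 && [Mget disk] > 1000"""
    return mget(metrics, "loadavg") < 1.25 and mget(metrics, "disk") > 1000

print(server_qualifies(latest))  # True
```

A host reporting a load average of 2.0, or less than 1000 Megabytes of free disk, would fail the same test and would not be bound to the command.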

The built-in metrics reported by alfservers in the current release are:

loadavg	the system load-average, normalized by number of CPUs. Some typical values are: idle (0.01), busy (1.2), swamped (2.0)
memory	free RAM, in Megabytes. Does not include virtual memory.
disk	free disk, in Megabytes, available to non-superusers on the partition containing "/" (or on another partition named in the metric's configuration, such as /usr/tmp).
hostname	the server hostname.

Note that the metrics expression will be evaluated for each available slot which matched the "-service" keyword criteria. The Mget command knows how to determine the metrics associated with the current slot. So far we only collect metrics on a per-host basis, and all of the slots defined on that host share the same current values.
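
The two-stage selection (static "-service" keywords first, then the metrics expression per slot) might look like the following Python sketch. The slot and host names, the data layout, and the function `candidate_slots` are all hypothetical; treating a host with no reported metrics as unqualified is also an assumption made here for illustration.

```python
# Hypothetical data: slots on the same host share that host's metrics.
host_metrics = {
    "hostA": {"loadavg": 0.3, "disk": 5000},
    "hostB": {"loadavg": 1.9, "disk": 5000},
}
slots = [
    {"name": "hostA-1", "host": "hostA", "keywords": {"pixarNRM"}},
    {"name": "hostA-2", "host": "hostA", "keywords": {"pixarNRM"}},
    {"name": "hostB-1", "host": "hostB", "keywords": {"pixarNRM"}},
    {"name": "hostC-1", "host": "hostC", "keywords": {"appserver"}},
]

def candidate_slots(slots, host_metrics, service, predicate):
    """Pre-filter by service keyword, then test the metrics of each slot's host."""
    chosen = []
    for slot in slots:
        if service not in slot["keywords"]:
            continue  # rejected by the static "-service" filter
        metrics = host_metrics.get(slot["host"])
        # Assumption: a host that has reported nothing does not qualify.
        if metrics is not None and predicate(metrics):
            chosen.append(slot["name"])
    return chosen

ok = candidate_slots(
    slots, host_metrics, "pixarNRM",
    lambda m: m["loadavg"] < 1.25 and m["disk"] > 1000,
)
print(ok)  # ['hostA-1', 'hostA-2']
```

Both slots on hostA pass or fail together because they read the same per-host values.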

The alfservers determine which metrics to report, and how to acquire them, via some new configuration settings in the alfred.schedule. NOTE: the alfservers do not actually read the alfred.schedule file; instead, they make a multicast request for the current configuration which is answered by "the most current" maitre-d. They also currently multicast their sampled metric values, the idea being that multiple parties on the network might want to collect them.
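
Because the sampled values are multicast, any interested host can subscribe to them. The Python sketch below joins the default group and port listed in the network configuration section of this document. The layout of the datagrams is not described here, so the sketch only dumps raw bytes; the function names are hypothetical.

```python
import socket

GROUP = "239.255.224.99"  # default metrics multicast group from this document
PORT = 9002               # default UDP port from this document

def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq value used to join an IPv4 multicast group."""
    return socket.inet_aton(group) + socket.inet_aton(interface)

def open_listener(group=GROUP, port=PORT):
    """Return a UDP socket subscribed to the given multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock

def dump_datagrams(sock, count):
    """Print the sender and the first bytes of `count` datagrams."""
    for _ in range(count):
        payload, sender = sock.recvfrom(65535)
        # The payload format is unknown to this sketch, so show it raw.
        print(sender[0], payload[:80])
```

Calling `dump_datagrams(open_listener(), 10)` on a host in the same multicast scope would show the next ten reports, if any servers are transmitting.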

The format for specifying which metrics should be reported, in the alfred.schedule file, is as follows:

set Metrics(NAME) {RefreshInterval AcquisitionMethod}

Here is a typical configuration:

  set Metrics(loadavg)  {10 {builtin loadavg}}
  set Metrics(memory)   {10 {builtin memory}}
  set Metrics(disk)     {10 {builtin disk /usr/tmp}}

The NAME is the key by which scripts will refer to a metric using the "Mget" command.

The RefreshInterval is the number of seconds that the alfserver should wait before resampling and retransmitting a new value for a particular metric.

The AcquisitionMethod describes how the alfserver should obtain a particular metric value; the following three schemes are currently supported:

   builtin KEY     -  reports one of the automatically acquired metrics

   exec STRING     -  reports the output of 'popen(STRING)'

   constant VALUE  -  always reports VALUE for the named metric

Note that metrics are currently NOT limited to purely numeric values; sites just need to write metrics expressions in their scripts that treat string values correctly. Also, in the case of 'exec' above, only the first line of output is reported as the metric value.
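
The three acquisition schemes can be summarized in a short Python sketch. It is not alfserver's code: the function `acquire`, the `BUILTIN_SAMPLERS` table, and the sample values are hypothetical, and the 'exec' branch assumes a POSIX shell.

```python
import subprocess

# Hypothetical stand-ins for values that alfserver samples by itself.
BUILTIN_SAMPLERS = {
    "hostname": lambda *args: "render01",
}

def acquire(method):
    """Return one metric value for a method such as ["exec", "uptime"]."""
    scheme, *args = method
    if scheme == "builtin":
        key, *rest = args
        return BUILTIN_SAMPLERS[key](*rest)
    if scheme == "exec":
        # Run the command string through the shell, much like popen(STRING),
        # and keep only the first line of its output.
        result = subprocess.run(" ".join(args), shell=True,
                                capture_output=True, text=True)
        lines = result.stdout.splitlines()
        return lines[0] if lines else ""
    if scheme == "constant":
        return args[0]
    raise ValueError(f"unknown acquisition method: {scheme}")

print(acquire(["constant", "linux"]))           # linux
print(acquire(["exec", "echo one; echo two"]))  # first line only
```

Values come back as strings in the 'exec' and 'constant' cases, which is why metrics expressions need to handle string values correctly.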

Large Site Considerations

Metrics can replace pings in many instances. This is especially important for sites that use Alfred to manage many server slots or that use "expensive" custom pings.

The ping mechanism can be a very useful way to qualify the validity of a host for a particular command, and the ping occurs just moments before the command is actually launched so it can provide very accurate status checks. However, the downside of this last-second pinging is that the maitre-d is required to wait while the ping is executed, thereby temporarily interrupting other dispatching activity.

The new metrics scheme allows server state to be gathered asynchronously by the servers themselves and reported to the maitre-d periodically. The maitre-d then only performs a simple table look-up against the most recent collected values, which can happen very quickly.

Both mechanisms are supported in Alfred 4.0, and some thought should be given to which type is appropriate for your site. They can also be mixed: metrics can qualify some types of slots and pings can qualify others.

Consider moving to metrics if at all possible. Also consider disabling all pings, including the default basic ping which tests whether a host is responding at all. The existence of recent metrics is usually a completely satisfactory test for basic server availability.
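
The idea of using recent metrics as an availability test can be sketched in Python as a table of last-reported values with arrival times. The class `MetricsTable`, its method names, and the 30-second threshold are hypothetical choices for illustration only.

```python
import time

class MetricsTable:
    """Latest metrics per host, with the time each report arrived."""

    def __init__(self):
        self._rows = {}

    def report(self, host, metrics, now=None):
        """Record a report from a server (called whenever one arrives)."""
        stamp = time.time() if now is None else now
        self._rows[host] = (stamp, dict(metrics))

    def lookup(self, host, name):
        """Dispatch-time table look-up; no network traffic is involved."""
        return self._rows[host][1][name]

    def is_alive(self, host, max_age, now=None):
        """Treat a host as available if it reported within max_age seconds."""
        if host not in self._rows:
            return False
        now = time.time() if now is None else now
        return now - self._rows[host][0] <= max_age

table = MetricsTable()
table.report("hostA", {"loadavg": 0.2}, now=100.0)
print(table.is_alive("hostA", max_age=30, now=120.0))  # True
print(table.is_alive("hostA", max_age=30, now=200.0))  # False
```

A sensible `max_age` would be a small multiple of the configured RefreshInterval, so that one delayed report does not mark a healthy server as unavailable.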

Features added to the Alfred Job Script Language

RemoteCmd - specifies that the named command should be executed on a remote server rather than locally. This is similar to invoking rsh(1) via the standard Cmd directive. For example, the following line is often found in existing Alfred scripts:
```
		Cmd {rsh %h appname} -service {appserver}
```
With Alfred 4.0 it can often be restated using the new directive:
```
		RemoteCmd {appname} -service {appserver}
```
The Cmd directive is best suited for launching client applications which will contact a remote server themselves, such as netrender. However, it is sometimes useful to launch a command remotely without any client part executing on the local (dispatching) host.
RemoteCmd allows you to express this idea directly without needing to launch even the lightweight client rsh (or the Alfred-supplied rsh cover script ralf).
RemoteCmd communicates directly with alfserver which is an execution server running on each remote host. Alfserver launches the given command and tracks its state. Unlike rsh, RemoteCmd can retrieve accurate exit-status for the command from alfserver.
Cmd [...] -metrics {expr}
Server requirements for a given command can now include expressions involving live server metrics. These can help to restrict the choice of available remote servers to those that meet certain RAM, load-average, disk space, or other run-time criteria. See the metrics discussion for more details.

Network Configuration: New Names and Well-Known Ports

Purpose	Default	Notes
alfred maitre-d inbound requests	TCP port 9000	Specify an alternate by changing the "maitredPort" setting in $RATTREE/etc/alfred.ini
alfred dispatcher inbound requests	TCP port 9001	Since several dispatchers may be running on the same host, this port is chosen dynamically. However 9001 is always tried first. The actual port is recorded in the dispatcher lockfile (/tmp/.alfdp.*).
alfserver (nrmserver) netrender requests	TCP port 1500	Specify an alternate in $RMANTREE/etc/netrman.ini
alfserver control protocol	TCP port 1501	Alternate values can be specified in /etc/services (key name `alfserver/tcp`) or $RMANTREE/etc/netrman.ini (key name `/alfserver/port`)
alfserver metrics multicasts	239.255.224.99, UDP port 9002	Specify an alternate by creating an entry in /etc/hosts or NIS/DNS maps for "alf-status.mcast.net"
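
The dispatcher's "try 9001 first" behaviour can be illustrated with a Python sketch. The text above says only that the port is chosen dynamically; searching the next ports in sequence, the function name `bind_first_available`, and the limit of 50 attempts are assumptions made for this example.

```python
import socket

def bind_first_available(preferred_port, host="127.0.0.1", max_tries=50):
    """Try preferred_port first, then later ports; return a listening socket."""
    last_port = min(preferred_port + max_tries, 65536)
    for port in range(preferred_port, last_port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port already in use, try the next one
            continue
        sock.listen()
        return sock
    raise OSError("no free port found")
```

A caller would use `bind_first_available(9001)` and then read the actual port from `getsockname()[1]`, which corresponds to the value a dispatcher records in its lockfile.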

Features in HTTP Wrangler Mode

Job script information can now be examined easily using either the standard user interface or the web interface. Details of job script settings, which are similar to "command guts" at the job level, can be displayed using the new "Job script info" menu entry or button. These include: the name of the spool file, the job done/error commands, job clean-up commands, added keys and tags, the job comment string, and other current status information.
Filtering of idle dispatchers in wrangler-mode listings: new settings on the wrangler-mode (HTML) form allow the list of displayed dispatchers to be filtered by activity. Wranglers can set a time threshold, such as "1 day", and only dispatchers which have been active within the given period are displayed. Idle dispatchers are listed on a separate page. The alfred.ini file has a new entry which defines the initial threshold; all dispatchers are listed by default.
Choice of unrolled or summary wrangler listings: a new checkbutton on the (HTML) wrangler form specifies whether dispatcher listings are displayed using the previous "unrolled" mode showing all jobs for all users, or in the summarized mode showing just dispatcher names with the wait / active / done / error counts.
Automatic re-initialization of maitre-d name: dispatchers (and nimby-mode desk utilities) which have become disconnected from the maitre-d now automatically re-examine the alfred.ini file before reconnecting to determine if the list of maitre-d hosts has changed. This allows a site to redefine the maitre-d list without restarting all of the running dispatchers (and their jobs). The procedure is: (1) modify the alfred.ini to contain the new list, (2) start the maitre-ds which weren't in the old list, (3) kill -TERM the old maitre-ds which aren't in the new list. The fallback mechanism will automatically cause the dispatchers to read the new list and search for the first available maitre-d.
Overlay new distribution. The 4.0 release contains a mechanism that might help some sites synchronize the installation of future Alfred distributions. Please contact Pixar Customer Support for more details on this evolving "overlay mechanism".
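
The maitre-d fallback described above (re-read the list, then take the first host that answers) can be sketched in Python. The function `reconnect`, the two callables it takes, and the host names are hypothetical; how alfred.ini is parsed and how a maitre-d is probed are deliberately left out.

```python
def reconnect(load_hosts, is_reachable):
    """Re-read the configured list, then pick the first reachable maitre-d."""
    hosts = load_hosts()  # re-read on every reconnect, never cached
    for host in hosts:
        if is_reachable(host):
            return host
    return None

# Hypothetical example: the list was edited and the old maitre-d was stopped.
new_list = ["md-old", "md-new1", "md-new2"]
running = {"md-new1", "md-new2"}
print(reconnect(lambda: new_list, lambda h: h in running))  # md-new1
```

Because the list is read again on each attempt, a dispatcher that loses its connection picks up the edited list without being restarted.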

Miscellaneous

Screen-savers considered harmful. If you are using normal workstations as servers (desktops at night, for example), remember to use a very light-weight screensaver, such as a blank black screen. Many of the most popular screensavers consume a lot of CPU power and will interfere with job distribution. Be aware that on some systems screensavers have been known to run even if the system doesn't physically have a monitor!
Windows Note: rsh.exe has problems prior to SP5. There were several important fixes to the Windows rsh client in NT Service Pack 5. Most notably, the SP4 rsh would correctly connect to the specified remote host on the first invocation; however, subsequent invocations which attempted to connect to the same host would hang and eventually time out with an error. This bug causes Alfred jobs to stall if the job uses rsh or ralf (which calls rsh indirectly). Affected sites should either upgrade to SP5 or use another remote execution mechanism, such as RemoteCmd and alfserver.

Warnings

The alfserver currently executes all commands as the user who started the server. This has been considered a feature for sites that have complained about the overhead associated with rsh-style execution, which requires a complete login and environment set-up for each command. However, it is a potential security risk and therefore sites should strongly consider starting alfserver from a user account which has very strict limits on access to system resources.

Please contact Pixar customer support for more information.