What is Alfred?
- Alfred is a Scriptable Work Distribution System
- Provides Fully Integrated Network Rendering for MTOR
- Automatically Handles Parallel Rendering using Pixar's RenderMan™
- Easily Scriptable for Custom Applications
- User scripts specify what to launch,
and keywords which describe server requirements; the script
structure also describes a hierarchy of task dependencies.
- Alfred chooses where and when
to run the task, based on the best available remote servers,
task dependencies, and job priority.
- A fully integrated user interface allows users to easily track job
progress, detect errors, browse output logs, reprioritize and
delete jobs.
Alfred 4.0 - Major Enhancements
- Enhanced Server Assignment Algorithm
- Server Metrics: Live network health data
can be used to make server assignment decisions.
- Preemption: The maitre-d, Alfred's server
assignment daemon, can now selectively interrupt long-running
low-priority jobs in favor of high-priority interactive jobs.
- Site Configurable Assignment DSO: The server
assignment logic is now provided as a separate Assigner DSO/DLL.
This module can be modified locally using the provided source code.
- Stand-alone executable: Alfred no longer requires a bundled
Tcl/Tk distribution.
- Many of the modifications in the 4.0 release have been "behind the scenes"
and are focused on performance, scalability, and robustness.
- Alfserver
- Alfred's new companion remote execution server.
- Superset of nrmserver - the RenderMan Network
Rendering Server.
- Also provides the useful properties of rsh
and ralf for Alfred jobs, plus better
job termination and signaling capabilities.
- Allows Alfred jobs to remotely launch renderers and
other applications on Windows without additional
software.
- Monitors site-defined performance metrics and reports
them to the Alfred maitre-d where they
can be used to make dispatching decisions.
Server Assignment DSO
Alfred 4.0 now allows sites to modify the default assignment policy
used by the maitre-d, or to create entirely new ones. The heart
of the assignment operation is implemented as a dynamically shared
object, which can be loaded from disk at run-time. The source code
for the default assigner is distributed to customers as part
of the RAT "devkit", which allows them to create site-specific
assigner DSOs, assuming that they have build tools for the SGI.
The code is in the file:
$RATTREE/devkit/examples/alfassigner/alfAssignFairness.cpp
The directory also contains an example Makefile and the interface
specification in the form of "AlfAssigner.h". The features
of this interface will grow in subsequent releases as we learn more
about the needs of our developers.
Server Metrics
Metrics are measurements of certain run-time system characteristics
such as currently available disk space or CPU usage. They are measured
by alfserver and reported to the maitre-d, among others.
Alfserver automatically collects several frequently requested metrics.
Sites may also define custom metrics and site-specific mechanisms for
collecting them.
When the maitre-d receives a server check-out request from a job, it can
use the current metrics to determine if a given server has sufficient
resources to carry out the request. The resource requirements for a given
command are specified in the Alfred job script. They are stated as a small
logical expression which returns either zero or non-zero. A non-zero
result is taken to mean that the system status is sufficient to bind the
given server to the requesting command.
Here's an example ...
##AlfredToDo 3.0
Job -title {metrics test} -subtasks {
    Task {one} -cmds {
        RemoteCmd {sleep 5 %s} -service {pixarNRM} \
            -metrics {[Mget loadavg] < 1.25 && [Mget disk] > 1000}
    }
}
The "-service" option should still be used to
pre-filter the list of candidate slots based on statically defined
slot keywords from the schedule file. The new "-metrics" option
specifies an expression that is evaluated at bind time, much as a ping is.
The built-in Mget command does a lookup of the current
value of the named metric, "loadavg" and "disk" in the example above.
The built-in metrics reported by alfservers in the current release are:
loadavg  | the system load-average, normalized by number of CPUs. Some typical values are: idle (0.01), busy (1.2), swamped (2.0)
memory   | free RAM, in megabytes. Does not include virtual memory.
disk     | free disk space, in megabytes, available to non-superusers on the partition containing "/" (or another partition).
hostname | the server hostname.
Note that the metrics expression will be evaluated for each available slot
which matched the "-service" keyword criteria. The Mget command knows
how to determine the metrics associated with the current slot. So far
we only collect metrics on a per-host basis, and all of the slots defined
on that host share the same current values.
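The per-slot evaluation described above can be pictured with a small stand-alone Tcl sketch. It is not Alfred's implementation: the hostMetrics table, the currentHost variable, the slot names, and this stand-in Mget procedure are all hypothetical, and the metric values are invented. The sketch only assumes that a metrics expression behaves like an ordinary Tcl expression.

```tcl
# Hypothetical per-host metric table; all values are invented.
array set hostMetrics {
    render01,loadavg 0.10   render01,disk 4000
    render02,loadavg 1.90   render02,disk 4000
    render03,loadavg 0.30   render03,disk 500
}

# Candidate slots as {slotName hostName} pairs, assumed to have already
# passed the "-service" keyword filter. The two slots on render01 share
# that host's metric values.
set slots {
    {render01-a render01}
    {render01-b render01}
    {render02-a render02}
    {render03-a render03}
}

# Stand-in for the built-in Mget: look up a metric for the host that
# owns the slot currently being evaluated.
proc Mget {name} {
    global hostMetrics currentHost
    return $hostMetrics($currentHost,$name)
}

set requirement {[Mget loadavg] < 1.25 && [Mget disk] > 1000}

set eligible {}
foreach slot $slots {
    set currentHost [lindex $slot 1]
    # A non-zero result means the slot may be bound to the command.
    if {[expr $requirement]} {
        lappend eligible [lindex $slot 0]
    }
}
puts $eligible
```

With these invented values only the two render01 slots qualify: render02 is too heavily loaded and render03 has too little free disk.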
The alfservers determine which metrics to report, and how to acquire them,
via some new configuration settings in the alfred.schedule. NOTE: the
alfservers do not actually read the alfred.schedule file; instead, they
make a multicast request for the current configuration which is answered
by "the most current" maitre-d. They also currently multicast their
sampled metric values, the idea being that multiple parties on the
network might want to collect them.
The format for specifying which metrics should be reported, in the
alfred.schedule file, is as follows:
set Metrics(NAME) {RefreshInterval AcquisitionMethod}
Here is a typical configuration:
set Metrics(loadavg) {10 {builtin loadavg}}
set Metrics(memory) {10 {builtin memory}}
set Metrics(disk) {10 {builtin disk /usr/tmp}}
The NAME is the key by which scripts will refer to a metric
using the "Mget" command.
The RefreshInterval is the number of seconds that the alfserver should
wait before resampling and retransmitting a new value for a particular
metric.
The AcquisitionMethod describes how the alfserver should obtain a
particular metric value; the following three schemes are currently
supported:
    builtin KEY    - reports one of the automatically acquired metrics
    exec STRING    - reports the output of 'popen(STRING)'
    constant VALUE - always reports VALUE for the named metric
Note that metrics are currently NOT limited to purely numeric values;
sites just need to write metrics expressions in their scripts that treat
string values correctly. Also, in the case of 'exec' above, only the
first line of output is reported as the metric value.
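As a sketch of the 'exec' and 'constant' schemes, a site might add entries like the following to its alfred.schedule file. The metric names "scratch" and "site", the script path, the refresh intervals, and the value "annex" are all hypothetical; only the entry format is taken from the description above.

```tcl
# alfred.schedule fragment (illustrative only).

# Resample every 60 seconds. The first line printed by this
# hypothetical site script becomes the value of "scratch".
set Metrics(scratch) {60 {exec /usr/local/bin/scratch_free_mb}}

# Always report the same string value for "site".
set Metrics(site) {300 {constant annex}}
```

A job script could then qualify servers with an expression such as {[Mget scratch] > 2000 && [Mget site] == "annex"}, assuming that metrics expressions follow Tcl's rules for comparing string values.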
Large Site Considerations
Metrics can replace pings in many instances. This is
especially important for sites that use Alfred to manage many
server slots or that use "expensive" custom pings.
The ping mechanism can be a very useful way to qualify the validity
of a host for a particular command, and the ping occurs just
moments before the command is actually launched so it can provide
very accurate status checks. However, the downside of this
last-second pinging is that the maitre-d is required to wait
while the ping is executed, thereby temporarily interrupting
other dispatching activity.
The new metrics scheme allows server
state to be gathered asynchronously by the servers themselves
and reported to the maitre-d periodically. The maitre-d then
only performs a simple table look-up against the most recent
collected values, which can happen very quickly.
Both mechanisms are supported in Alfred 4.0, and some thought
should be given to which type is appropriate for your site.
They can also be mixed: metrics can qualify some types of slots
and pings can qualify others.
Consider moving to metrics if at all possible. Also consider
disabling all pings, including the default basic ping
which tests whether a host is responding at all. The existence
of recent metrics is usually a completely satisfactory test for
basic server availability.
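The idea that recent metrics can stand in for a basic ping can be sketched as a simple age check against the time of each host's last report. This stand-alone Tcl example is an assumption about how such a test could look, not a description of the maitre-d: the lastReport table, the isAvailable procedure, the timestamps, and the 30-second limit are all hypothetical.

```tcl
# Hypothetical table: host -> time (in seconds) of its last metrics report.
array set lastReport {render01 1000 render02 940 render03 700}

# Treat a host as available if it has reported within maxAge seconds.
# A host that has never reported is treated as unavailable.
proc isAvailable {host now maxAge} {
    global lastReport
    if {![info exists lastReport($host)]} {
        return 0
    }
    return [expr {$now - $lastReport($host) <= $maxAge}]
}

set now 1010
set maxAge 30   ;# for example, three times a 10-second RefreshInterval
foreach host {render01 render02 render03 render04} {
    puts "$host [isAvailable $host $now $maxAge]"
}
```

The check is only a table look-up and a subtraction, so nothing has to wait on the remote host, which is the advantage the text describes over last-second pinging.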
Features added to the Alfred Job Script Language
- RemoteCmd - specifies that the named
command should be executed on a remote server rather than locally.
This is similar to invoking rsh(1) via the standard Cmd
directive. For example, the following line is often found in
existing Alfred scripts:
Cmd {rsh %h appname} -service {appserver}
With Alfred 4.0 it can often be restated using the new directive:
RemoteCmd {appname} -service {appserver}
The Cmd directive is best suited for launching client applications
which will contact a remote server themselves, such as
netrender. However, it is sometimes useful to launch a command
remotely without any client part executing on the local (dispatching)
host.
RemoteCmd allows you to express this idea directly without needing
to launch even the lightweight client rsh (or the Alfred-supplied
rsh cover script ralf).
RemoteCmd communicates directly with alfserver which is
an execution server running on each remote host. Alfserver launches
the given command and tracks its state. Unlike rsh, RemoteCmd can
retrieve accurate exit-status for the command from alfserver.
- Cmd [...] -metrics {expr}
Server requirements for a given command can now include
expressions involving live server metrics. These can
help to restrict the choice of available remote servers to
those that meet certain RAM, load-average, disk space, or
other run-time criteria. See the
Server Metrics section for more details.
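As a minimal sketch, the rsh-style Cmd line shown earlier in this section could carry a metrics requirement as follows. The job and task titles and the thresholds are invented for illustration; "appname" and "appserver" are the placeholders used above, and the script header is copied from the Server Metrics example.

```tcl
##AlfredToDo 3.0
Job -title {metrics on a Cmd} -subtasks {
    Task {run appname} -cmds {
        Cmd {rsh %h appname} -service {appserver} \
            -metrics {[Mget memory] > 256 && [Mget loadavg] < 1.0}
    }
}
```

Here "-service" still narrows the candidate slots by keyword, and the "-metrics" expression then rejects any server with too little free RAM or too high a load at the moment of assignment.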
Network Configuration: New Names and Well-Known Ports
Purpose | Default | Notes
alfred maitre-d inbound requests | TCP port 9000 | Specify an alternate by changing the "maitredPort" setting in $RATTREE/etc/alfred.ini
alfred dispatcher inbound requests | TCP port 9001 | Since several dispatchers may be running on the same host, this port is chosen dynamically. However, 9001 is always tried first. The actual port is recorded in the dispatcher lockfile (/tmp/.alfdp.*).
alfserver (nrmserver) netrender requests | TCP port 1500 | Specify an alternate in $RMANTREE/etc/netrman.ini
alfserver control protocol | TCP port 1501 | Alternate values can be specified in /etc/services (key name: alfserver/tcp) or in $RMANTREE/etc/netrman.ini (key name: /alfserver/port)
alfserver metrics multicasts | 239.255.224.99, UDP port 9002 | Specify an alternate by creating an entry in /etc/hosts or NIS/DNS maps for "alf-status.mcast.net"
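The "tried first, otherwise chosen dynamically" behavior described for the dispatcher port can be sketched in plain Tcl. This is only an illustration of the technique, not Alfred's code: the procedure names openListener and accept are hypothetical, and the sketch does not write a lockfile.

```tcl
# Hypothetical accept callback; a real service would read requests here.
proc accept {chan addr port} {
    close $chan
}

# Try the preferred port first. If it is already in use, ask the
# operating system for any free port by requesting port 0.
proc openListener {preferred} {
    if {[catch {socket -server accept $preferred} listener]} {
        set listener [socket -server accept 0]
    }
    return $listener
}

set listener [openListener 9001]
# fconfigure -sockname returns {address hostname port}.
set port [lindex [fconfigure $listener -sockname] 2]
puts "listening on TCP port $port"
close $listener
```

Because the port actually in use may differ from 9001, it has to be published somewhere that clients can find it, which is the role the table assigns to the dispatcher lockfile.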
Features in HTTP Wrangler Mode
- Job script information can now be examined easily
using either the standard user interface or the web interface.
Details of job script settings, which are similar to "command guts"
at the job level, can be displayed using the new "Job script info"
menu entry or button. These include: the name of the spool file,
the job done/error commands, job clean-up commands, added keys and
tags, the job comment string, and other current status information.
- Filtering of idle dispatchers in wrangler-mode listings:
new settings on the wrangler-mode (HTML) form allow the list of
displayed dispatchers to be filtered by activity. Wranglers can set
a time threshold, such as "1 day", and only dispatchers which have
been active within the given period are displayed. Idle dispatchers
are listed on a separate page. The alfred.ini file has a new entry
which defines the initial threshold; all dispatchers are listed by
default.
- Choice of unrolled or summary wrangler listings:
a new checkbutton on the (HTML) wrangler form specifies whether
dispatcher listings are displayed using the previous "unrolled"
mode showing all jobs for all users, or in the summarized mode
showing just dispatcher names with the wait / active / done /
error counts.
- Automatic re-initialization of maitre-d name:
dispatchers (and nimby-mode desk utilities) which have become
disconnected from the maitre-d now automatically re-examine
the alfred.ini file before reconnecting to determine
if the list of maitre-d hosts has changed. This allows a
site to redefine the maitre-d list without restarting all
of the running dispatchers (and their jobs). The procedure is:
(1) modify the alfred.ini to contain the new list, (2) start
the maitre-ds which weren't in the old list, (3) kill -TERM
the old maitre-ds which aren't in the new list. The fallback
mechanism will automatically cause the dispatchers to read the
new list and search for the first available maitre-d.
- Overlay new distribution. The 4.0 release contains
a mechanism that might help some sites synchronize the installation
of future Alfred distributions. Please contact
Pixar Customer Support
for more details on this evolving "overlay mechanism".
Miscellaneous
- Screen-savers considered harmful. If you are using normal
workstations as servers (desktops at night, for example), remember
to use a very light-weight screensaver, such as a blank black screen.
Many of the most popular screensavers consume a lot of CPU power and
will interfere with job distribution. Be aware that on some systems
screensavers have been known to run even if the system doesn't
physically have a monitor!
- Windows Note: rsh.exe has problems prior to SP5.
There were several important fixes to the Windows rsh client in NT
Service Pack 5. Most notably, the SP4 rsh would correctly connect
to the specified remote host on the first invocation; however,
subsequent invocations which attempt to connect to the same host
would hang and eventually time out with an error. This bug causes
Alfred jobs to stall if the job uses rsh or ralf
(which calls rsh indirectly). Affected sites should either
upgrade to SP5 or use another remote execution mechanism, such
as RemoteCmd and alfserver.
Warnings
The alfserver currently executes all commands as the
user who started the server. This has been considered
a feature for sites that have complained about the overhead
associated with rsh-style execution, which requires a complete
login and environment set-up for each command. However, it
is a potential security risk and therefore sites should strongly
consider starting alfserver from a user account which has very
strict limits on access to system resources.
Please contact Pixar customer support for more information.