The alfred.schedule file defines several server ping commands. These commands are run by the maitre_d as the last step in binding a remote server to a requesting dispatcher. Each entry defines a specialized "pre-flight" inquiry which is specific to a particular type of remote server (server in this context means the combination of a host and an application program which can do work for remote requestors).
By default, the maitre_d will test whether a remote host is reachable on the network before binding it to a dispatcher, thus avoiding execution errors related to machines which are down, etc. However, there are numerous other situations in which a server may not be ready to receive new work; the alfred ping mechanism provides a way to make site- and server-specific queries of arbitrary servers.
Note: Dispatching decisions which need to be based on "live" data about server state can sometimes be more efficiently handled using server metrics (new in Alfred 4.0). Metrics are site-defined values measured and reported periodically by alfserver, which is an optional component of the RenderMan Artist Tools.
Since alfred is frequently used to distribute netrender jobs it has a built-in alfserver query which makes a low overhead check of alfservers to determine:
The schedule file defines a table of server types and their associated ping commands (described below). The maitre-d picks the appropriate entry based on the incoming server request from a dispatcher. Consider a job in which there are Tasks that contain the following commands with service expressions:
Cmd {netrender -f -Progress %H some.rib} -service {pixarNRM}
[...]
Cmd {tar cf %h:/dev/tape /tmp/results} -service {tapeserver}
The first command will launch a netrender of the RIB file as soon as the maitre-d can deliver the name of an available rendering server of the type "pixarNRM". The maitre-d will look through the current schedule data to find a service whose Selection Keys contain the word "pixarNRM" and which is currently not assigned to another dispatcher. An example service entry might be:
Similarly, the second (somewhat contrived) command is waiting for a service which matches the key "tapeserver" before it can copy the file to a remote tape.
So, if we wanted to define ping commands for these two types of servers, we'd add schedule file ping table entries for pixarNRM and tapeserver, for example:
The command "Alfping" is the built-in ping described below. Any other entry is assumed to be the invocation of an executable (binary or script), which must exit with a return code of zero when the named server is ready to accept new work and non-zero otherwise. Ideally it will be a low-overhead program, since it may be called many times during the course of a job. The first line of output returned (on either stderr or stdout) will be used as a status message in the "watch servers" dialog when a server is unavailable.
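As an illustration of the contract just described, a site-written ping might look like the following minimal Python sketch. The marker-directory convention (`/shared/alfred/offline/<hostname>`), the script name `siteping`, and the function names are assumptions invented for this example and are not part of Alfred; only the exit-code rule and the use of the first output line as the status message come from the text above.

```python
#!/usr/bin/env python3
"""Hypothetical site ping: a server is "not ready" while a marker file exists."""
import os
import sys

# Assumed site convention (not an Alfred feature): administrators create
# <OFFLINE_DIR>/<hostname> to take a server out of rotation.
OFFLINE_DIR = "/shared/alfred/offline"


def check_server(host, offline_dir=OFFLINE_DIR):
    """Return (exit_code, status_line) for the given host."""
    marker = os.path.join(offline_dir, host)
    if os.path.exists(marker):
        # Non-zero exit code: the server must not be given new work.
        return 1, f"{host}: marked offline by site admin"
    # Zero exit code: the server is ready to accept new work.
    return 0, f"{host}: ready"


def main(argv):
    """Entry point: expects the hostname as the only argument."""
    if len(argv) != 2:
        print("usage: siteping hostname")
        return 2
    code, status = check_server(argv[1])
    # Keep the message on the first line of output, since only that line
    # is used as the status message.
    print(status)
    return code

# When installed as an executable, the script would finish with:
#     sys.exit(main(sys.argv))
```

The ping table entry for such a script would pass the host through one of the run-time substitution tokens, for example `siteping %h`. The check is deliberately cheap (a single file lookup) because the ping may run many times per job.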
The following tokens cause run-time substitutions:
If there is no matching ping entry for a particular key, then the entry labeled "default" is used.
If several keys match, for example if the -service expression is "{pixarNRM | pixarRAT}" then a list of the unique ping commands is constructed, and each must exit successfully for the server to be considered ready.
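The selection rules above (fall back to "default", build a list of unique commands, require every one to succeed) can be sketched in Python as follows. This is only an illustration of the described behavior, not the maitre-d's implementation; the table contents, the `tapecheck` path, the tcp port number, and the function names are hypothetical.

```python
# Hypothetical ping table: key -> ping command. The values are examples only.
EXAMPLE_TABLE = {
    "pixarNRM": "Alfping nrm %h 2 1.25 0",
    "pixarRAT": "Alfping nrm %h 2 1.25 0",
    "tapeserver": "/usr/local/bin/tapecheck %h",
    "default": "Alfping tcp %h 2 1999",
}


def pings_for_keys(ping_table, keys):
    """Return the unique ping commands for the matched keys, in first-seen order."""
    commands = []
    for key in keys:
        # Keys without their own entry use the entry labeled "default".
        command = ping_table.get(key, ping_table["default"])
        if command not in commands:
            commands.append(command)
    return commands


def server_ready(ping_table, keys, run_ping):
    """A server is ready only if every selected ping exits with code 0.

    run_ping is a caller-supplied function: command string -> exit code.
    """
    return all(run_ping(cmd) == 0 for cmd in pings_for_keys(ping_table, keys))
```

With this table, the expression "{pixarNRM | pixarRAT}" yields a single command, because both keys map to the same ping, so the server is queried only once.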
The built-in command Alfping handles several common cases: the simple query "is the host responding?", the common query "is the Pixar alfserver available?", and some low-latency variants. The calling syntax is:
Alfping tcp hostname lag port
Alfping nrm hostname lag loadavg memory
Alfping nrm~ hostname lag loadavg memory
Alfping metrics hostname {[Mget metric] expr...}
The hostname is the host to query; typically this is "%h" or "%i" in the ping definition.
The lag parameter is the number of seconds that the maitre-d should wait for a response before declaring a server temporarily unreachable.
The port parameter defines the TCP test-port for tcp mode, see below.
The loadavg parameter is a threshold on the remote server's CPU load-average; servers whose load is above the given value are skipped.
The memory parameter is the minimum available memory (Mbytes) required to consider a server available (0 implies: ignore memory use).
If the first argument is "tcp" then Alfping just performs the basic ping query which tests whether a host is responding on the network. It does this by initiating a TCP connection to the specified port and immediately disconnecting. The port should be the number of an unused TCP port; the refusal of the connection by the remote host (inetd) is both a low-overhead ping and an indication of some "higher brain" function on the host. See /etc/services (or ypcat services) for a list of reserved ports. Note that alfred does not broadcast ICMP packets, as ping(1) does, to query the host. Besides being somewhat antisocial, ICMP packets are often answered directly by the system's networking hardware and may not indicate that the operating system is really functional; this is beginning to be true of TCP service mapping as well, but it's not typical on most compute servers. By attempting to connect to an unused port, alfred can distinguish between "connection refused" messages and "host not responding"; for the purposes of a simple ping, the former is good, the latter is bad.
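The "connection refused is good, no response is bad" idea can be sketched with Python's standard socket module. This is an illustration of the technique only, not Alfping's actual code; the function names are invented for the example, and treating every error other than a refusal as "not responding" is a simplifying assumption.

```python
import socket


def classify_connect_result(error):
    """Map the outcome of a connect attempt to "host is alive" (True or False).

    error is None when the connection succeeded, otherwise the exception raised.
    """
    if error is None:
        return True   # something accepted the connection, so the host is up
    if isinstance(error, ConnectionRefusedError):
        return True   # the remote OS answered and refused: the host is up
    return False      # timeout, unreachable, name lookup failure, and so on


def tcp_ping(host, port, lag_seconds):
    """Try to connect, disconnect immediately, and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=lag_seconds):
            pass  # connected; closing right away is all that is needed
    except OSError as exc:
        return classify_connect_result(exc)
    return classify_connect_result(None)
```

Here `lag_seconds` plays the role of the lag parameter: if nothing answers within that time, the connect attempt raises a timeout error and the host is reported as not responding.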
If the first argument is "nrm" then Alfping uses a netrender server query to gather basic status information from the target host. When a Pixar alfserver process is running on the server, it will return system usage data which can be used by the maitre-d in its qualifying threshold comparisons. This nrm query also tells alfred whether all the rendering slots are currently in use, even if they weren't assigned through alfred. (Alfserver slots, in this sense, are just the number of parallel tasks that an alfserver will allow on its host, which is usually equal to the number of CPUs on the host, or as defined by the alfserver -a option, or in $RMANTREE/etc/rendermn.ini)
Low-latency pings
The "nrm~" variant is a "metrics approximation" of a full nrm ping. The ping immediately fails and doesn't run at all if there are no recent metrics from the host. Furthermore, the loadavg and memory threshold tests are performed first using current metrics data, rather than fetching them from the alfserver. If these pass, then a lower-than-usual-overhead nrm socket query is made to get the free slot count, which may be valuable at some sites since "rogue" netrenders may be using servers.
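The ordering of checks described for "nrm~" can be sketched as below. This is a hedged illustration rather than the real implementation: the metrics dictionary layout, the 60-second freshness window, and the `query_free_slots` callable (standing in for the lightweight socket query) are all assumptions made for the example.

```python
import time

MAX_METRICS_AGE = 60.0  # seconds; hypothetical definition of "recent"


def approx_nrm_ping(metrics, max_load, min_memory_mb, query_free_slots, now=None):
    """Return (ready, reason), applying the checks in the order described.

    metrics: dict with "timestamp", "loadavg" and "memory_mb" keys, or None
             if no metrics have been received from the host.
    query_free_slots: callable returning the host's free slot count.
    """
    now = time.time() if now is None else now
    # 1. No recent metrics: fail immediately without contacting the host.
    if metrics is None or now - metrics["timestamp"] > MAX_METRICS_AGE:
        return False, "no recent metrics"
    # 2. Threshold tests use the cached metrics, not a fresh alfserver query.
    if metrics["loadavg"] > max_load:
        return False, "load too high"
    # A memory threshold of 0 means "ignore memory use".
    if min_memory_mb > 0 and metrics["memory_mb"] < min_memory_mb:
        return False, "not enough free memory"
    # 3. Only now make the network query for the free slot count.
    if query_free_slots() <= 0:
        return False, "no free slots"
    return True, "ready"
```

The point of the ordering is that the only network traffic, the slot query, happens after all the cheap table lookups have already passed.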
The Alfping metrics built-in allows arbitrary metrics expressions as a ping. Thus a single metrics expression can be used to qualify all slots using a shared schedule key. This complements the existing "Cmd -metrics expr" construct which is finer grained and requires modifications to the job script itself. As a ping, these metrics expressions can be very low latency because everything is drawn from the current metrics table, no sockets are created, etc. For example:
Alfping metrics %i {[Mget loadavg] < 1.5}
Note that the term load average used above refers to the average length of a system's run-queue (recently unserviced processes). The uptime(1) command prints the recent load-averages for the last 1, 5, and 15 minute periods; xload(1) will show a running graph of the system load. Alfserver reports the one-minute load-average. Also, the returned value is normalized to account for multiple processors. Typical values for an "idle" system (running just normal background processes and terminal activity) will be between 0.0 and 0.25; busy systems will typically return numbers near 1.0; numbers greater than 2.0 often indicate an "overwhelmed" system. Load-average is not the same thing as the instantaneous CPU usage displayed by top(1) or other tools, and since it is a time average, it can decay slowly after a CPU-intensive process exits (several seconds or more). Unfortunately, load-average does not always reflect other performance bottlenecks such as paging, file access, or network traffic.
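To make the idea of a normalized load-average concrete, the sketch below divides the one-minute load-average by the processor count. Dividing by the number of CPUs is an assumption about what "normalized" means here; the text does not specify how alfserver computes the value, and the function names are invented for the example.

```python
import os


def normalized_load(load_1min, cpu_count):
    """Divide the one-minute load-average by the number of processors."""
    return load_1min / max(cpu_count, 1)


def local_normalized_load():
    """Measure the normalized one-minute load of the local host (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return normalized_load(one_minute, os.cpu_count() or 1)
```

Under this assumption, a four-processor host with a raw one-minute load of 4.0 reports 1.0, the same "busy" reading as a single-processor host with a raw load of 1.0, so one loadavg threshold can be used across differently sized machines.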
PRMan license checks
Normally if alfred is the only thing launching rendering jobs, then the maitre-d's information about free rendering servers will correspond to reality. However, sometimes netrender jobs are started from other sources. So alfred checks the available slot count to make sure a server is really available. Similarly, in "nrm" mode, alfred will attempt to confirm that a PRMan license is available before actually launching the netrender; again a license is sometimes consumed by a process which is not being tracked by alfred.
If this license availability test fails, then the server checkout is denied and the dispatcher will try again later. The Alfred job window will display the LicenseMax(prman) message as long as the availability test is failing. Occasionally, there can be other reasons for this test to fail besides all the licenses being in use by renderers. For example, the license server may be down or inaccessible from the maitre-d host, or the shell environment from which the maitre-d was launched may specify the wrong license.dat file (via some combination of RMANTREE or rendermn.ini).
Note: in unusual cases you can disable the available license check during the ping by appending an additional 0 (zero) after the other Alfping nrm parameters. An additional non-zero parameter indicates that the ping should check for the availability of a particular PRMan version, rather than the total of all license versions. Here are three variations on a ping definition which illustrate the use of this parameter:
Alfping nrm %h 2 1.25 0        | ... tests for PRMan license availability
Alfping nrm %h 2 1.25 0 0      | ... disables the PRMan availability test
Alfping nrm %h 2 1.25 0 3.900  | ... tests for PRMan 3.900 licenses only
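The three variations differ only in the optional trailing parameter. A minimal Python sketch of how such a definition could be interpreted is shown below; the function names and the returned dictionary layout are hypothetical, and this is not how Alfping itself parses its arguments.

```python
def license_check_mode(extra_arg=None):
    """Interpret the optional parameter after the Alfping nrm arguments.

    Returns ("any", None), ("disabled", None) or ("version", "<text>").
    """
    if extra_arg is None:
        return ("any", None)        # no extra parameter: check all versions
    if float(extra_arg) == 0:
        return ("disabled", None)   # explicit zero: skip the license test
    return ("version", extra_arg)   # non-zero: check that version only


def parse_nrm_ping(definition):
    """Split an 'Alfping nrm host lag loadavg memory [extra]' definition."""
    fields = definition.split()
    if fields[:2] != ["Alfping", "nrm"] or len(fields) not in (6, 7):
        raise ValueError("not an Alfping nrm definition: " + definition)
    host, lag, loadavg, memory = fields[2:6]
    extra = fields[6] if len(fields) == 7 else None
    return {
        "host": host,
        "lag": float(lag),
        "loadavg": float(loadavg),
        "memory": float(memory),
        "license": license_check_mode(extra),
    }
```

Note that in all three definitions the fourth numeric field (0) is the memory threshold, so only a fifth value, when present, affects the license test.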
Pixar Animation Studios