Metrics - Server Performance and Availability Information


Overview

Alfred server metrics are measurements of certain run-time system characteristics such as currently available disk space or CPU usage. They are measured by alfserver and reported to the maitre-d, among others.

Alfserver automatically collects several frequently requested metrics. Sites may also define custom metrics and site-specific mechanisms for collecting them.

Note: Metrics are an enhancement to the older maitre-d server ping mechanism. Metrics provide enhanced functionality by allowing scripts to specify per-Cmd server selection criteria based on live data which can be site-specific. Metrics also provide increased performance over the ping mechanism since metrics are acquired asynchronously by alfservers and then collected by the maitre-d. The values are then accessed using simple table look-ups rather than active pinging during the server selection process.

When the maitre-d receives a server check-out request from a job, it can use the current metrics to determine if a given server has sufficient resources to carry out the request. The resource requirements for a given command are specified in the Alfred job script. They are stated as a small logical expression which returns either zero or non-zero. A non-zero result is taken to mean that the system status is sufficient to bind the given server to the requesting command.

Here's an example ...


 ##AlfredToDo 3.0
 Job -title {metrics test} -subtasks {
   Task {one} -cmds {
      RemoteCmd {sleep 5 %s} -service {pixarNRM} \
         -metrics {[Mget loadavg] < 1.25 && [Mget disk] > 1000}
   }
 }

The "-service" option should still be used to pre-filter the list of candidate slots based on statically defined slot keywords from the schedule file. The new "-metrics" option specifies a bind-time expression, similar to a ping, that is evaluated against the current metric values.

The built-in Mget command looks up the current value of the named metric ("loadavg" and "disk" in the example above).
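
As a rough model of what happens at bind time, the following Python sketch treats one server's latest metrics as a plain dictionary and the -metrics expression as a predicate over it. This is not how the maitre-d is implemented; the names make_mget and sufficient, and the sample values, are hypothetical and exist only for illustration.

```python
# Illustrative model only. Alfred evaluates the -metrics expression itself;
# make_mget and sufficient are hypothetical names, not part of Alfred.

def make_mget(metrics):
    """Return a lookup function bound to one server's latest metric values."""
    def mget(name):
        return metrics[name]
    return mget


def sufficient(mget):
    """Python rendering of {[Mget loadavg] < 1.25 && [Mget disk] > 1000}."""
    return mget("loadavg") < 1.25 and mget("disk") > 1000


# Sample values, invented for the example.
idle_host = {"loadavg": 0.01, "disk": 52000}
busy_host = {"loadavg": 2.0, "disk": 52000}

print(sufficient(make_mget(idle_host)))  # True: lightly loaded, enough disk
print(sufficient(make_mget(busy_host)))  # False: load average too high
```

A true result corresponds to the non-zero case described above: the server may be bound to the requesting command.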

Built-in Metrics

The built-in metrics reported by alfservers in the current release are:

    loadavg    the system load average, normalized by the number of CPUs. Some typical values are: idle (0.01), busy (1.2), swamped (2.0).
    memory     free RAM, in megabytes. Does not include virtual memory.
    disk       free disk space, in megabytes, available to non-superusers on the partition containing "/" (or on another partition, when one is given in the metric definition).
    hostname   the server hostname.
    platform   a brief identification of the host and operating system.
    uptime     time since system boot.
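
To make the definitions above concrete, here is a minimal Python sketch of how comparable values could be sampled on a Unix host. It is not the alfserver's code. The choice of the 1-minute load average and the format of the platform string are assumptions, and memory and uptime are omitted because the Python standard library has no portable call for them.

```python
import os
import platform
import socket

MEGABYTE = 1024 * 1024


def sample_loadavg():
    """Load average divided by CPU count (Unix only).

    Assumption: the 1-minute figure is used; the text above does not say
    which averaging window alfserver reports.
    """
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def sample_disk(path="/"):
    """Free megabytes available to unprivileged users on path's partition."""
    st = os.statvfs(path)
    # f_bavail counts blocks available to non-superusers.
    return st.f_bavail * st.f_frsize // MEGABYTE


def sample_identity():
    """Hostname plus a brief platform string (string format is invented)."""
    return {
        "hostname": socket.gethostname(),
        "platform": f"{platform.system()} {platform.machine()}",
    }


print(sample_loadavg(), sample_disk("/"), sample_identity())
```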

Note that the metrics expression will be evaluated for each available slot which matched the "-service" keyword criteria. The Mget command knows how to determine the metrics associated with the current slot. So far we only collect metrics on a per-host basis, and all of the slots defined on that host share the same current values.
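
The per-slot evaluation can be pictured with the following Python sketch: slots are first filtered by service keyword, then the metrics predicate is applied to the values of the host that owns each slot. The Slot class, the candidate_slots function, and the rule that a host with no reported metrics is skipped are all assumptions made for the example, not documented maitre-d behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    host: str
    keywords: frozenset


def candidate_slots(slots, service, host_metrics, predicate):
    """Return slots that match the service keyword and pass the predicate."""
    chosen = []
    for slot in slots:
        if service not in slot.keywords:
            continue  # static pre-filter, like the -service option
        metrics = host_metrics.get(slot.host)
        if metrics is None:
            continue  # assumption: hosts with no metrics are not eligible
        if predicate(metrics):
            chosen.append(slot)
    return chosen


# Invented sample data: three slots on hostA, one on hostB.
slots = [
    Slot("a1", "hostA", frozenset({"pixarNRM"})),
    Slot("a2", "hostA", frozenset({"pixarNRM"})),
    Slot("c1", "hostA", frozenset({"other"})),
    Slot("b1", "hostB", frozenset({"pixarNRM"})),
]
# Metrics are kept per host, so a1, a2 and c1 all see the same values.
host_metrics = {
    "hostA": {"loadavg": 0.3, "disk": 4000},
    "hostB": {"loadavg": 2.0, "disk": 4000},
}

picked = candidate_slots(
    slots, "pixarNRM", host_metrics,
    lambda m: m["loadavg"] < 1.25 and m["disk"] > 1000,
)
print([s.name for s in picked])  # ['a1', 'a2']
```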

The alfservers determine which metrics to report, and how to acquire them, via some new configuration settings in the alfred.schedule. NOTE: the alfservers do not actually read the alfred.schedule file; instead, they make a multicast request for the current configuration which is answered by "the most current" maitre-d. They also currently multicast their sampled metric values, the idea being that multiple parties on the network might want to collect them.

Metrics should be defined as follows, in alfred.schedule and/or alfserver.ini:

    DefineMetric NAME -type args -interval nnn

Here are some examples:

    DefineMetric {memory} -builtin {memory}    -interval 10
    DefineMetric {disk}   -builtin {disk /tmp} -interval 10
    DefineMetric {RH} -exec {awk '{print $12}' /proc/version} -interval 1000
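
The relationship between a metric's name, its acquisition method, and its refresh interval can be sketched in Python as follows. This is a model under stated assumptions, not the alfserver's scheduler: MetricDef, exec_metric, and sample_due are hypothetical names, and the interval is assumed to be in seconds, which the text above does not state.

```python
import subprocess


class MetricDef:
    """One metric definition: a name, a refresh interval, and a sampler."""

    def __init__(self, name, interval, acquire):
        self.name = name
        self.interval = interval  # assumed to be seconds
        self.acquire = acquire    # zero-argument callable returning the value
        self.next_due = 0.0


def exec_metric(argv):
    """Acquisition modeled on -exec: run a command, return trimmed stdout."""
    def acquire():
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return done.stdout.strip()
    return acquire


def sample_due(defs, now):
    """Sample every metric whose interval has elapsed; return {name: value}."""
    report = {}
    for d in defs:
        if now >= d.next_due:
            report[d.name] = d.acquire()
            d.next_due = now + d.interval
    return report


# Stand-in samplers with fixed values keep the example deterministic.
defs = [
    MetricDef("memory", 10, lambda: 2048),
    MetricDef("RH", 1000, lambda: "7.3"),
]
print(sample_due(defs, now=0))   # both metrics are due on the first pass
print(sample_due(defs, now=10))  # only "memory" is due again
print(sample_due(defs, now=15))  # nothing is due
```

Metrics with a long interval, such as the "RH" example, are sampled rarely because their values seldom change, while fast-moving values get short intervals.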

 

An older form is also supported, but it is deprecated and will be removed in a future release:

    set Metrics(NAME)   {RefreshInterval   AcquisitionMethod}

For example, here is a typical configuration in the older form:

  set Metrics(loadavg)  {10 {builtin loadavg}}
  set Metrics(memory)   {10 {builtin memory}}
  set Metrics(disk)     {10 {builtin disk /usr/tmp}}

Note that metrics are currently NOT limited to purely numeric values. Sites that define string-valued metrics just need to write metrics expressions in their scripts that treat those string values correctly.
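
The point about string values can be illustrated with a small Python sketch of defensive comparisons. The helper names and the sample values (including the format of the "platform" string) are invented for the example; real job scripts express the same idea in the metrics expression itself.

```python
def matches(metrics, name, wanted_prefix):
    """True if a string-valued metric starts with the given prefix."""
    return str(metrics.get(name, "")).startswith(wanted_prefix)


def at_least(metrics, name, minimum):
    """True if a numeric metric meets the minimum.

    Missing or non-numeric values count as a failed test instead of
    raising an error.
    """
    try:
        return float(metrics[name]) >= minimum
    except (KeyError, TypeError, ValueError):
        return False


# Invented sample report; values arrive here as text.
report = {"platform": "Linux x86", "RH": "7.3", "memory": "512"}

print(matches(report, "platform", "Linux"))  # True
print(at_least(report, "memory", 256))       # True
print(at_least(report, "platform", 1))       # False: not a number
```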

 

Large Site Considerations

Threaded metrics handling: As of alfred 6.5.2, inbound metrics messages from alfservers can be handled by a separate thread within the maitre-d process. See the alfred.ini setting "metricsReceiveThreaded" for configuration information. This option should be considered at sites where the maitre-d's slot assignment throughput is hampered by large numbers of inbound metrics reports.
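
The general idea of moving report handling onto its own thread can be sketched in Python as shown here. This is a generic receiver-thread pattern, not the maitre-d's actual design: the MetricsTable class, the receiver function, and the use of a None sentinel to stop the thread are all assumptions for the example.

```python
import queue
import threading


class MetricsTable:
    """Latest metric values per host, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def update(self, host, values):
        with self._lock:
            self._values.setdefault(host, {}).update(values)

    def lookup(self, host, name):
        with self._lock:
            return self._values[host][name]


def receiver(inbox, table):
    """Apply inbound reports to the table until a None sentinel arrives."""
    while True:
        report = inbox.get()
        if report is None:
            break
        host, values = report
        table.update(host, values)


inbox = queue.Queue()
table = MetricsTable()
worker = threading.Thread(target=receiver, args=(inbox, table))
worker.start()

# Invented reports; in practice these would arrive from the network.
inbox.put(("hostA", {"loadavg": 0.4}))
inbox.put(("hostA", {"disk": 9000}))
inbox.put(None)  # tell the receiver to stop
worker.join()

print(table.lookup("hostA", "loadavg"), table.lookup("hostA", "disk"))  # 0.4 9000
```

With this arrangement the dispatching code only pays for a short locked table look-up, while parsing and storing reports happens on the receiver thread.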

Metrics can replace pings in many instances. This is especially important for sites that use alfred to manage a large number of server slots or that use "expensive" custom pings.

The ping mechanism can be a very useful way to qualify the validity of a host for a particular command, and the ping occurs just moments before the command is actually launched so it can provide very accurate status checks. However, the downside of this last-second pinging is that the maitre-d is required to wait while the ping is executed, thereby temporarily interrupting other dispatching activity.

The metrics scheme allows server state to be gathered asynchronously by the servers themselves and reported to the maitre-d periodically. The maitre-d then only performs a simple table look-up against the most recent collected values, which can happen very quickly.

Both mechanisms are supported, so some thought should be given to which type is appropriate for your site. They can also be mixed: metrics can qualify some types of slots and pings can qualify others.

Consider using "metrics-assisted" low-latency pings, if possible.

If nameserver lookups (DNS, NIS, etc.) are slow, consider using the "%i" hostname substitution rather than "%h". Using "%h" can cause a lookup each time the ping command is invoked, which may be many times a second.

Consider moving to metrics if at all possible. Also consider disabling all pings, including the default basic ping which tests whether a host is responding at all. The existence of recent metrics is usually a completely satisfactory test for basic server availability. Pings are disabled by using an empty string {} as the definition. NOTE: do not simply remove or comment out one of the RAT default definitions to disable it, because the hard-coded default will then be used instead.
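
The idea that recent metrics imply a live server can be sketched in Python as a freshness check on the time of each host's last report. The ReportLog class and the max_age threshold are hypothetical; the text above does not define how recent a report must be.

```python
class ReportLog:
    """Remembers when each host last reported metrics."""

    def __init__(self, max_age):
        self.max_age = max_age  # assumed threshold, in seconds
        self._last_seen = {}

    def record(self, host, now):
        self._last_seen[host] = now

    def is_available(self, host, now):
        """True if the host has reported within the last max_age seconds."""
        seen = self._last_seen.get(host)
        return seen is not None and (now - seen) <= self.max_age


log = ReportLog(max_age=30)
log.record("hostA", now=100)

print(log.is_available("hostA", now=120))  # True: reported 20 seconds ago
print(log.is_available("hostA", now=200))  # False: last report is stale
print(log.is_available("hostB", now=120))  # False: never reported
```

Because the check is a dictionary look-up, it costs far less than launching a ping command for each candidate host.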

 

Pixar Animation Studios
(510) 752-3000 (voice)   (510) 752-3151 (fax)
Copyright © 1996- Pixar. All rights reserved.
RenderMan® is a registered trademark of Pixar.