- Codename Project BatCave
Alfred MySQL Logging and Web-based Database Browsing
The project codenamed "BatCave" is an effort to export Alfred status
and task execution history to a generic database in a form which allows
people to develop monitoring and analysis tools which are specific
to their needs and interests.
Alfred components, such as the dispatcher, maitre-d, and alfserver
have been modified to (optionally) log various data and events directly
to a MySQL database.
The database provides a highly programmable environment for creating
custom status queries and for doing historical analysis. The database
server can also offload the live status reporting/formatting function
performed by the maitre-d in "watch-servers" mode, which can improve
the maitre-d performance at large sites.
A php-based web interface for viewing the records in this SQL database
is also shipped as part of the project. The web interface provides a
broad range of basic functions and is fully customizable, so new
site-specific queries and reports can easily be added.
Several new alfred.ini settings control the basic database logging
capability. Bootstrap scripts are also provided for creating the
database tables and the web interface.
This capability has broad implications for integrating Alfred with
existing production databases, as well as with site control and
planning projects. Alfred features in this area will continue to
grow and evolve from the initial groundwork being released here.
For more details, see the Project BatCave
documentation.
Your job scripts can update MySQL records too --
In addition to the built-in Alfred data logging, Alfred scripts can
also add or modify records in the database. One approach is to simply
have one of the job tasks invoke a command-line application that
makes the desired updates; this could be a custom application that
makes use of the MySQL API, or it could be a direct or indirect
invocation of the 'mysql' command-line client which comes with MySQL.
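As a rough illustration of the command-line approach, the Python sketch below builds the argument list for a 'mysql' client call that sets the "jobgroup" column for one job. The helper name mysql_jobgroup_argv, the connection values, and the restriction on allowed characters are assumptions made for this example; only the Job table, its "jid" and "jobgroup" columns, and the client flags come from these notes.

```python
import re

# Conservative whitelist so the value can be embedded in SQL text safely.
# This restriction is an assumption of the sketch, not an Alfred rule.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_.-]+")

def mysql_jobgroup_argv(host, user, db, jid, jobgroup):
    """Build the argv for a 'mysql' call that sets Job.jobgroup for one job."""
    if not SAFE_VALUE.fullmatch(jobgroup):
        raise ValueError("jobgroup contains characters this sketch does not allow")
    sql = f"update Job set jobgroup='{jobgroup}' where jid={int(jid)}"
    return ["mysql", "-h", host, "-u", user, "-D", db, "-e", sql]

# Hypothetical connection values and job id, for illustration only.
argv = mysql_jobgroup_argv("host", "user", "db", 1042, "shot-c57a")
print(argv[-1])
```

A wrapper script could pass the resulting list to subprocess.run; inside a job script the jid would normally come from the %J substitution rather than a literal.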
There are also new Alfred scripting options,
Job -sqlset and Task -sqlset
which can update the BatCave tables directly. The typical usage
would involve creating one or more site-specific columns in the
existing Job or Task tables; these custom columns
would then be updated with values from the job script when the
corresponding Job or Task record is created. One generic column
called "jobgroup" is already provided in the default
Job table; it is intended for arbitrary site use in this
way. For example:
Job -title "my job" \
-sqlset {jobgroup='shot-c57a'} \
-subtasks ...
The Cmd/RemoteCmd "launch expression" which defines the command-line
to be executed may contain several special "%" substitution macros
that are expanded by Alfred before the command is launched. There
are substitutions for the names of bound servers, etc. A few new
ones have been added which are especially useful when dealing with
the batcave databases:
- %J expands to the batcave MySQL Job "jid" for the current job.
- %j expands to the internal dispatcher job-id
for the current job. Note that these are not globally unique.
- %t expands to the Task "tid" for the current task.
While not globally unique, it is unique within the job, and is
used both internally by Alfred and in the batcave tables (as "tid").
- %c expands to the batcave MySQL Cmd "cmdid" for the current
Cmd/RemoteCmd.
So an example usage might be the following pointless, and somewhat recursive,
command:
Cmd {mysql -h host -u user -D db \
-e {select commandline from Cmd where jid=%J \
and tid=%t and cmdid=%c}}
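To make the substitution step concrete, here is a minimal Python sketch of how "%" macros of this kind could be expanded before a command is launched. It is illustrative only and is not Alfred's implementation; the function name expand_macros and the sample id values are hypothetical.

```python
import re

def expand_macros(launch_expr, values):
    """Replace %J, %j, %t and %c in launch_expr with entries from values.

    Unknown %-sequences are left untouched (an assumption of this sketch).
    """
    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)
    return re.sub(r"%([A-Za-z])", substitute, launch_expr)

# Hypothetical ids, for illustration only.
ids = {"J": 1042, "j": 7, "t": 3, "c": 15}
query = "select commandline from Cmd where jid=%J and tid=%t and cmdid=%c"
print(expand_macros(query, ids))
# prints: select commandline from Cmd where jid=1042 and tid=3 and cmdid=15
```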
- New and Different settings in alfred.ini
Alfred administrators are encouraged to browse through the
new alfred.ini and perhaps "diff" it with the one
currently in use. There have been several changes and additions
which may be relevant. For example, there is now a way to limit
the number of output log records retained on disk for each task
in a job; this can help reduce log file sizes when tasks have
"run-away" diagnostics (like prints in shaders).
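The effect of a per-task record limit (the "maxTaskOutput" setting listed later in these notes) can be pictured with the small Python sketch below. It assumes that records arriving after the limit is reached are simply dropped; these notes do not say which records Alfred actually keeps, and the class name CappedTaskLog is hypothetical.

```python
class CappedTaskLog:
    """Keep at most max_records output records for one task."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.records = []
        self.dropped = 0

    def append(self, line):
        # Assumption: once the cap is reached, later records are discarded.
        if len(self.records) < self.max_records:
            self.records.append(line)
        else:
            self.dropped += 1

# A task with "run-away" diagnostics, capped at 5000 records.
log = CappedTaskLog(max_records=5000)
for i in range(12000):
    log.append(f"shader print {i}")
print(len(log.records), log.dropped)  # prints: 5000 7000
```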
- Enable/Disable the "Watch Servers" and
"Master Schedule" menu items
There are new configuration settings in alfred.ini for
controlling whether the Watch Servers and
Master Schedule menu items are enabled or
disabled in the Alfred user interface. Large sites
may want to consider disabling Watch Servers
in particular since it can add considerable load to the
maitre-d process, when there are a lot of servers to
monitor. Also, many of the Project BatCave features (above)
are intended to provide similar or improved functionality.
Alfserver and the Alfred maitre-d "discover" each other on the network
using multicast packets addressed to a particular multicast "session"
address.
Having found each other, alfservers then deliver periodic status
updates called metrics to the maitre-d (and now also to a MySQL
database, see the BatCave discussion above). These metrics are
used as a basic measure of server health and the values can be
used to make specific server assignment decisions. Starting with
Alfred 6.5, metrics are reported to the maitre-d using point-to-point
"unicast" udp packets. In previous releases the metrics were also
multicast back to the discovery address, for use by potentially many
interested listeners. The unicast approach can reduce some network
overhead at large sites, especially in situations where the one-to-many
nature of multicast traffic causes problems for smart network
switches that try to optimize one-to-one communications.
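The difference from multicast is that each datagram has exactly one receiver. The Python sketch below sends a single UDP datagram point-to-point over the loopback interface. The payload is a placeholder, the function name send_metrics_unicast is hypothetical, and the real alfserver metrics format and port are not described here.

```python
import socket

def send_metrics_unicast(payload, host, port):
    """Send one UDP datagram straight to a single receiver (unicast)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Loopback demonstration: a stand-in receiver socket on an ephemeral port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

send_metrics_unicast(b"placeholder-metrics", "127.0.0.1", port)
data, sender = receiver.recvfrom(4096)
receiver.close()
print(data)
```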
Routers on the network ensure that the multicast discovery messages
are delivered to all "subscribed" systems. By default, Alfred and
alfserver use the multicast "session" address 239.255.224.99, port
9002/udp. Sites can change this multicast address by adding the
hostname "alf-status" to the site nameserver (e.g. DNS, NIS,
/etc/hosts, etc), and picking a new multicast address for it from
the multicast range (224.0.0.0 - 239.255.255.255). Note that there
are IANA numbering conventions which apply to multicast addresses.
Alternatively, conventional "unicast" communications can be used
for both discovery as well as metrics delivery. This is done by
simply adding "alf-status" as a hostname alias for the maitre-d
host's regular IP address, rather than using a multicast address.
This approach is actually a way to bypass the "discovery" phase.
The alfserver metrics will be sent as standard UDP packets directly
to the named maitre-d. Note that this approach should not be used
with fallback maitre-ds, since alfservers would only know about
the one named host, and metrics would only be delivered to that
one host.
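Whether a site ends up in multicast or unicast mode depends only on the kind of address that "alf-status" maps to. The Python sketch below classifies an address against the multicast range quoted above, using the standard ipaddress module. The function name discovery_mode and the sample unicast address are hypothetical; a real check would first resolve the "alf-status" hostname through the site nameserver.

```python
import ipaddress

DEFAULT_SESSION_ADDRESS = "239.255.224.99"  # default given in these notes

def discovery_mode(alf_status_address):
    """Return 'multicast' or 'unicast' for the address alf-status maps to."""
    addr = ipaddress.ip_address(alf_status_address)
    # is_multicast is true for IPv4 addresses in 224.0.0.0 - 239.255.255.255.
    return "multicast" if addr.is_multicast else "unicast"

print(discovery_mode(DEFAULT_SESSION_ADDRESS))  # prints: multicast
print(discovery_mode("10.0.0.12"))              # prints: unicast
```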
A new alfserver configuration setting, "metricsDelivery", can now
be set to "multicast" to force metrics to be sent to the multicast
address (as in releases prior to 6.5), so that any other interested
listeners can receive them simultaneously.
There is also a new way to deliver configuration overrides to all
alfservers from the maitre-d: create a file called
$RATTREE/etc/alfsite.ini containing the overrides, in
an ini location accessible to the maitre-d. Its
contents will be sent by the maitre-d to the alfservers as part of
the discovery process, along with the site metrics definitions.
- New Task Menu Item: Try this task next
There is a new item that appears on the Task menu when you click
on a particular task's box in the job diagram window. This new
entry allows you to request that the given task should be dispatched
next, if possible. This simply changes the local dispatcher's
"next task" logic and does not affect the actual job
priorities relative to jobs from other dispatchers. This action
is only available on tasks that are "Ready" to execute.
The "inner loop" of the maitre-d server assignment algorithm has
been changed. The new code is both more uniform (there's only
one assignment entry point), and more accurate in the face of
a wide-ranging mix of incoming request types and frequencies.
The assigner code is also provided in source-code form, as
has been true in prior releases, so sites can create an
"assigner plug-in" that implements an alternative set of policies.
Note: there is currently no backward compatibility support for
assigner plug-ins written for prior releases. Sites that have
such plug-ins will need to port the relevant changes to the new
algorithm. The existence of old plug-ins will not cause errors,
since the maitre-d will just fall back to using the default
built-in scheme. It distinguishes new plug-ins by searching for
the new, versioned, name of the assigner object factory method.
- Improved the handling of assignment requests, see above.
- Improved the load-balancing among jobs on the local dispatching
queue when in "job parallel" mode.
- Connections to remote Alfred dispatchers, using "alfred -h user@host",
now use the site maitre-d, if available, to determine the remote connection
port. The prior use of 'rsh' for this purpose, while nominally
somewhat more secure, added unnecessary complexity at most sites and
is increasingly unlikely to work as rsh support dwindles. A new ini
setting (rshForDispatcherDiscovery) can be used to restore the old
behavior.
- Certain "alfserver not responding" situations are now handled
more correctly, and those servers are more consistently taken out of the
assignment pool for a period specified by "timerAvoidNoListener".
- A bug was fixed in the handling of Alfred "maitredHost" lists in
configuration files other than $RATTREE/etc/alfred.ini, such as found
via $RAT_SCRIPT_PATHS. If the primary maitre-d went offline, dispatchers
using the alternate configuration file locations would sometimes end
up in "chaos mode" (using a private, local, maitre-d), and then be
unable to reconnect to the main maitre-d when it came back online.
- Fetching task output logs with lines longer than 1024 characters
sometimes failed due to faulty encoding for transmission. This
has been fixed.
- Support for handling log files greater than 2GB in size has been
enabled on Linux systems. This should fix problems loading
existing job checkpoints for large jobs, and address crashes
or other misbehavior when logs grew beyond 2GB.
- A new alfred.ini setting, "maxTaskOutput",
limits the number of records logged on a per-task basis.
Some problems with large task output logs can be avoided
if no individual task is allowed to log more than 5000
records, for example.
- Upon receipt of SIGHUP on unix-style systems,
the Alfred dispatcher and maitre-d now explicitly close and
reopen their diagnostic log files (as specified by the
"-log filename" command-line option).
This should allow them to interoperate better with log
rotation facilities such as logrotate
on Linux systems.
- Fixed some cases where paths containing blank spaces would cause
problems for launching certain applications.
- Fixed several problems involving Alfserver's handling of RMANCONFIG.
- Better temporary file names for Alfred jobs and logs are now chosen,
to prevent occasional collisions on some systems.
- An issue with retrying preflight tasks in a job that caused Alfred
to crash has been fixed.
- Several problems related to "skipping" subtasks of a "shared server"
parent task have been fixed.
- The timerMaitredQueue setting from alfred.ini is now obeyed properly.
- Alfred's Help menu now uses the HelpURLs set of preferences.
- There are several minimum version requirements for the servers
used for the "BatCave" functions (e.g. php, mysql, alfserver).
See the
Project BatCave documentation for details.
- Alfserver support for scriptable, key-based, per-command,
environment configuration was recently extended to include
support for "netrender -R key ..." This includes the ability
to select which user will own the resulting prman process.
Currently, the alfserver ownership mode "login" is not a viable option
for netrender connections. This is because netrender and prman exchange
some of their data on sockets which are connected to stdin and stdout;
these connections do not survive the login set-up. The alternative
"setuid" mode works as expected, and is frequently a better choice
anyway, from an administrative point of view. Note that "login"
continues to be supported for RemoteCmd usage, although again,
"setuid" is usually the more manageable choice.
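The SIGHUP log-reopening behavior listed above follows a common pattern for cooperating with log rotation: the rotation tool renames the old file, sends SIGHUP, and the process then reopens its log path so new output lands in a fresh file. The Python sketch below shows that general pattern on a unix-style system. It is not Alfred's code, and the names ReopenableLog and install_sighup_reopen are hypothetical.

```python
import os
import signal

class ReopenableLog:
    """Append-only log whose file handle can be closed and reopened."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def write(self, message):
        self.handle.write(message + "\n")
        self.handle.flush()

    def reopen(self, signum=None, frame=None):
        # Close the (possibly renamed) file and open the original path again.
        self.handle.close()
        self.handle = open(self.path, "a")

def install_sighup_reopen(log):
    # SIGHUP is only available on unix-style systems.
    signal.signal(signal.SIGHUP, log.reopen)
```

After a rotation tool renames the current log, sending SIGHUP to the process (for example with "kill -HUP") causes later writes to go to a newly created file at the original path.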
Bug Fixes
- An intermittent problem spooling new rendering jobs to Alfred
from within Maya on Mac OS X has been fixed.
- A problem which prevented Alfred from being able to open
"../resources/alfred.brt" on Mac OS X when installed remotely on
a case-sensitive file system has been addressed.
- An issue with unicast metrics delivery interruption from Windows
alfservers that occurred when the maitre-d shut down has been fixed.
Note that the full fix for this problem requires an updated
alfserver.exe (12.5.2 and beyond).
- A problem causing alfserver to sometimes repeatedly request
alfsite.ini from the maitre-d has been fixed.
- An issue with the envkey settings in alfserver.ini on Windows
that prevented establishing a correct RATTREE has been fixed.
- Job queries by hostname on the BatCave's main page now execute properly.
- A potentially serious bug has been fixed regarding the way in
which very long metrics definitions are buffered during transmission.
Enhancements
- Threaded metrics processing: The Alfred maitre-d can now be configured to handle inbound metrics reports from Alfservers using a separate thread within the maitre-d process. See the alfred.ini setting "metricsReceiveThreaded" for the configuration information. This option should be considered at sites where the maitre-d's slot assignment throughput is diminished by large numbers of inbound metrics reports. (NOTE: This is for Linux and OS X only.)
- New low-latency pings have been implemented for the Alfred maitre-d. This allows sites that are using metrics to rely on them as preflights for costly pings. See the Low-Latency Pings discussion for details.
Bug Fixes
- A bug in the "maitre-d initializing" wait period has been fixed. An incorrect conditional test allowed some slot assignments to occur before the entire initial-wait period had expired.
- A bug that caused the Alfred maitre-d to crash due to a dispatcher submitting "reconstruct current state" messages before disconnect messages from the prior instance of that dispatcher were fully processed has been fixed.