A zombie is a running job that fails authentication when communicating with the ecflow_server.
How are zombies created?
There is a wide variety of reasons why a zombie is created.
The most common causes are due to user action:
- The node tree is deleted, replaced or reloaded whilst jobs are running
- A task is rerun whilst in a submitted or active state
- A job is forced to a new state, e.g. complete
Rarer causes might be:
How can zombies be handled?
The default behaviour for the init, complete, abort and wait child commands is to block the job, and for the event, label and meter child commands to continue (fob). With fob, the task no longer blocks, but the server will not change the event, meter or label.
There are two environment variables that control how long ecflow_client waits when trying to connect to the server.
- ECF_TIMEOUT This defines the maximum time the client will wait for any child command. It is specified in seconds. The default value is 24 hours. See ecflow_client.
- ECF_ZOMBIE_TIMEOUT This is applied to zombies only. It is specified in seconds. The default value is 12 hours. It applies to each init, abort and complete child command issued by a zombie in the script.
When either of the above timeouts is exceeded, ecflow_client exits with a failure. Depending on your script, this can be caught by a trap,
which will typically call the abort child command; this in turn can wait for 12/24 hours before the process exits.
Hence it is worth considering whether this is appropriate behaviour for your system.
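The cost of the defaults can be estimated with a little arithmetic. The sketch below assumes that, for a zombie, both the blocked child command and the abort called from the trap are governed by ECF_ZOMBIE_TIMEOUT; the variable names other than the two documented ones are illustrative only.

```shell
# Defaults quoted in the text, converted to seconds.
ECF_TIMEOUT_DEFAULT=$(( 24 * 60 * 60 ))          # 24 hours
ECF_ZOMBIE_TIMEOUT_DEFAULT=$(( 12 * 60 * 60 ))   # 12 hours

# Assumed worst case for a zombie: the blocked child command waits for the
# zombie timeout and fails, then the abort called from the trap waits again.
worst_case=$(( ECF_ZOMBIE_TIMEOUT_DEFAULT * 2 ))
worst_case_hours=$(( worst_case / 3600 ))

echo "worst-case zombie wait with defaults: ${worst_case_hours}h"
```

Under these assumptions a single zombie job could occupy its process for about a day, which is why shorter timeouts may suit some systems better.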
The jobs can also be configured so that, if the server denies the communication, the child command fails immediately instead of waiting.
(This can be done by setting the environment variable ECF_DENIED in your scripts. See
ecflow_client.)
This can be useful to detect network issues early.
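As a minimal sketch, the variable could be exported from a job header such as head.h. This assumes that ECF_DENIED only needs to be set to take effect and that its value is not significant; the value 1 is an arbitrary choice, so check the ecflow_client help for your version.

```shell
# Assumption: any non-empty value enables the behaviour; "1" is arbitrary.
export ECF_DENIED=1
```

With many tasks, failing early like this may mean more manual checking of job status, so it is probably best tried on a small suite first.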
ecflow_ui provides a tab which lists all the zombies and the actions that can be taken.
The zombies tab is shown in the info panel when the server node (i.e. the topmost node) is selected.
The actions include:
Of the four actions above, only Rescue will allow the child command to change the state of the node tree.
What to do
- Create a zombie by starting a task and setting it to complete immediately via ecflow_ui
- Inspect the log file; it will show you how the zombie arose.
- Inspect the zombie tab in ecflow_ui (select the host node, then select the zombies tab)
- Experiment with the different actions on the zombie
Since the default ECF_ZOMBIE_TIMEOUT is 12 hours, change it to 1 minute by editing your head.h:
export ECF_ZOMBIE_TIMEOUT=60 # specified in seconds