...

Horizontal Navigation Bar


Button Group

Button Hyperlink
titlePrevious
typestandard
urlhttps://confluence.ecmwf.int/display/ECFLOW/Introducing+Zombies
Button Hyperlink
titleUp
typestandard
urlhttps://softwareconfluence.ecmwf.int/wiki/display/ECFLOW/Advanced+Topics
Button Hyperlink
titleNext
typestandard
urlhttps://confluence.ecmwf.int/display/ECFLOW/Exercises


Manual

When zombies arise they can be handled manually with ecflow_ui (see Zombie) or via the command-line interface:

Code Block
languagebash
titlezombie commands
ecflow_client --zombie_get                # List all the tasks/jobs the server thinks are zombies.
ecflow_client --zombie_kill=<task-path>   # Ask the server to kill the zombie process. Uses ECF_KILL_CMD.
ecflow_client --zombie_fail=<task-path>   # Ask the zombie to fail. This may result in another zombie, because the abort child command in the job will be called.
ecflow_client --zombie_fob=<task-path>    # Unblock the child, allowing the job to proceed. This only works for zombies where the password does not match.
ecflow_client --zombie_adopt=<task-path>  # Copy the password stored on the zombie onto the task. Allows the job to proceed and update the state in the server
                                          # (i.e. due to init, complete, abort).
                                          # It is up to the user to ensure that the zombie has been dealt with before doing this.
ecflow_client --zombie_remove=<task-path> # Remove the zombie representation in the server. Typically done once we are sure we have handled the zombie.
                                          # If that is not the case, the zombie will re-appear the next time it communicates with the server.
ecflow_client --zombie_block=<task-path>  # Ask the job to block at the child command. Prevents the job from proceeding.
                                          # (This is the default behaviour for the init, complete and abort child commands.)
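
When these commands are driven from a script, it can help to assemble them in one place. The following is a minimal sketch only: `zombie_command` and the path `/s1/f1/t1` are hypothetical names introduced for illustration, and the helper merely builds the argument list for the options listed above without contacting a server.

```python
# Hypothetical helper: builds (but does not run) ecflow_client zombie commands.
# The action names mirror the --zombie_* options listed above.
ZOMBIE_ACTIONS = ("get", "kill", "fail", "fob", "adopt", "remove", "block")


def zombie_command(action, task_path=None):
    """Return the argument list for one ecflow_client zombie command."""
    if action not in ZOMBIE_ACTIONS:
        raise ValueError(f"unknown zombie action: {action!r}")
    if action == "get":
        # --zombie_get lists all zombies and takes no task path.
        if task_path is not None:
            raise ValueError("--zombie_get takes no task path")
        return ["ecflow_client", "--zombie_get"]
    if not task_path or not task_path.startswith("/"):
        raise ValueError("an absolute task path is required")
    return ["ecflow_client", f"--zombie_{action}={task_path}"]


# One possible manual sequence: inspect, kill the stray process, then
# remove the zombie entry once it has been dealt with.
plan = [
    zombie_command("get"),
    zombie_command("kill", "/s1/f1/t1"),    # /s1/f1/t1 is a made-up path
    zombie_command("remove", "/s1/f1/t1"),
]
for argv in plan:
    print(" ".join(argv))
```

To actually run a command, each list could be passed to `subprocess.run(argv, check=True)` on a host where `ecflow_client` is installed and configured.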

...

ecflow_client --zombie_adopt=<task-path> will not allow this, due to the potential for data corruption.

In this case, the normal behaviour would be to kill both processes and re-queue the task.


In the extreme, we can bypass the authentication (i.e. allow the request to be handled by the server).

...

After the job has completed, be sure to delete this variable. Otherwise, if zombies arise again, there is a considerable risk of data corruption.

...

It is also possible to ask ecflow_server to make the same response in an automated fashion. However, this needs very careful consideration first, because an automated response could mask a serious underlying problem.

The automated response can be defined statically using the Python or text interface, or dynamically (add/remove) via alter:

  • Python interface (see ecflow.ZombieAttr)

  • text interface (see Definition file Grammar)

    zombie             ::=  "zombie" >> `zombie_type` >> ":" >> !(`client_side_action` | `server_side_action`) >> ":" >> *`child` >> ":" >> !`zombie_life_time`
    zombie_type        ::=  "user" | "ecf" | "path" | "ecf_pid" | "ecf_passwd" | "ecf_pid_passwd"
    child              ::=  "init" | "event" | "meter" | "label" | "wait" | "abort" | "complete" | "queue"
    client_side_action ::=  "fob" | "fail" | "block"
    server_side_action ::=  "adopt" | "delete" | "kill"
    zombie_life_time   ::=  unsigned integer (default: user(300), ecf(3600), path(900)); the server poll timer runs every 60 seconds, hence this is the effective minimum value
    Where:

       ecf_pid        - PID mismatch, password matches. Job scheduled twice. Check the submitter.
       ecf_pid_passwd - Both PID and password mismatch. Re-queue and submit of the active job?
       ecf_passwd     - Password mismatch, PID matches. System has recycled the PID, or the job file was hacked?
       ecf            - Two init commands, or the task is complete or aborted but receives another child command.
       user           - Created by user action.
       path           - Task not found. Nodes replaced whilst jobs were running.


  • --alter command (dynamic)
         ecflow_client --alter add zombie <zombie-attribute> <path>
         ecflow_client --alter delete zombie <ecf | path | user> <path>
    Note, however, that the effect will only be seen when the child command next attempts to communicate with the server.
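
The grammar and the --alter usage above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the ecFlow API: the function names are hypothetical, joining several child commands with commas is an assumption (the grammar shown does not state a separator), and `effective_life_time` reads "effective minimum" as a floor of 60 seconds.

```python
# Token sets copied from the grammar above.
ZOMBIE_TYPES = {"user", "ecf", "path", "ecf_pid", "ecf_passwd", "ecf_pid_passwd"}
CHILDREN = {"init", "event", "meter", "label", "wait", "abort", "complete", "queue"}
ACTIONS = {"fob", "fail", "block", "adopt", "delete", "kill"}
POLL_SECONDS = 60  # server poll interval stated in the grammar notes


def zombie_attribute(zombie_type, action=None, children=(), life_time=None):
    """Format '<type>:<action>:<children>:<life_time>'; optional parts stay empty."""
    if zombie_type not in ZOMBIE_TYPES:
        raise ValueError(f"unknown zombie type: {zombie_type!r}")
    if action is not None and action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    for child in children:
        if child not in CHILDREN:
            raise ValueError(f"unknown child command: {child!r}")
    if life_time is not None and (not isinstance(life_time, int) or life_time < 0):
        raise ValueError("life_time must be an unsigned integer")
    fields = [
        zombie_type,
        action or "",
        ",".join(children),  # ASSUMPTION: comma-separated child list
        "" if life_time is None else str(life_time),
    ]
    return ":".join(fields)


def effective_life_time(life_time):
    """A life time below the poll interval cannot take effect any sooner."""
    return max(life_time, POLL_SECONDS)


def alter_add(attribute, path):
    """Argument list for: ecflow_client --alter add zombie <zombie-attribute> <path>."""
    return ["ecflow_client", "--alter", "add", "zombie", attribute, path]


def alter_delete(zombie_type, path):
    """Argument list for: ecflow_client --alter delete zombie <ecf | path | user> <path>."""
    if zombie_type not in {"ecf", "path", "user"}:
        raise ValueError("delete accepts only ecf, path or user")
    return ["ecflow_client", "--alter", "delete", "zombie", zombie_type, path]


attr = zombie_attribute("ecf", "fob", ["label"])
print(" ".join(alter_add(attr, "/s1")))
print(" ".join(alter_delete("ecf", "/s1")))
```

Validating the attribute before sending it keeps typos in the type, action or child names from reaching the server; the generated lists can then be run by whatever mechanism the site uses to call `ecflow_client`.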

Zombie attributes are inherited in the same manner as variables.
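
As a rough illustration of that inheritance, the lookup below walks from a task up through its parents and returns the first attribute it finds. It is a simplified sketch: the function name, the node paths and the attribute strings are made up, and it assumes at most one zombie attribute per node, whereas a real node may carry one per zombie type.

```python
def inherited_zombie_attribute(node_path, attributes):
    """Walk from node_path up towards the suite; the nearest definition wins."""
    parts = node_path.strip("/").split("/")
    while parts:
        candidate = "/" + "/".join(parts)
        if candidate in attributes:
            return attributes[candidate]
        parts.pop()  # move up one level: task -> family -> suite
    return None  # nothing defined on the node or any of its parents


# Made-up definitions: one on the suite, one overriding it on a family.
attributes = {
    "/s1": "ecf:fob:label:",
    "/s1/f1": "ecf:fail:label:",
}

print(inherited_zombie_attribute("/s1/f1/t1", attributes))  # nearest definition is on /s1/f1
print(inherited_zombie_attribute("/s1/f2/t2", attributes))  # falls back to the suite /s1
print(inherited_zombie_attribute("/s2/t1", attributes))     # no definition anywhere
```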

Example: for tasks under suite “s1”, add a zombie attribute such that child label commands (i.e. ecflow_client --label) never block the job (not strictly needed, as this is the default behaviour):

...



...