Manual

When zombies arise, they can be handled manually with ecflowview (see Zombie) or via the command line interface:

  • ecflow_client --zombie_get
  • ecflow_client --zombie_fail <task-path>
  • ecflow_client --zombie_fob <task-path>
  • ecflow_client --zombie_adopt <task-path>
  • ecflow_client --zombie_remove <task-path>
  • ecflow_client --zombie_block <task-path>

Automated

It is also possible to ask ecflow_server to make the same response in an automated fashion. However, this should be considered very carefully beforehand, since an automated response could mask a serious underlying problem.

The automated response can be defined statically, using the python or text interface, or dynamically (add/remove) via alter:

  • python interface (see ecflow.ZombieAttr)

  • text interface (see Definition file Grammar)

    zombie             ::=  "zombie" >> `zombie_type` >> ":" >> !(`client_side_action` | `server_side_action`) >> ":" >> *`child` >> ":" >> !`zombie_life_time`
    zombie_type        ::=  "user" | "ecf" | "path"
    child              ::=  "init" | "event" | "meter" | "label" | "wait" | "abort" | "complete"
    client_side_action ::=  "fob" | "fail" | "block"
    server_side_action ::=  "adopt" | "delete"
    zombie_life_time   ::=  unsigned integer  ( default:  user(300), ecf(3600), path(900)  )
  • --alter command (dynamic)
         ecflow_client --alter add zombie <zombie-attribute>  <path>
         ecflow_client --alter delete zombie <ecf | path | user>  <path>
    Note, however, that the effect will only be seen when the child command next attempts to communicate with the server.
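The grammar above can be checked with a few lines of python before a string is passed to --alter. The following is only a minimal sketch: parse_zombie_attr and ZombieSpec are hypothetical names, not part of ecflow. It assumes that child commands are comma-separated (as in the examples on this page) and that an empty lifetime falls back to the defaults quoted in the grammar. The "kill" action is not listed in the grammar but is accepted here because examples on this page use it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Zombie type -> default lifetime in seconds, as quoted in the grammar above.
ZOMBIE_TYPES = {"user": 300, "ecf": 3600, "path": 900}
CHILD_CMDS = {"init", "event", "meter", "label", "wait", "abort", "complete"}
CLIENT_ACTIONS = {"fob", "fail", "block"}
# "kill" is an assumption: absent from the grammar, but used in examples on this page.
SERVER_ACTIONS = {"adopt", "delete", "kill"}


@dataclass(frozen=True)
class ZombieSpec:
    zombie_type: str
    action: Optional[str]       # None when the action field is empty
    children: Tuple[str, ...]   # empty tuple means "all child commands"
    lifetime: int               # seconds


def parse_zombie_attr(text: str) -> ZombieSpec:
    """Parse '<type>:<action>:<child,child,...>:<lifetime>' and validate each field."""
    fields = [f.strip() for f in text.strip().split(":")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 ':'-separated fields in {text!r}")
    ztype, action, children, lifetime = fields
    if ztype not in ZOMBIE_TYPES:
        raise ValueError(f"unknown zombie type {ztype!r}")
    if action and action not in CLIENT_ACTIONS | SERVER_ACTIONS:
        raise ValueError(f"unknown action {action!r}")
    child_list = tuple(c.strip() for c in children.split(",") if c.strip())
    unknown = [c for c in child_list if c not in CHILD_CMDS]
    if unknown:
        raise ValueError(f"unknown child command(s) {unknown}")
    if lifetime and not lifetime.isdigit():
        raise ValueError(f"lifetime must be an unsigned integer, got {lifetime!r}")
    seconds = int(lifetime) if lifetime else ZOMBIE_TYPES[ztype]
    return ZombieSpec(ztype, action or None, child_list, seconds)


print(parse_zombie_attr("ecf:fob:label,event,meter:"))
```

Rejecting a malformed string locally is cheaper than discovering the mistake only when the next child command reaches the server.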

The zombie attribute is inherited in the same manner as Variable inheritance.
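The inheritance rule can be pictured as a walk from the task up towards the suite, stopping at the first node that defines an attribute of the requested type. The sketch below illustrates that lookup only and is not how ecflow_server implements it. The function name effective_zombie_attr and the dictionary layout are hypothetical, and it assumes that the nearest definition wins, as with variables.

```python
from typing import Dict, Optional


def effective_zombie_attr(node_path: str,
                          attrs: Dict[str, Dict[str, str]],
                          zombie_type: str) -> Optional[str]:
    """Return the attribute of zombie_type set on the node itself or its nearest ancestor."""
    path = node_path.rstrip("/")
    while path:
        found = attrs.get(path, {}).get(zombie_type)
        if found is not None:
            return found
        # Step up one level: "/s1/critical/t1" -> "/s1/critical" -> "/s1" -> ""
        path = path.rsplit("/", 1)[0]
    return None


# Hypothetical definition: one attribute on the suite, an override on a family.
attrs = {
    "/s1": {"ecf": "ecf:fob:label:"},
    "/s1/critical": {"ecf": "ecf:fail::"},
}
print(effective_zombie_attr("/s1/critical/t1", attrs, "ecf"))
print(effective_zombie_attr("/s1/other/t2", attrs, "ecf"))
```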

Example: For tasks under suite “s1”, add a zombie attribute so that child label commands (i.e. ecflow_client --label) never block the job. (Not strictly needed, as this is the default behaviour from release 4.0.5 onwards.)

  • python

    import ecflow
    from ecflow import ChildCmdType, ZombieAttr, ZombieType, ZombieUserActionType

    s1 = ecflow.Suite('s1')
    child_list = [ ChildCmdType.label ]
    zombie_attr = ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fob, 300)
    s1.add_zombie(zombie_attr)
    
  • text

    suite s1
       zombie ecf:fob:label:
  • alter
         ecflow_client --alter add zombie "ecf:fob:label:"  /s1

Example: For tasks under suite “s1”, add a zombie attribute so that a job issuing the child commands (event, meter, label) never blocks. (Not strictly needed, as this is the default behaviour from release 4.0.5 onwards.)

  • python

    import ecflow
    from ecflow import ChildCmdType, ZombieAttr, ZombieType, ZombieUserActionType

    s1 = ecflow.Suite('s1')
    child_list = [ ChildCmdType.label, ChildCmdType.event, ChildCmdType.meter ]
    zombie_attr = ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fob, 300)
    s1.add_zombie(zombie_attr)
    
  • text

    suite s1
       zombie ecf:fob:label,event,meter:
  • alter
         ecflow_client --alter add zombie "ecf:fob:label,event,meter:"  /s1

Example: For all tasks under family “critical”, if any zombies arise then fail the job:

  • python

    import ecflow
    from ecflow import ZombieAttr, ZombieType, ZombieUserActionType

    with ecflow.Suite('s1') as s1:
       with s1.add_family("critical") as crit:
          child_list = [ ]  # empty child list means apply to all child commands
          crit.add_zombie(ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fail, 300))
          crit.add_zombie(ZombieAttr(ZombieType.path, child_list, ZombieUserActionType.fail, 300))
          crit.add_zombie(ZombieAttr(ZombieType.user, child_list, ZombieUserActionType.fail, 300))
    
  • text

       suite s1
         family critical
           zombie ecf:fail::
           zombie path:fail::
           zombie user:fail::
  • alter
        ecflow_client --alter add zombie "ecf:fail::"   /s1/critical
        ecflow_client --alter add zombie "path:fail::"  /s1/critical
        ecflow_client --alter add zombie "user:fail::"  /s1/critical

 

Here are some further examples of using --alter:

  • ecflow_client --alter add zombie "ecf:fob::"   /suiteX          # fob the (init, event, meter, label, abort, complete) child commands. This prevents zombies from blocking the script. Use with great care.
  • ecflow_client --alter add zombie "ecf:fail::"   /suiteY          # fail the script straight away for any child command in the job file.

You can only add one zombie attribute of each type (ecf, path, user).

To delete a zombie attribute, please use one of:

  • ecflow_client --alter delete zombie ecf     /suiteX
  • ecflow_client --alter delete zombie path   /suiteX
  • ecflow_client --alter delete zombie user  /suiteX
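When these commands are issued from a script, it can help to build the argument lists in one place. The helpers below are a minimal sketch: alter_add_zombie and alter_delete_zombie are hypothetical names, and they only assemble the command forms shown on this page without contacting a server.

```python
import shlex
from typing import List

ZOMBIE_TYPES = ("ecf", "path", "user")


def alter_add_zombie(attr: str, path: str) -> List[str]:
    """Argument list for: ecflow_client --alter add zombie <zombie-attribute> <path>"""
    zombie_type = attr.split(":", 1)[0]
    if zombie_type not in ZOMBIE_TYPES:
        raise ValueError(f"attribute must start with one of {ZOMBIE_TYPES}: {attr!r}")
    return ["ecflow_client", "--alter", "add", "zombie", attr, path]


def alter_delete_zombie(zombie_type: str, path: str) -> List[str]:
    """Argument list for: ecflow_client --alter delete zombie <ecf | path | user> <path>"""
    if zombie_type not in ZOMBIE_TYPES:
        raise ValueError(f"zombie type must be one of {ZOMBIE_TYPES}: {zombie_type!r}")
    return ["ecflow_client", "--alter", "delete", "zombie", zombie_type, path]


# Print the commands for review; run them with subprocess.run(cmd, check=True) if desired.
for cmd in (alter_add_zombie("ecf:fail::", "/suiteY"),
            alter_delete_zombie("ecf", "/suiteX")):
    print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems with the ":" separated attribute.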

Here are some more examples:

  • Add a zombie attribute that kills the zombie process automatically when an init or complete child command is received by the server. This uses whatever is defined for ECF_KILL_CMD.

       ecflow_client --alter add zombie "ecf:kill:init,complete:" /suiteZ

  • Add a zombie attribute that automatically kills zombie processes created by user action.

       ecflow_client --alter add zombie "user:kill::" /suiteZ

  • Add a zombie attribute that adopts all child complete zombies.

       ecflow_client --alter add zombie "ecf:adopt:complete:" /suiteZ

Semi-Automated

Sometimes zombies arise for more obscure reasons. For example, the job sends an --init message to the server while the server is busy (e.g. processing jobs). By the time the server makes the task active and replies to the client/job, the ecflow_client has timed out. This causes the ecflow_client to send the same message again. This time, however, the server treats the child command as a zombie, since the task is already active. Hence we get these false zombies.

These scenarios are very rare, but tend to occur in the following situations:

  • High disk latency (e.g. checkpointing takes a long time, or job processing takes too long). This typically happens when using virtual machines with non-local data.
  • Very large scripts (i.e. in the megabytes). These can inflate the server's memory and cause job processing to take longer.
  • Extremely large definitions that are requested by many users via the GUI. (The download size can be reduced by requesting only the suites you are interested in.)
  • A very busy machine and/or not enough available memory (i.e. the server is competing for resources).
  • An overloaded server. (This can be visualised if gnuplot is installed and available on $PATH, by invoking ecflow_client --server_load=<path to the log file>.)

To diagnose these cases, we need to look at the log file. Typically you will see two or more child commands (--init/complete), where the second will then be treated as a zombie.
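The pattern described above (the same init or complete child command arriving twice in a row for one task) can be flagged mechanically once the child commands have been extracted from the log. The sketch below assumes that extraction has already been done and works on (child command, task path) pairs in log order; the log line format itself is not described on this page, so no parsing is attempted. The function name repeated_child_commands and the sample events are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (child command, task path), in the order seen in the log


def repeated_child_commands(events: Iterable[Event],
                            watch: Tuple[str, ...] = ("init", "complete")) -> List[Event]:
    """Return events that repeat the previous child command seen for the same task."""
    last_cmd: Dict[str, str] = {}
    repeats: List[Event] = []
    for cmd, path in events:
        # Only init/complete are watched: repeated label or meter commands are normal.
        if cmd in watch and last_cmd.get(path) == cmd:
            repeats.append((cmd, path))
        last_cmd[path] = cmd
    return repeats


# Hypothetical sequence: /s/t1 sends init twice, /s/t2 behaves normally.
events = [
    ("init", "/s/t1"),
    ("init", "/s/t1"),
    ("complete", "/s/t1"),
    ("init", "/s/t2"),
    ("complete", "/s/t2"),
]
print(repeated_child_commands(events))
```

A task flagged this way is only a candidate for a false zombie; the timestamps in the log still need to be checked against the server load at that moment.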

To get round this issue, you can add the variable ECF_NONSTRICT_ZOMBIES, which reduces these false zombies.

       ecflow_client --alter add variable ECF_NONSTRICT_ZOMBIES 1 /              # adds the variable at the root/server level, and hence affects all suites on the server

       ecflow_client --alter add variable ECF_NONSTRICT_ZOMBIES 1 /suiteX      # adds the variable at the suite level, and hence only affects this suite

 

 

 

 
