...
CMD variables
The CMD variables must be set so that a job can be submitted, killed, and queried both locally and remotely. They are:
on the server side:
ECF_JOB_CMD:
edit ECF_JOB_CMD '%ECF_JOB% > %ECF_JOBOUT% 2>&1'
edit ECF_JOB_CMD 'rsh %ECF_JOB% > %ECF_JOBOUT% 2>&1'
ECF_KILL_CMD:
edit ECF_KILL_CMD 'kill -2 %ECF_RID% && kill -15 %ECF_RID%'
ECF_STATUS_CMD:
edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f'
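The kill command above sends signal 2 and then signal 15 with no pause between them. As an illustration only, a kill helper could leave a grace period between the two signals so that the job has time to react to the first one. The function name `graceful_kill` and the default grace period are assumptions, not part of ecFlow:

```shell
#!/bin/sh
# Hypothetical kill helper: send signal 2 first, then signal 15
# if the process is still alive after a grace period.
graceful_kill() {
    pid=$1
    grace=${2:-5}                            # seconds to wait; assumed default
    kill -2 "$pid" 2>/dev/null || return 0   # process already gone: nothing to do
    sleep "$grace"
    # Still alive after the grace period: terminate it.
    if kill -0 "$pid" 2>/dev/null; then
        kill -15 "$pid" 2>/dev/null
    fi
    return 0
}
```

Such a helper would be called with the process id that the server substitutes for %ECF_RID%.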
on the client side:
ECF_CHECK_CMD:
edit ECF_CHECK_CMD 'ps --sid %ECF_RID% -f'
ECF_URL_CMD (for HTML man pages for tasks, plot displays, and product arrival HTML pages):
edit URLBASE https://software.ecmwf.int/wiki/display/
edit URL ECFLOW/Home
edit ECF_URL_CMD '${BROWSER:=firefox} -remote "openURL(%URLBASE%/%URL%)"'
Alternatively, a script may be responsible for job submission, kill, and query. At ECMWF, we use a submit script that tunes the generated job file to the remote destination. This script:
- translates queuing system directives to the expected syntax,
- tunes the submission timeout according to the submitting user and the remote destination,
- uses a submission utility suited to the remote system, or to the way we want the job to be submitted there: nohup, standalone, rsh, ssh, ecrcmd,
- keeps track of the remote queuing id given to the job by storing it in a ".sub" file, which may be used later by the kill and query commands,
- handles frequent or specific submission errors: a job may have been accepted even if the submission command reports an error, in which case the error shall not be reported to the server.
Example:
edit ECF_JOB_CMD '$HOME/bin/ecf_submit %USER% %HOST% %ECF_JOB% %ECF_JOBOUT%'
edit ECF_KILL_CMD '$HOME/bin/ecf_kill %USER% %HOST% %ECF_RID% %ECF_JOB%'
edit ECF_STATUS_CMD '$HOME/bin/ecf_status %USER% %HOST% %ECF_RID% %ECF_JOB%'
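Some of the submit script's responsibilities can be sketched as small shell functions. This is a minimal sketch, not the ECMWF script: the function names, the host patterns, and the rule that an obtained queuing id means the job was accepted are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical building blocks for a submit wrapper.

# Choose a submission method from the destination host name.
# The host patterns are illustrative only.
choose_method() {
    case "$1" in
        localhost) echo "nohup" ;;
        *)         echo "ssh" ;;
    esac
}

# Record the remote queuing id next to the job file, in "<job>.sub",
# so that later kill and query commands can find it.
save_queue_id() {
    job_file=$1
    queue_id=$2
    printf '%s\n' "$queue_id" > "${job_file}.sub"
}

# Read the id back, as a kill or status script would.
load_queue_id() {
    cat "${1}.sub"
}

# Treat a submission as successful when a queuing id was obtained,
# even if the submission command returned a non-zero status (assumed rule).
submission_ok() {
    status=$1
    queue_id=$2
    [ "$status" -eq 0 ] || [ -n "$queue_id" ]
}
```

Keeping the queuing id in a file beside the job file means the kill and status scripts only need the %ECF_JOB% path to locate it.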
Remote job submission requires the server administrator, or the suite designer, to work with the system administration team in order to decide on:
- shared, mounted, or local file systems, according to the best choice for the local network topology,
- main submission schemes (rsh, ssh),
- alternative submission schemes (we may use nicknames to distinguish direct job submission from submission through a queuing system on the same host),
- fall-back schemes (when the c2a node is not available, c2a-batch is used as an alternative),
- the best way to handle a cluster switch (from c2a to c2b): a variable on the top node, multiple variables among the suites, a shell variable, or even a one-line switch in the submit script,
- how to handle a remote storage switch (from /s2o1 to /s22o): a server variable or a shell variable in the jobs,
- submission time-outs,
- notification before killing a job (sending a kill -2 signal), to give the job a chance to send the abort command.
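The fall-back scheme in the list above could be implemented in a submit script along these lines. This is only a sketch: `pick_host` and the probe hook are hypothetical names, and how a host is tested for availability is left to the caller.

```shell
#!/bin/sh
# Return the first host for which the probe command succeeds.
# The probe is a caller-supplied command or function (hypothetical hook);
# a real setup would decide with the system administrators how to test a host.
pick_host() {
    probe=$1
    shift
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            echo "$host"
            return 0
        fi
    done
    return 1    # no host available
}
```

A call such as `pick_host my_probe c2a c2a-batch`, where `my_probe` is a site-specific check, prints c2a when it is reachable and c2a-batch otherwise.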
...