
Only available on the EUMETSAT side. This service is currently in beta testing and is provided on a best-effort basis for testing purposes; keep this in mind when trying it.


Introduction

Many problems in the Earth Observation and modelling communities require a common processing algorithm to be applied independently to thousands (or millions!) of pieces of input data. Running such workloads across many processing nodes is called "High Throughput Computing" (HTC), as opposed to "High Performance Computing" (HPC), which concentrates on running large jobs that will not fit on a single machine across a pool of processing nodes, typically using MPI.

EWC provides a common solution for HTC batch processing, using HTCondor.  The major advantage of this approach is that it provides a centrally-managed system where users can take advantage of a much larger pool of resources than they have themselves.  The resources come from tenants contributing their spare resources for the common good, and additional spare resources from EWC that are also made available for anyone to use. 

HTCondor is a specialized batch system for managing compute-intensive jobs. It provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications.

Users submit their compute jobs to HTCondor, which puts the jobs in a queue, runs them, and then informs the user of the result.

Of course, any tenant can install their own batch processing systems for their own purposes with their own resources, but will not be able to take advantage of other shared resources in a centrally organised way.


General

EWC HTCondor is a managed service. The central manager node is deployed in a tenancy on the EWC. Users can join the existing pool by adding execute and submit nodes.

Some features of HTCondor in EWC:

  • Maintenance: centrally managed tenancy, easy 'one click' deployment
  • Deployment: multi-tenancy
  • Resource: nodes join the main HTCondor pool automatically; no password or configuration is needed, you only choose the plan for the machine you want to add
  • Usage: easy 'one click' deployment, simple examples for running a job with the docker universe
  • Network: VPN, which allows processing nodes in a tenancy to communicate with the scheduler / master nodes
  • Scheduling: single schedulers in each tenancy, no possibility to erase other tenancies' jobs

Execute nodes

  • No access to execute host for containers
  • No access to other containers running on execute node
  • Isolated environment for containers
  • No autoscaling
  • No NFS

Submit nodes

  • Only docker universe allowed
  • Only condor_submit command allowed
  • Private network in the tenancy enabled to allow access to tenancy-internal resources/files
  • Condor transfer mechanism allowed

Deploy HTCondor nodes

Pre-requisite

Before deploying an HTCondor node, you need to create an HTCondor-specific security group. See the page Creating Security Groups for instructions on how to create security groups.

  • htcondor security group with the following rules:

Rule name | Direction | Rule Type   | Protocol | Port Range | Source Type | Source        | Destination Type
          | egress    | Custom Rule | TCP      |            | All         |               | Instance
          | egress    | Custom Rule | UDP      |            | All         |               | Instance
9618-tcp  | ingress   | Custom Rule | TCP      | 9618       | Network     | 100.64.0.0/10 | Instance
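
The ingress rule only admits TCP traffic on port 9618 from sources inside 100.64.0.0/10, which spans 100.64.0.0 to 100.127.255.255. As a rough illustration of what that range covers, the sketch below checks whether an IPv4 address falls inside it. The helper names ip_to_int and in_htcondor_range are hypothetical and not part of HTCondor or the EWC tooling; the addresses tested are arbitrary examples.

```shell
#!/bin/sh
# Sketch only: ip_to_int and in_htcondor_range are hypothetical helpers.

# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on "." into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if the address lies inside 100.64.0.0/10.
in_htcondor_range() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "100.64.0.0")
    # A /10 prefix keeps the top 10 bits and leaves 22 host bits.
    mask=$(( (1 << 32) - (1 << 22) ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

if in_htcondor_range 100.64.12.7; then
    echo "100.64.12.7 is inside 100.64.0.0/10"
fi
if ! in_htcondor_range 100.128.0.1; then
    echo "100.128.0.1 is outside 100.64.0.0/10"
fi
```

A node whose VPN address is outside this range would not match the 9618-tcp rule as listed in the table.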


Deploy execute or submit node

  1. Go to Provisioning → Instances and click on Add+ to add a new instance.
  2. Select Htcondor Submit/Execute node.
  3. Fill in the required data:

    • plan: choose your plan
    • network: private
    • security group: htcondor, plus ssh (for submit nodes only)

  4. Finalize the provisioning steps.


Once the submit node is up:

  • ssh into your machine
  • create a simple job description file, for example dockertest.sub:
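# dockertest.sub -- example docker job

# Run the job inside a container started from the debian image
universe                = docker
docker_image            = debian

# Command run in the container: print /etc/hosts
executable              = /bin/cat
arguments               = /etc/hosts

# Use the HTCondor file transfer mechanism; return output when the job exits
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# $(Process) expands to 0..99, so each job writes its own files.
# The log/, output/ and error/ directories must exist before submitting.
log                     = log/job_$(Process)_sleep.log
output                  = output/job_$(Process)_output.txt
error                   = error/job_$(Process)_errors.txt

# Resources requested for each job
request_cpus   = 1
request_memory = 1024M
request_disk   = 10240K

# Queue 100 instances of this job
queue 100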
  • submit the job with condor_submit <submit_file>, for example condor_submit dockertest.sub
  • verify that the jobs are queued or running, using the condor_q command
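
The steps above can be sketched as a small shell script. It assumes that dockertest.sub sits in the current directory and that the directories named in the submit file (log/, output/ and error/) have to be created by the user, since HTCondor generally expects them to exist already. The guard around condor_submit is only there so that the sketch does nothing harmful on a machine without HTCondor.

```shell
#!/bin/sh
# Sketch only: prepare the working directory for dockertest.sub and submit it.
# Assumption: dockertest.sub is in the current directory.

# Create the directories referenced by the log, output and error entries.
mkdir -p log output error

if command -v condor_submit >/dev/null 2>&1 && [ -f dockertest.sub ]; then
    # Queue the jobs described in the submit file.
    condor_submit dockertest.sub
    # Show the state of the jobs in the queue.
    condor_q
else
    echo "condor_submit or dockertest.sub not found; only the directories were created"
fi
```

After the jobs finish, each of them should have left its own file under output/, named according to the $(Process) value it ran with.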


Once the execute node is up, you can check from a submit node that it appears in the pool by running condor_status.
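
As a rough sketch of that check, the snippet below looks for one machine name in the output of condor_status. The helper node_in_pool and the host name my-execute-node.example are hypothetical placeholders; replace the host name with the name your execute node reports. The sketch assumes the name is compared against the Machine attribute exactly as the pool advertises it.

```shell
#!/bin/sh
# Sketch only: node_in_pool is a hypothetical helper.
# It reads machine names on stdin (one per line) and succeeds
# if the name given as $1 matches one of them exactly.
node_in_pool() {
    grep -qxF "$1"
}

# Assumption: my-execute-node.example stands in for your node's name.
if command -v condor_status >/dev/null 2>&1; then
    # Print only the Machine attribute of each slot in the pool.
    if condor_status -autoformat Machine | node_in_pool "my-execute-node.example"; then
        echo "execute node has joined the pool"
    else
        echo "execute node not listed yet"
    fi
fi
```

A newly deployed node may take a short while to register, so a negative result right after provisioning is not necessarily an error.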

