Hi,
I have been building a simple 2 node "cluster" to play around with OpenIFS. Each node has 4 cores. I'm able to run OpenIFS (the T21 test) just fine on both nodes separately with 4 processes. I'm also able to invoke the executable on node 2 from node 1 using mpirun, so I'm confident that the MPI connection/network settings etc. are configured correctly and that both nodes can talk to each other.
However, when I try to run OpenIFS with 8 processes across both nodes, it hangs with no output - not even a NODE file. I've tried the solutions to the "" question in the FAQ but the problem remains. Are there any other common causes of this problem? When I Ctrl-C the executable I can see from the stack trace that it always seems to be stuck in SUMPINI, but I don't know which line. Also, I only get back 4 copies of the stack trace, not 8 as I would expect from an 8-process invocation.
Other details about the system that might be relevant:
- There is no shared storage yet. Both nodes have their own filesystems and OpenIFS is installed identically on both.
- Each node has ulimit -s unlimited set in the global bashrc, so I don't believe there are any memory issues. There would probably be a segfault if that were the case.
I'm running the executable with mpirun -np 8 --hostfile machinefile -x LD_LIBRARY_PATH, where machinefile contains the IP addresses of both nodes.
Any ideas?
Thanks!
5 Comments
Unknown User (nagc)
Hi Sam,
Shame the traceback doesn't give a line number. It would be useful to know whether SUMPINI has got past the CALL MPL_INIT and the c_drhook_init_signals() lines. You could try putting a write statement in (and a call to FLUSH immediately after the write to force the output) to see where the code has got to. My guess is it's stuck in the MPL_INIT call.
When the model starts up do you see 8 separate invocations of OpenIFS running; 4 on each node? If not, that suggests only one node has started the MPI tasks correctly. Are the filesystem pathnames the same on each node?
Maybe try writing a simple MPI program that does something trivial, like each task writing its task number to an output file, and then try running that on 8 cores to see if it initializes correctly?
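Something along these lines would do (just an untested sketch; compile with your MPI Fortran wrapper, e.g. mpif90, and launch it the same way you launch OpenIFS):
! Minimal MPI test: each task writes its rank to its own file.
PROGRAM MPITEST
USE MPI
IMPLICIT NONE
INTEGER :: IERR, IRANK, NPROC
CHARACTER(LEN=16) :: FNAME
CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, IRANK, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
WRITE(FNAME,'(A,I3.3,A)') 'task', IRANK, '.txt'
OPEN(UNIT=10, FILE=TRIM(FNAME), STATUS='REPLACE')
WRITE(10,*) 'Hello from task ', IRANK, ' of ', NPROC
CLOSE(10)
CALL MPI_FINALIZE(IERR)
END PROGRAM MPITEST
If all 8 task files appear (4 on each node's local disk, since you have no shared storage yet), the MPI layer itself is fine and the hang is somewhere in the model's own initialization.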
Cheers, Glenn
Sam Hatfield
Hi Glenn,
I should have mentioned that I do at least get the standard list of signal handlers installed by DrHook printed to the screen. Also, I can see 4 processes running on each node through top. Each one seems to spend about 99% of CPU doing a whole lot of nothing.
The filesystem pathnames are identical on both systems.
I'll have a go at adding some write statements in SUMPINI. Should I do something like:
USE YOMLUN, ONLY: NULOUT
WRITE(NULOUT,*) "Here"
CALL FLUSH(NULOUT)
?
I'll try your minimal MPI program as well.
Thanks,
Sam
Sam Hatfield
I've narrowed the problem down to this line (the call to MPI_CART_CREATE) in ifsaux/module/mpl_groups.f90:
Here are the values of the arguments (for 4 tasks across 2 nodes):
I printed numbers in various subroutines leading up to this call to MPI_CART_CREATE to see where it's getting stuck. However, I only see the output of the print statements from the tasks on the master node, not the slave(s).
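For reference, the call takes the following arguments (standard MPI Fortran binding; the variable names below are illustrative, not the actual ones used in mpl_groups.f90):
! Shape of the call in question - it is collective, so every task in the input
! communicator must reach it before any of them can return.
CALL MPI_CART_CREATE(ICOMM_OLD,  & ! input communicator
     &               NDIMS,      & ! number of Cartesian dimensions
     &               IDIMS,      & ! integer array: tasks per dimension
     &               LPERIODS,   & ! logical array: periodicity per dimension
     &               LREORDER,   & ! logical: allow rank reordering
     &               ICOMM_CART, & ! output Cartesian communicator
     &               IERR)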
Here are the setups that work:
Here are the setups that don't work:
Any ideas now? I'm really stumped by this.
Incidentally, I now have OpenIFS installed in an NFS directory mounted on all nodes, so all nodes see the same directory/path structure.
Sam Hatfield
By the way, here is a simple program that does NOT hang:
Invoked with:
Thanks!
Sam
Unknown User (nagc)
Hi Sam,
I checked and OpenIFS will run fine with 3 MPI tasks. Have you still got this problem?
The MPI initialization is done in ifsaux/module/mpl_init_mod.F90, after line 150. The code uses MPI_INIT_THREAD rather than MPI_INIT; I'm not sure if that would make a difference in your case, but it might be worth trying plain MPI_INIT instead.
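Roughly, the two forms look like this (just a sketch, not the actual mpl_init_mod code; the thread level requested there is an assumption on my part):
! Current form (roughly) - asks MPI for a given level of thread support:
CALL MPI_INIT_THREAD(MPI_THREAD_MULTIPLE, IPROVIDED, IERR)  ! requested level is an assumption - check MPL_INIT
! Plain form you could try in its place (use one or the other, never both):
! CALL MPI_INIT(IERR)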
MPI_CART_CREATE is essentially a collective operation, so you could try adding an MPI_BARRIER call just before that line to see if that works OK.
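Something like this just above the call would do (a sketch only; ICOMM, IDIMS etc. stand in for whatever variables mpl_groups.f90 actually passes):
! Force all tasks to synchronise before the collective call; if the barrier
! itself never returns, not every task is reaching this point.
CALL MPI_BARRIER(ICOMM, IERR)
CALL MPI_CART_CREATE(ICOMM, NDIMS, IDIMS, LPERIODS, LREORDER, ICOMM_CART, IERR)
If the barrier hangs as well, that would fit with your observation that only the master node's tasks produce the print output.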
Another option would be to try another MPI implementation (MPICH or OpenMPI); I test with both.
Perhaps when the model hangs, you could send an ABORT signal to the process and then look in the traceback to see where in the MPI library it's stuck, to get some more clues?
Hope that helps, not sure what the issue is.
Cheers, Glenn