Hi,
I use two machines to perform OpenIFS experiments: a desktop PC for testing and a small 16-core server for running experiments. The directory in which I build and run OpenIFS is mounted on both machines, and they should have identical environments (the same compiler versions and so on). However, although I can build and run on the desktop machine, on the server the build succeeds but the program fails at run time. I get the following backtrace:
signal_drhook(SIGABRT=6): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGBUS=7): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSEGV=11): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSTKFLT=16): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGFPE=8): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGILL=4): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTRAP=5): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGINT=2): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGQUIT=3): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTERM=15): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGXCPU=24): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSYS=31): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
JSETSIG: sl->active = 0
signal_harakiri(SIGALRM=14): New handler installed at 0xabeae4; old preserved at 0x0
***Received signal = 4 and ActivatED SIGALRM=14 and calling alarm(10), time = 0.01
[myproc#1,tid#1,pid#2415,signal#4(SIGILL)]: Received signal :: 17MB (heap), 17MB (rss), 0MB (stack), 0 (paging), nsigs 1, time 0.01
tid#1 starting drhook traceback, time = 0.01
[myproc#1,tid#1,pid#2415]: MASTER
[myproc#1,tid#1,pid#2415]: CNT0<1>
tid#1 starting sigdump traceback, time = 0.01
[gdb__sigdump] : Received signal#4(SIGILL), pid=2415
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109 : master.exe() [0xaf4ce8]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:883 : master.exe() [0xabebe1]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1119 : master.exe() [0xac3b5d]
(pid=2415): <Unknown> : libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67 : master.exe() [0xb007cf]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153 : master.exe() [0xad288e]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112 : master.exe() [0x409f7f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65 : master.exe() [0x408f06]
(pid=2415): <Unknown> : libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415): <Unknown> : master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
Done tracebacks, calling exit with sig=4, time = 0.05
ABORT! 1 Dr.Hook calls ABOR1 ...
[myproc#1,tid#1,pid#2415]: MASTER
[myproc#1,tid#1,pid#2415]: CNT0<1>
SDL_TRACEBACK: Calling LINUX_TRBK, THRD = 1
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109 : master.exe() [0xaf4ce8]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:189 : master.exe() [0xaf4d1d]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/module/sdl_mod.F90:71 : master.exe() [0xb0599f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/abor1.F90:37 : master.exe() [0xab3417]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1123 : master.exe() [0xac3bb1]
(pid=2415): <Unknown> : libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67 : master.exe() [0xb007cf]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153 : master.exe() [0xad288e]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112 : master.exe() [0x409f7f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65 : master.exe() [0x408f06]
(pid=2415): <Unknown> : libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415): <Unknown> : master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
SDL_TRACEBACK: Done LINUX_TRBK, THRD = 1
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 2415 on node cirrus1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
We have made some modifications to OpenIFS, but I don't think it's a bug on our side because it works fine on the desktop PC. It looks like there's an illegal instruction (SIGILL) in one of the clock functions: the first frame below the signal handler is user_clock.F90:67, called from gstats.F90:153. Any idea what's going wrong?
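For anyone reading a Dr.Hook dump like the one above, the faulting frame can be picked out mechanically. The following is only a rough Python sketch: the function names are made up for illustration, the list of "handler" source files is an assumption taken from this particular log, and the sample text uses shortened placeholder paths rather than the real ones.

```python
import re

# Matches entries such as
# "(pid=2415): /some/path/user_clock.F90:67 : master.exe() [0xb007cf]"
FRAME_RE = re.compile(
    r"\(pid=\d+\):\s+(\S+):(\d+)\s+:\s+\S+\s+\[(0x[0-9a-f]+)\]"
)
SIGNAL_RE = re.compile(r"signal#(\d+)\((\w+)\)")

# Source files that belong to the traceback/abort machinery itself.
# Assumption: this list is based only on the frames seen in this log.
HANDLER_FILES = ("linuxtrbk.c", "drhook.c", "sdl_mod.F90", "abor1.F90")


def signal_name(log_text):
    """Return the name of the first signal reported in the log, or None."""
    match = SIGNAL_RE.search(log_text)
    return match.group(2) if match else None


def faulting_frame(log_text):
    """Return (file, line, address) of the first frame outside the handlers."""
    for path, line, addr in FRAME_RE.findall(log_text):
        name = path.rsplit("/", 1)[-1]
        if name not in HANDLER_FILES:
            return name, int(line), addr
    return None


# Shortened sample in the same shape as the log (paths are placeholders).
SAMPLE = (
    "[gdb__sigdump] : Received signal#4(SIGILL), pid=2415 "
    "(pid=2415): /path/src/ifsaux/utilities/linuxtrbk.c:109 : master.exe() [0xaf4ce8] "
    "(pid=2415): /path/src/ifsaux/support/drhook.c:883 : master.exe() [0xabebe1] "
    "(pid=2415): <Unknown> : libpthread.so.0(+0x10330) [0x2afb43e18330] "
    "(pid=2415): /path/src/ifsaux/support/user_clock.F90:67 : master.exe() [0xb007cf]"
)

print(signal_name(SAMPLE), faulting_frame(SAMPLE))
```

The regular expression does not care whether the log is on one line or many, so it should cope with both the pasted and the original layout. The "<Unknown>" entries have no file:line pair and are skipped.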
Previously I was getting a similar error originating from drhook.c line 4040, but that's gone away for some reason.
I build from scratch on both machines with gcc/gfortran version 4.8.3.
Thanks,
Sam Hatfield
4 Comments
Sam Hatfield
This is my config file by the way:
Unknown User (nagc)
Hi Sam,
Try taking the:
option off the compile options line. That can produce faster code because the compiler will autodetect the chip and compile specifically for that architecture, but I've seen it cause problems sometimes, either because the compiler is not doing the right thing or because an executable is compiled on one machine and then moved to another machine with a very similar but slightly different chip. So I no longer specify this in the default options for OpenIFS 40r1.
If that's not it, does the model run ok with 'noopt'?
Glenn
Sam Hatfield
Hi Glenn,
I think it is something to do with that flag, or maybe -m64. The desktop is Intel Core i7-4770 and the server is Intel Xeon E5630. I played around with those flags and I was able to get the model to run. When I switched back to the compilation setup shown above, however, I wasn't able to reproduce the problem from before. Well, that's how it goes...
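One way to check the "slightly different chip" explanation is to compare the CPU feature flags of the two machines. The i7-4770 is a Haswell-generation chip with AVX2 and FMA, whereas the Xeon E5630 is a Westmere-generation chip with no AVX at all, so an executable tuned for the desktop's chip can contain instructions the server cannot execute, which would show up as SIGILL. Below is a minimal Python sketch, assuming Linux and that the contents of /proc/cpuinfo from each machine have been saved somewhere readable. The function names are hypothetical, and the flag lists in the example are abbreviated illustrations, not real dumps from these machines.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the text of a Linux /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags\t\t: fpu vme ... avx2"
            return set(line.split(":", 1)[1].split())
    return set()


def missing_on_target(build_cpuinfo, run_cpuinfo):
    """Flags the build machine has that the run machine lacks."""
    return sorted(cpu_flags(build_cpuinfo) - cpu_flags(run_cpuinfo))


# Abbreviated, illustrative flag lists (not real dumps).
DESKTOP = "processor\t: 0\nflags\t\t: fpu sse4_2 avx avx2 fma\n"
SERVER = "processor\t: 0\nflags\t\t: fpu sse4_2\n"

print(missing_on_target(DESKTOP, SERVER))  # ['avx', 'avx2', 'fma']
```

If the list is non-empty, any build that lets the compiler target the build machine's own chip is at risk on the other machine. Building on the older machine, or dropping the chip-specific option as suggested above, avoids that.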
Sam
Unknown User (nagc)
It's possible the GNU compiler is not installed correctly on the Xeon and is generating bad code with -march. Do you have a more recent version to try? gfortran 4.9.3 is the earliest I test with, but I'm sure 38r1 was originally tested with GNU 4.8.3.
The -m64 option should do nothing, as both are 64-bit machines?
Glenn