Molcas 7.8 over InfiniBand - problem with nodes



Posted by Piotr Stuglik on July 07, 2013 at 20:19:12:

Hi,

it's me again. I have finally found an environment in which Molcas 7.8 builds without any failed tests: GCC 4.7.2 with OpenMPI 1.6.3.

The problem is that Molcas so far works only on a single node (whether it runs on 1, 2, 3 or even 8 CPUs makes no difference). An MPI "Hello world" program started across several nodes in the same environment works perfectly fine.
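
By "Hello world" I mean a minimal MPI check launched on both nodes, something along these lines (./hello stands for any MPI hello-world binary; the hostnames are the ones from the logs below):

>>
/usr/local/mpi/gcc-4.7.2/openmpi-1.6.3/bin/mpirun -np 2 -host wn732,wn722 ./hello
<<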

Does anyone know what could be wrong? Thanks in advance.

Regards

Piotr

This is the STDOUT I get:

>>
lajkonik@wn732: ~/molcas/new_molcas78-gcc472-openmpi163 $ ~/bin/molcas verify -v
Test
Segmentation Error
Failed!
<<

Some time after killing the test process I get:

>>
[wn722:28671] [[33746,0],1] ORTE_ERROR_LOG: Not found in file ess_tm_module.c at line 241
[wn722:28671] [[33746,0],1] -> [[33746,0],0] (node: NULL) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
<<

Both messages appear in multiple copies (one for each failed test, I presume).
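
The ess_tm_module.c in the first message points at Open MPI's Torque/PBS ("tm") support, and pbs_demux shows up in the ps output below, so the job is running under PBS. One thing I intend to try (a standard Open MPI option, though I haven't verified it changes anything here) is bypassing the tm launcher:

>>
mpirun --mca plm rsh -np 8 <program>
<<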

Results file:

>>
MOLCAS verification run
Machine: Linux wn732 2.6.18-348.3.1.el5 #1 SMP Tue Mar 12 08:02:37 CET 2013 x86_64 x86_64 x86_64 GNU/Linux
Date: Sun Jul 7 20:05:03 CEST 2013

test000 Status: Failed!
test001 Status: Failed!
test002 Status: Failed!
test003 Status: Failed!
<<

test000.err:

>>
--- Start Module: auto at Sun Jul 7 20:05:09 2013
>>> Export MOLCAS_PRINT=VERBOSE
--- Start Module: gateway at Sun Jul 7 20:05:17 2013
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: wn732 (PID 7209)
MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
ARMCI master: wait for child process (server) failed:: No child processes
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Last System Error Message from Task 1:: Connection refused
Last System Error Message from Task 2:: Connection refused
Last System Error Message from Task 3:: Connection refused
Last System Error Message from Task 0:: Connection refused
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 28388 on
node wn722 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Last System Error Message from Task 5:: Connection refused
Last System Error Message from Task 6:: Connection refused
Last System Error Message from Task 7:: Connection refused
[wn732:07208] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[wn732:07208] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Command exited with non-zero status 1
real 4.91
user 3.23
sys 0.17
--- Stop Module: gateway at Sun Jul 7 20:05:22 2013 /rc= _JOB_KILLED_ ---
--- Stop Module: auto at Sun Jul 7 20:05:23 2013 /rc= _JOB_KILLED_ ---
--- Module auto spent 14 seconds
<<
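
Side note: the warning block above explains how to silence the fork() warning itself, and the last two lines how to stop the aggregation of help messages. Neither should fix the crash, but they make the logs easier to read while debugging:

>>
mpirun --mca mpi_warn_on_fork 0 --mca orte_base_help_aggregate 0 ...
<<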

test000.log:

>>
--- Start Module: gateway at Sun Jul 7 20:05:17 2013
ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
-10004:Segmentation Violation error, status=: 11
(rank:-10004 hostname:wn722 pid:28402):ARMCI DASSERT fail. signaltrap.c:SigSegvHandler():301 cond:0
4:Child process terminated prematurely, status=: 11
(rank:4 hostname:wn722 pid:28388):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
0:Terminate signal was sent, status=: 15
(rank:0 hostname:wn732 pid:7209):ARMCI DASSERT fail. signaltrap.c:SigTermHandler():463 cond:0
--- Stop Module: gateway at Sun Jul 7 20:05:22 2013 /rc= _JOB_KILLED_ ---
Non-zero return code - check program input/output

Code has been interrupted

...................................................................................................
...................................................................................................
.....Dave, this conversation can serve no purpose anymore. Goodbye.................................
...................................................................................................

--- Stop Module: auto at Sun Jul 7 20:05:23 2013 /rc= _JOB_KILLED_ ---
--- Module auto spent 14 seconds
<<
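
The second line of this log seems telling: ARMCI reports its network protocol as 'TCP/IP Sockets', so Global Arrays is not talking over InfiniBand at all here. If I read the GA build notes correctly, ARMCI has to be built explicitly for InfiniBand, roughly like the line below; the exact target and variables depend on the GA version bundled with Molcas, so treat this only as a sketch I have not yet tried:

>>
make TARGET=LINUX64 ARMCI_NETWORK=OPENIB
<<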

ps xf output for first node:

>>
[wcss] lajkonik@supernova ~ ssh wn732 ps xf
PID TTY STAT TIME COMMAND
8756 ? S 0:00 sshd: lajkonik@notty
8757 ? Rs 0:00 \_ ps xf
1728 pts/2 S 0:00 -bash
1729 pts/2 S 0:00 \_ pbs_demux
7702 pts/2 S+ 0:00 \_ /bin/sh /home/lajkonik/bin/molcas verify -v
7707 pts/2 S+ 0:00 \_ /bin/sh /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/sbin/verify -v
8646 pts/2 S+ 0:00 \_ /bin/sh /usr/local/bin/molcas -ign test006.input -f
8652 pts/2 S+ 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/molcas.exe test006.input -ign
8653 pts/2 S+ 0:00 \_ perl -e ... [long inline Molcas driver script; mangled beyond readability in the original post, omitted here]
8739 pts/2 S+ 0:00 \_ /usr/bin/time -p /usr/local/mpi/gcc-4.7.2/openmpi-1.6.3/bin/mpirun -np 8 /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8740 pts/2 S+ 0:00 \_ /usr/local/mpi/gcc-4.7.2/openmpi-1.6.3/bin/mpirun -np 8 /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8741 pts/2 SLl+ 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8742 pts/2 RLl+ 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8743 pts/2 RLl+ 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8744 pts/2 RLl+ 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe
8457 pts/2 S+ 0:00 /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/seward.exe

<<

ps xf output for second node:

>>
[wcss] lajkonik@supernova ~ ssh wn722 ps xf
PID TTY STAT TIME COMMAND
28881 ? S 0:00 sshd: lajkonik@notty
28882 ? Rs 0:00 \_ ps xf
28866 ? Ss 0:00 orted -mca ess tm -mca orte_ess_jobid 2156068864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2156068864.0;tcp://10.0.26.52:57932;tcp://192.168.26.52:57932" -mca orte_nodelist wn722
28867 ? SLl 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/parnell.exe c 1 stdin /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/Test/tmp/test000
28868 ? SLl 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/parnell.exe c 1 stdin /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/Test/tmp/test000
28869 ? SLl 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/parnell.exe c 1 stdin /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/Test/tmp/test000
28870 ? SLl 0:00 \_ /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/bin/parnell.exe c 1 stdin /home/lajkonik/molcas/new_molcas78-gcc472-openmpi163/Test/tmp/test000
<<
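
One more detail from the orted command line above: the --hnp-uri lists two TCP addresses (10.0.26.52 and 192.168.26.52), so the nodes sit on at least two networks. If the out-of-band channel picks the wrong one, that could explain the "oob-tcp: Number of attempts to create TCP connection has been exceeded" failures. Something I plan to test (eth0 is only a guess for whichever interface carries the 10.0.26.x network on this cluster):

>>
mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 8 <program>
<<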

