



Re: [Starpu-devel] Segfault occurs when using openMPI memory pinning


  • From: Nathalie Furmento <nathalie.furmento@labri.fr>
  • To: Marc Sergent <marc.sergent@inria.fr>, starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] Segfault occurs when using openMPI memory pinning
  • Date: Wed, 06 May 2015 11:40:26 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Marc,

Are you able to reproduce the bug on the plafrim platform?

If yes, could you provide the list of modules you are using, the queue, the options you used to compile StarPU & Chameleon, the application you are running, and any other information needed to reproduce the bug?

And what does "StarPU 1.1 r15399 behaves the same as 1.2" mean?

Cheers,

Nathalie


On 06/05/2015 11:30, Marc Sergent wrote:
Hello,

I want to perform a distributed DPOTRF on 4 nodes with the Chameleon solver on top of StarPU, with OpenMPI memory pinning enabled (export OMPI_MCA_mpi_leave_pinned_pipeline=1). My test case is N=65536, NB=512, on 4 heterogeneous nodes (8 CPUs + 2 GPUs), with Chameleon (r2201) on top of StarPU 1.2 (r15399). I ran my experiments on the TGCC Curie platform with BullxMPI 1.2.8.2. StarPU 1.1 (r15399) behaves the same as 1.2, and the segfault can also be reproduced with CPU-only nodes.
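
For reference, here is a minimal sketch (not the actual Chameleon test case; the tag, tile size and overall structure are only illustrative) of the kind of StarPU-MPI call path that ends up in the MPI_Isend shown in the trace below:

#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>
#include <starpu.h>
#include <starpu_mpi.h>

int main(int argc, char **argv)
{
    /* Run with OMPI_MCA_mpi_leave_pinned_pipeline=1 set in the environment. */
    if (starpu_init(NULL) != 0) return 1;
    if (starpu_mpi_init(&argc, &argv, 1) != 0) return 1;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One tile-sized buffer (512x512 doubles, matching NB=512). */
    double *tile = malloc(512 * 512 * sizeof(double));
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)tile, 512 * 512, sizeof(double));

    /* Detached transfers: the actual MPI_Isend/MPI_Irecv calls are issued by
       the StarPU-MPI progression thread, as in frames [6]-[7] of the trace. */
    if (rank == 0)
        starpu_mpi_isend_detached(handle, 1, 42, MPI_COMM_WORLD, NULL, NULL);
    else if (rank == 1)
        starpu_mpi_irecv_detached(handle, 0, 42, MPI_COMM_WORLD, NULL, NULL);

    starpu_data_unregister(handle); /* waits until the communication completes */
    free(tile);

    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}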

[curie7065:07257] *** Process received signal ***
[curie7065:07257] Signal: Segmentation fault (11)
[curie7065:07257] Signal code: Address not mapped (1)
[curie7065:07257] Failing at address: 0x40
[curie7065:07257] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b1af7ed9710]
[curie7065:07257] [ 1] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_rcache_vma.so(mca_rcache_vma_delete+0x1b) [0x2b1affb0703b]
[curie7065:07257] [ 2] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_mpool_rdma.so(mca_mpool_rdma_register+0xe4) [0x2b1afff0ca84]
[curie7065:07257] [ 3] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_pml_ob1.so(mca_pml_ob1_rdma_btls+0x13a) [0x2b1b0095119a]
[curie7065:07257] [ 4] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x8ba) [0x2b1b0095085a]
[curie7065:07257] [ 5] /opt/mpi/bullxmpi/1.2.8.2/lib/libmpi.so.1(MPI_Isend+0xff) [0x2b1afa48ca8f]
[curie7065:07257] [ 6] /ccc/cont003/home/gen1567/sergentm/libs/lib/libstarpumpi-1.1.so.2(+0x4c21) [0x2b1af018cc21]
[curie7065:07257] [ 7] /ccc/cont003/home/gen1567/sergentm/libs/lib/libstarpumpi-1.1.so.2(+0x9b4a) [0x2b1af0191b4a]
[curie7065:07257] [ 8] /lib64/libpthread.so.0(+0x79d1) [0x2b1af7ed19d1]
[curie7065:07257] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b1afcb318fd]

The segfault can also occur in other MPI routines (mostly Irecv and Test). I tried to attach gdb to capture the backtrace, but the segfault did not occur while gdb was attached.

I also sometimes get the following kind of message, but I have not been able to catch a backtrace at that point:

[[37962,1],1][btl_openib_component.c:3544:handle_wc] from curie7136 to: curie7138 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 2b813a9fb550 opcode 128 vendor error 84 qp_idx 3
[curie7136:7161] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0x2b816d80c000, size 2101248). MPI job will now abort.
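
For context on that last message: with leave_pinned, Open MPI keeps communication buffers registered in its cache (the mca_mpool_rdma / mca_rcache_vma components visible in the backtrace), so a buffer handed to a non-blocking call must not be released before the communication completes. A hypothetical sketch of that constraint (not taken from our code):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t size = 2101248;              /* size taken from the message above */
    char *buf = malloc(size);

    if (rank == 0) {
        MPI_Request req;
        MPI_Isend(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
        /* Calling free(buf) here, while the send is still in flight, is the
           pattern Open MPI aborts on above: with leave_pinned the buffer is
           still registered for RDMA. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        free(buf);                      /* OK: the request has completed */
    } else if (rank == 1) {
        MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}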

Do you have an idea of what is happening?

Thanks in advance,
Marc Sergent






