
starpu-devel - [Starpu-devel] Segfault occurs when using openMPI memory pinning

Subject: Developers list for StarPU




  • From: Marc Sergent <marc.sergent@inria.fr>
  • To: starpu-devel@lists.gforge.inria.fr
  • Subject: [Starpu-devel] Segfault occurs when using openMPI memory pinning
  • Date: Wed, 06 May 2015 11:30:20 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,

I want to perform a distributed DPOTRF on 4 nodes with the Chameleon solver on top of StarPU, with OpenMPI memory pinning enabled (export OMPI_MCA_mpi_leave_pinned_pipeline=1). My test case is N=65536, NB=512, on 4 heterogeneous nodes (8 CPUs + 2 GPUs each), with Chameleon (r2201) on top of StarPU 1.2 (r15399). I ran my experiments on the TGCC Curie platform with BullxMPI 1.2.8.2. StarPU 1.1 (r15399) behaves the same as 1.2, and the segfault can also be reproduced on CPU-only nodes.
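For reference, the run is launched roughly as follows; only the environment variable is the actual setting from this report, while the binary name and its options are illustrative assumptions:

```shell
# Enable OpenMPI's pipelined "leave pinned" protocol (the setting from this report).
# This keeps send/receive buffers registered with the NIC across calls,
# which is where the mpool/rcache code in the backtrace gets involved.
export OMPI_MCA_mpi_leave_pinned_pipeline=1

# Illustrative launch on 4 nodes; the executable name and flags below
# are assumptions, not taken from the report.
mpirun -np 4 ./chameleon_dpotrf -N 65536 -nb 512
```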

[curie7065:07257] *** Process received signal ***
[curie7065:07257] Signal: Segmentation fault (11)
[curie7065:07257] Signal code: Address not mapped (1)
[curie7065:07257] Failing at address: 0x40
[curie7065:07257] [ 0] /lib64/libpthread.so.0(+0xf710) [0x2b1af7ed9710]
[curie7065:07257] [ 1] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_rcache_vma.so(mca_rcache_vma_delete+0x1b) [0x2b1affb0703b]
[curie7065:07257] [ 2] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_mpool_rdma.so(mca_mpool_rdma_register+0xe4) [0x2b1afff0ca84]
[curie7065:07257] [ 3] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_pml_ob1.so(mca_pml_ob1_rdma_btls+0x13a) [0x2b1b0095119a]
[curie7065:07257] [ 4] /opt/mpi/bullxmpi/1.2.8.2/lib/bullxmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x8ba) [0x2b1b0095085a]
[curie7065:07257] [ 5] /opt/mpi/bullxmpi/1.2.8.2/lib/libmpi.so.1(MPI_Isend+0xff) [0x2b1afa48ca8f]
[curie7065:07257] [ 6] /ccc/cont003/home/gen1567/sergentm/libs/lib/libstarpumpi-1.1.so.2(+0x4c21) [0x2b1af018cc21]
[curie7065:07257] [ 7] /ccc/cont003/home/gen1567/sergentm/libs/lib/libstarpumpi-1.1.so.2(+0x9b4a) [0x2b1af0191b4a]
[curie7065:07257] [ 8] /lib64/libpthread.so.0(+0x79d1) [0x2b1af7ed19d1]
[curie7065:07257] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b1afcb318fd]

The segfault can also occur in other MPI routines (mostly Irecv and Test). I tried attaching gdb to capture a backtrace, but the segfault did not occur while running under the debugger.

I also sometimes get this kind of message, but I have not been able to capture a backtrace from that point:

[[37962,1],1][btl_openib_component.c:3544:handle_wc] from curie7136 to: curie7138 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 2b813a9fb550 opcode 128 vendor error 84 qp_idx 3
[curie7136:7161] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0x2b816d80c000, size 2101248). MPI job will now abort.
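That last message points at a buffer-lifetime issue: with leave_pinned, OpenMPI's registration cache keeps tracking a buffer after the Isend returns, so freeing it before the request completes corrupts the cache. The sketch below is purely illustrative of that constraint (ranks, tag, and size are made up; it is not taken from the Chameleon/StarPU code):

```c
/* Sketch of the buffer-lifetime rule behind the abort message above:
 * a buffer handed to MPI_Isend must stay allocated (and unmodified)
 * until the matching MPI_Wait/MPI_Test says the request completed. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                       /* made-up message size */
    double *buf = malloc(N * sizeof(double));
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD, &req);
        /* WRONG: calling free(buf) here would release memory the NIC may
         * still be reading from; with leave_pinned, the mpool/rcache also
         * still holds a registration for it. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);       /* complete first ...   */
        free(buf);                               /* ... then free        */
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

If StarPU-MPI (or the application above it) ever releases a data buffer before the corresponding request is waited on, that would match both the abort message and the rcache segfault.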

Do you have an idea of what is happening?

Thanks in advance,
Marc Sergent


--
Marc Sergent
Ph.D. Student at Inria Bordeaux Sud-Ouest
STORM Team
Phone: (+33|0) 5 24 57 40 71





