Re: [Starpu-devel] Issue with distributed NUMA-aware StarPU and dmda scheduler
- From: Philippe SWARTVAGHER <philippe.swartvagher@inria.fr>
- To: Samuel Thibault <samuel.thibault@inria.fr>, starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] Issue with distributed NUMA-aware StarPU and dmda scheduler
- Date: Mon, 16 Mar 2020 11:09:39 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hello,
On 14/03/2020 at 01:37, Samuel Thibault wrote:
> I pushed and backported a fix.
Well, your fix creates random segfaults:
mpirun -DNMAD_DRIVER=tcp -DSTARPU_USE_NUMA=1 -DSTARPU_RESERVE_NCPU=2 -DSTARPU_MPI_COOP_SENDS=0 -n 2 -nodelist henri0,henri1 ~/chameleon/build/testing/chameleon_stesting -o potrf -H -n 4800:50000:8000 --mtxfmt 1
For instance, in one execution, it crashes while computing the matrix of size 44800:
/net/inria/home/pswartva/chameleon/build/testing/chameleon_stesting(+0x4d452)[0x5555555a1452]
/home/pswartva/starpu-build/lib/libstarpu-1.3.so.0(starpu_data_unpack+0x3d)[0x7ffff7ed451d]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x12ee1)[0x7ffff7fafee1]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x14ed1)[0x7ffff7fb1ed1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8fb7)[0x7ffff16b6fb7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffff11af1af]
/net/inria/home/pswartva/chameleon/build/testing/chameleon_stesting(+0x4d482)[0x5555555a1482]
/home/pswartva/starpu-build/lib/libstarpu-1.3.so.0(starpu_data_unpack+0x3d)[0x7ffff7ed451d]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x12ee1)[0x7ffff7fafee1]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x14ed1)[0x7ffff7fb1ed1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8fb7)[0x7ffff16b6fb7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffff11af1af]
/net/inria/home/pswartva/chameleon/build/testing/chameleon_stesting(+0x4d3f2)[0x5555555a13f2]
/home/pswartva/starpu-build/lib/libstarpu-1.3.so.0(starpu_data_unpack+0x3d)[0x7ffff7ed451d]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x12ee1)[0x7ffff7fafee1]
/home/pswartva/starpu-build/lib/libstarpumpi-1.3.so.0(+0x14ed1)[0x7ffff7fb1ed1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8fb7)[0x7ffff16b6fb7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffff11af1af]
Thread 105 "MPI" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffebcff9700 (LWP 22873)]
0x00007ffff12143e4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff12143e4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00005555555a14ac in cti_unpack_data_fullrank (ptr=0x7ffe49e16050,
cham_tile_interface=0x5555560eb0a0)
at /home/pswartva/chameleon/runtime/starpu/interface/cham_tile_interface.c:296
#2 cti_unpack_data (handle=<optimized out>, node=0, ptr=0x7ffe49e16030,
count=409632)
at /home/pswartva/chameleon/runtime/starpu/interface/cham_tile_interface.c:342
#3 0x00007ffff7ed451d in starpu_data_unpack (handle=0x55555595d8a0,
ptr=0x7ffe49e16030, count=409632)
at ../../src/datawizard/interfaces/data_interface.c:1093
#4 0x00007ffff7fafee1 in _starpu_mpi_handle_request_termination (
req=req@entry=0x55555622f830) at ../../../mpi/src/mpi/starpu_mpi_mpi.c:865
#5 0x00007ffff7fb1ed1 in _starpu_mpi_test_detached_requests ()
at ../../../mpi/src/mpi/starpu_mpi_mpi.c:1030
#6 _starpu_mpi_progress_thread_func (arg=0x5555556cf970)
at ../../../mpi/src/mpi/starpu_mpi_mpi.c:1334
#7 0x00007ffff16b6fb7 in start_thread (arg=<optimized out>)
at pthread_create.c:486
#8 0x00007ffff11af1af in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#1 0x00005555555a14ac in cti_unpack_data_fullrank (ptr=0x7ffe49e16050,
cham_tile_interface=0x5555560eb0a0)
at /home/pswartva/chameleon/runtime/starpu/interface/cham_tile_interface.c:296
296 memcpy( matrix, ptr, cham_tile_interface->allocsize );
(gdb) p matrix
$1 = 0x7ffe56d92000 ""
(gdb) p ptr
$2 = (void *) 0x7ffe49e16050
(gdb) p cham_tile_interface
$3 = (starpu_cham_tile_interface_t *) 0x5555560eb0a0
(gdb) p *cham_tile_interface
$4 = {id = STARPU_MAX_INTERFACE_ID, dev_handle = 140730355490816,
flttype = ChamRealFloat, allocsize = 13727159741872579443,
tilesize = 409600, tile = {format = 79 'O', m = 320, n = 320, ld = 320,
mat = 0x7ffe56d92000}}
(gdb)
Indeed, the size to copy looks wrong: allocsize is 13727159741872579443, whereas tilesize is 409600 (and id is STARPU_MAX_INTERFACE_ID). Again, this is with MadMPI, so the bug may come from there: NewMadeleine has seen a lot of big changes recently.
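To make the inconsistency in the gdb output concrete, here is a minimal sketch of the kind of sanity check one could put in front of the memcpy. It is not Chameleon's code: tile_desc_t, payload_bytes and checked_unpack are hypothetical names, and the 32-byte header is only inferred from the two ptr values in the backtrace (0x7ffe49e16030 in cti_unpack_data versus 0x7ffe49e16050 in cti_unpack_data_fullrank) together with count = 409632 = 409600 + 32. The relation "allocsize must fit in the received payload" is an assumption too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the tile descriptor printed by gdb;
 * this is NOT Chameleon's real starpu_cham_tile_interface_t. */
typedef struct {
    uint64_t allocsize; /* bytes the unpack routine is about to memcpy */
    uint64_t tilesize;  /* m * n * element size, in bytes */
    int m, n, ld;
} tile_desc_t;

/* Payload bytes left once the packed header has been skipped. */
static uint64_t payload_bytes(uint64_t count, uint64_t header)
{
    return count >= header ? count - header : 0;
}

/* Copy only if the descriptor is consistent with the received buffer.
 * Returns 0 on success, -1 if the sizes do not add up. */
static int checked_unpack(void *dst, const void *src, uint64_t count,
                          uint64_t header, const tile_desc_t *d,
                          uint64_t elt_size)
{
    assert(dst != NULL && src != NULL && d != NULL);

    /* tilesize should match the tile dimensions (320 * 320 * 4 here). */
    uint64_t expected = (uint64_t)d->m * (uint64_t)d->n * elt_size;
    if (d->tilesize != expected)
        return -1;

    /* Assumption: we never copy more than what was actually received. */
    if (d->allocsize > payload_bytes(count, header))
        return -1;

    memcpy(dst, (const char *)src + header, (size_t)d->allocsize);
    return 0;
}
```

With the values from the session above (count = 409632, a 32-byte header, m = n = 320, 4-byte floats), tilesize passes the first check, but allocsize = 13727159741872579443 fails the second one, so such a check would return an error instead of segfaulting inside memcpy.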
--
Philippe SWARTVAGHER
PhD student
TADaaM team, Inria Bordeaux Sud-Ouest