
starpu-devel - [Starpu-devel] [BUG] examples/heat/heat sometimes hangs.

  • From: Cyril Roelandt <cyril.roelandt@inria.fr>
  • To: starpu-devel@lists.gforge.inria.fr
  • Subject: [Starpu-devel] [BUG] examples/heat/heat sometimes hangs.
  • Date: Fri, 27 Jan 2012 14:57:36 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

./examples/heat/heat sometimes hangs forever. Thanks to Hydra and Ludo, it is now easy to reproduce this bug.

$ ./configure --prefix=/home/croelandt/opt --enable-debug --enable-verbose --disable-cuda --disable-opencl
$ make -j
$ STARPU_NCPUS=64 ./examples/heat/heat

A given run may complete normally, so you will probably need to repeat it a few times :)
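
Since the hang is intermittent, the repetition can be automated. The following is only a sketch: `repeat_until_failure` is a hypothetical helper (not part of StarPU), it assumes the coreutils `timeout` command is available, and the 60-second limit in the example invocation is a guess that should be well above a normal run time.

```shell
#!/bin/sh
# Hypothetical helper, not part of StarPU: run a command up to MAX times and
# stop at the first run that hangs or fails.
# Usage: repeat_until_failure MAX LIMIT_SECONDS COMMAND [ARGS...]
# Assumes the coreutils `timeout` command, which exits with status 124 when
# it had to kill the command because the time limit was reached.
repeat_until_failure() {
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i: hang (killed after ${limit}s)"
            return 1
        elif [ "$status" -ne 0 ]; then
            echo "run $i: failed with exit status $status"
            return 2
        fi
        i=$((i + 1))
    done
    echo "no failure in $max runs"
    return 0
}

# Example invocation from the StarPU build directory (limits are guesses):
# repeat_until_failure 100 60 env STARPU_NCPUS=64 ./examples/heat/heat
```

The function returns 1 on a suspected hang and 2 on any other failure (for instance a segfault), so the two symptoms described in this report can be told apart.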


Helgrind does not give any useful hint about what is going wrong. With CUDA and OpenCL enabled, I sometimes get a segfault instead:

Computation took (in ms)
147.07
Synthetic GFlops : 4.87

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc8fe9700 (LWP 10784)]
0x00007ffff0ddd80e in _starpu_push_task_output (j=0x7fffa8005360, mask=0) at datawizard/coherency.c:716
716 unsigned handle_was_destroyed = handle->lazy_unregister;
(gdb) p handle
$1 = (starpu_data_handle_t) 0x7fffb9229730
(gdb) p handle->lazy_unregister
Cannot access memory at address 0x7fffb922de00




This seems to have been going on for a while, but we somehow missed it. Back in r. 5000, it was already segfaulting:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe3fff700 (LWP 12305)]
0x00007ffff795b8b8 in ATL_scol2blk_a1 () from /usr/lib/libblas.so.3gf
(gdb) bt
#0 0x00007ffff795b8b8 in ATL_scol2blk_a1 () from /usr/lib/libblas.so.3gf
#1 0x00007ffff7a2acc4 in ATL_smmJIK2 () from /usr/lib/libblas.so.3gf
#2 0x00007ffff7a2b88f in ATL_smmJIK () from /usr/lib/libblas.so.3gf
#3 0x00007ffff795db8e in ATL_sgemm () from /usr/lib/libblas.so.3gf
#4 0x00007ffff7a32853 in ATL_sptgemm_nt () from /usr/lib/libblas.so.3gf
#5 0x00007ffff7a329c4 in ATL_sptgemm () from /usr/lib/libblas.so.3gf
#6 0x00007ffff7b9139a in sgemm_ () from /usr/lib/libblas.so.3gf
#7 0x0000000000408bdf in SGEMM (transa=<value optimized out>, transb=<value optimized out>, M=64, N=64, K=64, alpha=-1, A=<value optimized out>, lda=1024, B=0x7fffec02d090, ldb=1024, beta=1, C=0x7fffe07b8e10, ldc=1024) at common/blas.c:247
#8 0x00000000004088fe in dw_common_cpu_codelet_update_u22 (descr=<value optimized out>, _args=<value optimized out>) at heat/dw_factolu_kernels.c:127
#9 dw_cpu_codelet_update_u22 (descr=<value optimized out>, _args=<value optimized out>) at heat/dw_factolu_kernels.c:152
#10 0x00007ffff73b5567 in execute_job_on_cpu (arg=0x7ffff75ca820) at drivers/cpu/driver_cpu.c:60
#11 _starpu_cpu_worker (arg=0x7ffff75ca820) at drivers/cpu/driver_cpu.c:183
#12 0x00007ffff68a3b40 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#13 0x00007ffff65ee36d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()



It is probably a mutex-related issue.


Cyril.



