List: Developers list for StarPU
- From: Cyril Roelandt <cyril.roelandt@inria.fr>
- To: starpu-devel@lists.gforge.inria.fr
- Subject: [Starpu-devel] [BUG] examples/heat/heat sometimes hangs.
- Date: Fri, 27 Jan 2012 14:57:36 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi,
./examples/heat/heat sometimes hangs forever. Thanks to Hydra and Ludo, it is now easy to reproduce this bug.
$ ./configure --prefix=/home/croelandt/opt --enable-debug --enable-verbose --disable-cuda --disable-opencl
$ make -j
$ STARPU_NCPUS=64 ./examples/heat/heat
A given run may complete normally, so you will probably need to repeat it a few times :)
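Since the hang is intermittent, the repetition can be automated. The following is only a sketch, not part of StarPU: the function name repeat_until_failure, the run count and the timeout value are all hypothetical, and it assumes the timeout utility from GNU coreutils is available (it exits with status 124 when the time limit is reached).

```shell
# Hypothetical helper (not part of StarPU): run a command repeatedly and stop
# at the first hang or non-zero exit.
# Usage: repeat_until_failure RUNS TIMEOUT_SECONDS COMMAND [ARGS...]
repeat_until_failure() {
    runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        # GNU timeout kills the command after $limit seconds and returns 124.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i: hang (no exit after ${limit}s)"
            return 1
        elif [ "$status" -ne 0 ]; then
            # Covers crashes such as a segfault as well as ordinary failures.
            echo "run $i: failed with exit status $status"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $runs runs"
    return 0
}

# Example invocation from a built StarPU tree (values are assumptions):
# repeat_until_failure 100 60 env STARPU_NCPUS=64 ./examples/heat/heat
```

The 60-second limit is a guess based on the run time shown below (about 150 ms); any value comfortably above a normal run should work. Because timeout kills the process, drop it and attach gdb by hand if you want to inspect the hung threads.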
Helgrind does not give any useful hint about what is going wrong. With CUDA and OpenCL enabled, I sometimes get a segfault instead:
Computation took (in ms)
147.07
Synthetic GFlops : 4.87
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc8fe9700 (LWP 10784)]
0x00007ffff0ddd80e in _starpu_push_task_output (j=0x7fffa8005360, mask=0) at datawizard/coherency.c:716
716 unsigned handle_was_destroyed = handle->lazy_unregister;
(gdb) p handle
$1 = (starpu_data_handle_t) 0x7fffb9229730
(gdb) p handle->lazy_unregister
Cannot access memory at address 0x7fffb922de00
This seems to have been going on for a while, but we somehow missed it. Back in revision 5000, it was already segfaulting:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe3fff700 (LWP 12305)]
0x00007ffff795b8b8 in ATL_scol2blk_a1 () from /usr/lib/libblas.so.3gf
(gdb) bt
#0 0x00007ffff795b8b8 in ATL_scol2blk_a1 () from /usr/lib/libblas.so.3gf
#1 0x00007ffff7a2acc4 in ATL_smmJIK2 () from /usr/lib/libblas.so.3gf
#2 0x00007ffff7a2b88f in ATL_smmJIK () from /usr/lib/libblas.so.3gf
#3 0x00007ffff795db8e in ATL_sgemm () from /usr/lib/libblas.so.3gf
#4 0x00007ffff7a32853 in ATL_sptgemm_nt () from /usr/lib/libblas.so.3gf
#5 0x00007ffff7a329c4 in ATL_sptgemm () from /usr/lib/libblas.so.3gf
#6 0x00007ffff7b9139a in sgemm_ () from /usr/lib/libblas.so.3gf
#7 0x0000000000408bdf in SGEMM (transa=<value optimized out>, transb=<value optimized out>, M=64, N=64, K=64, alpha=-1, A=<value optimized out>, lda=1024, B=0x7fffec02d090,
ldb=1024, beta=1, C=0x7fffe07b8e10, ldc=1024) at common/blas.c:247
#8 0x00000000004088fe in dw_common_cpu_codelet_update_u22 (descr=<value optimized out>, _args=<value optimized out>) at heat/dw_factolu_kernels.c:127
#9 dw_cpu_codelet_update_u22 (descr=<value optimized out>, _args=<value optimized out>) at heat/dw_factolu_kernels.c:152
#10 0x00007ffff73b5567 in execute_job_on_cpu (arg=0x7ffff75ca820) at drivers/cpu/driver_cpu.c:60
#11 _starpu_cpu_worker (arg=0x7ffff75ca820) at drivers/cpu/driver_cpu.c:183
#12 0x00007ffff68a3b40 in start_thread (arg=<value optimized out>) at pthread_create.c:304
#13 0x00007ffff65ee36d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()
It is probably a mutex-related issue.
Cyril.