starpu-devel - [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);

  • From: Xavier Lacoste <xl64100@gmail.com>
  • To: starpu-devel@lists.gforge.inria.fr
  • Subject: [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);
  • Date: Tue, 29 Apr 2014 10:38:42 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,

I get a crash when trying to submit two remote tasks at the same time from
two workers.
When I add mutex protection around starpu_mpi_insert_task(), the crash
disappears.
So I suppose there is something that needs protection in the
_starpu_mpi_already_sent() function, maybe around the
HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent); call.
As far as I understand, one thread is searching the hash table while another
is reallocating it.
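
For illustration, here is a minimal sketch of the kind of protection
described above. It is not the StarPU code: the names cache_sent,
cache_mutex, NB_DEST and already_sent() are hypothetical, and a growable
array stands in for the uthash table. It assumes, as the backtrace suggests,
that the lookup and the insertion of a missing entry happen in the same
function, so a single mutex covers both.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NB_DEST 4 /* hypothetical number of MPI destinations */

/* Stand-in for one per-destination cache (the real code uses uthash). */
struct sent_cache {
    void **items;
    size_t count;
    size_t capacity;
};

static struct sent_cache cache_sent[NB_DEST];
static pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if data was already recorded for dest; otherwise records it
   and returns 0. Lookup and growth happen under the same lock, so no
   thread can search the storage while another one reallocates it. */
int already_sent(void *data, int dest)
{
    assert(dest >= 0 && dest < NB_DEST);
    int found = 0;

    pthread_mutex_lock(&cache_mutex);
    struct sent_cache *c = &cache_sent[dest];

    for (size_t i = 0; i < c->count; i++) {
        if (c->items[i] == data) {
            found = 1;
            break;
        }
    }

    if (!found) {
        if (c->count == c->capacity) {
            /* Grow the storage; this is the step that must not race
               with a concurrent lookup. */
            size_t new_cap = c->capacity ? 2 * c->capacity : 8;
            void **grown = realloc(c->items, new_cap * sizeof *grown);
            if (grown == NULL)
                abort();
            c->items = grown;
            c->capacity = new_cap;
        }
        c->items[c->count++] = data;
    }
    pthread_mutex_unlock(&cache_mutex);

    return found;
}
```

With this in place, two worker threads can call already_sent() for the same
destination concurrently; the second caller waits until the first has
finished its lookup and any reallocation. One mutex per destination would
reduce contention at the cost of a little more setup.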

I have copied the backtrace of the crashing MPI node into this mail.

Regards,

XL.


0x00007ffff530e9d6 in _starpu_mpi_already_sent (data=0x1327060, dest=1)
at starpu_mpi_cache.c:229
229 HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);
(gdb) thread apply all bt

Thread 7 (Thread 0x7fffa879b950 (LWP 4850)):
#0 0x00007ffff35596f8 in epoll_wait () from /lib64/libc.so.6
#1 0x00007ffff4ae1b79 in epoll_dispatch (base=0x7, arg=0xb8dfc0, tv=0x20)
at epoll.c:215
#2 0x00007ffff4ae2cfc in opal_event_base_loop (base=0x7, flags=12115904)
at event.c:838
#3 0x00007ffff4b15938 in opal_progress () at runtime/opal_progress.c:189
#4 0x00007ffff4a3e47e in ompi_request_default_test (rptr=0x7,
completed=0xb8dfc0, status=0x20) at request/req_test.c:88
#5 0x00007ffff4a62a83 in PMPI_Test (request=0x7, completed=0xb8dfc0,
status=0x20) at ptest.c:61
#6 0x00007ffff53078c8 in _starpu_mpi_progress_thread_func (arg=0x7)
at starpu_mpi.c:698
#7 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff355910d in clone () from /lib64/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7fffeaa27950 (LWP 4848)):
#0 0x00007ffff4af5398 in opal_memory_ptmalloc2_int_malloc (av=0x210,
bytes=37)
at malloc.c:4561
#1 0x00007ffff4af8068 in opal_memory_ptmalloc2_malloc (bytes=528)
at malloc.c:3433
#2 0x00007ffff530eae9 in _starpu_mpi_already_sent (data=0x1328eb0, dest=37)
at starpu_mpi_cache.c:234
#3 0x00007ffff530c29c in _starpu_mpi_exchange_data_before_execution (
data=0x210, mode=37, me=-1596362896, dest=32, do_execute=-1594883912,
comm=0x20) at starpu_mpi_insert_task.c:123
#4 0x00007ffff530b48f in starpu_mpi_insert_task (comm=0x210, codelet=0x25)
at starpu_mpi_insert_task.c:365
#5 0x000000000054bd5d in starpu_dsysubmit_outgoing_fanin (
sopalin_data=0xdb15a0, fcblk=0xd1a270, hcblk=0xd1a0a0)
at /home/lacoste/mateo70/pastix/build_nocuda/starpu/starpu_dsubmit.c:276
#6 0x00000000005244cb in sy_trfsp1d_gemm_starpu_common (
buffers=0x7fffa03a91a8, _args=0x7fffa03a9090, arch=0)
at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:989
#7 0x00000000005258ac in sy_trfsp1d_sparse_gemm_starpu_cpu (
buffers=0x7fffa03a91a8, _args=0x7fffa03a9090)
at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:1248
#8 0x00007ffff5065ba4 in _starpu_cpu_driver_run_once (cpu_worker=0x210)
at drivers/cpu/driver_cpu.c:164
#9 0x00007ffff506568b in _starpu_cpu_worker (arg=0x210)
at drivers/cpu/driver_cpu.c:368
#10 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff355910d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fffea226950 (LWP 4847)):
#0 0x00007ffff530e9d6 in _starpu_mpi_already_sent (data=0x1327060, dest=1)
at starpu_mpi_cache.c:229
#1 0x00007ffff530c29c in _starpu_mpi_exchange_data_before_execution (
data=0x7fffa03a45a0, mode=STARPU_R, me=31, dest=272, do_execute=455171,
comm=0x1e6cf) at starpu_mpi_insert_task.c:123
#2 0x00007ffff530b48f in starpu_mpi_insert_task (comm=0x7fffa03a45a0,
codelet=0x1) at starpu_mpi_insert_task.c:365
#3 0x000000000054bd5d in starpu_dsysubmit_outgoing_fanin (
sopalin_data=0xdb15a0, fcblk=0xd1a230, hcblk=0xd1a0e0)
at /home/lacoste/mateo70/pastix/build_nocuda/starpu/starpu_dsubmit.c:276
#4 0x00000000005244cb in sy_trfsp1d_gemm_starpu_common (
buffers=0x7fffa03a8ad8, _args=0x7fffa03a89c0, arch=0)
at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:989
#5 0x00000000005258ac in sy_trfsp1d_sparse_gemm_starpu_cpu (
buffers=0x7fffa03a8ad8, _args=0x7fffa03a89c0)
at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:1248
#6 0x00007ffff5065ba4 in _starpu_cpu_driver_run_once (
cpu_worker=0x7fffa03a45a0) at drivers/cpu/driver_cpu.c:164
#7 0x00007ffff506568b in _starpu_cpu_worker (arg=0x7fffa03a45a0)
at drivers/cpu/driver_cpu.c:368
#8 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#9 0x00007ffff355910d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fbb710 (LWP 4842)):
#0 0x00007ffff5baed59 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x00007ffff5019e96 in _starpu_barrier_counter_wait_for_empty_counter (
barrier_c=0x7ffff52eefec) at common/barrier_counter.c:41
#2 0x00007ffff501eb9f in starpu_task_wait_for_all () at core/task.c:743
#3 0x00000000004d02ae in sy_starpu_submit_tasks (sopalin_data=0xdb15a0)
at /home/lacoste/mateo70/pastix/starpu/starpu_submit_tasks_sy.c:1492
#4 0x0000000000463333 in sy_sopalin_thread (m=0xcbbd90, sopaparam=0xcbbfa8)
at /home/lacoste/mateo70/pastix/sopalin/sopalin3d_sy.c:1198
#5 0x0000000000425dc8 in pastix_task_sopalin (pastix_data=0xcbbd30,
pastix_comm=0x8ef600, n=12111, colptr=0xc367e0, row=0x7fffecd46010,
avals=0x7fffeccf6010, b=0xc4e270, rhsnbr=1, loc2glob=0x0)
at /home/lacoste/mateo70/pastix/sopalin/pastix.c:1588
#6 0x0000000000429677 in pastix (pastix_data=0x7fffffffb1d0,
pastix_comm=0x8ef600, n=12111, colptr=0xc367e0, row=0x7fffecd46010,
avals=0x7fffeccf6010, perm=0xc8c830, invp=0xca42b0, b=0xc4e270, rhs=1,
iparm=0x7fffffffb258, dparm=0x7fffffffb4e0)
at /home/lacoste/mateo70/pastix/sopalin/pastix.c:2929
#7 0x000000000042060c in main (argc=24, argv=0x7fffffffb888)
at /home/lacoste/mateo70/pastix/example/simple.c:205




