Subject: Developers list for StarPU
Re: [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);
- From: Nathalie Furmento <nathalie.furmento@labri.fr>
- To: Xavier Lacoste <xl64100@gmail.com>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);
- Date: Tue, 29 Apr 2014 17:18:14 +0200
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Xavier,
I am looking at the problem; I will get back to you as soon as it is fixed.

Cheers,
Nathalie

On 29/04/2014 10:38, Xavier Lacoste wrote:

Hello,

I get a crash when trying to submit 2 remote tasks at the same time on two workers. When I add mutex protection around starpu_mpi_insert_task(), the crash disappears. So I suppose there is something that needs protection in the _starpu_mpi_already_sent() function, maybe around the HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent); call. As far as I understand, one thread is searching in the hash while another is reallocating it.

I copy in this mail the backtrace of the crashing MPI node.

Regards,
XL.

0x00007ffff530e9d6 in _starpu_mpi_already_sent (data=0x1327060, dest=1) at starpu_mpi_cache.c:229
229 HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);
(gdb) thread apply all bt

Thread 7 (Thread 0x7fffa879b950 (LWP 4850)):
#0 0x00007ffff35596f8 in epoll_wait () from /lib64/libc.so.6
#1 0x00007ffff4ae1b79 in epoll_dispatch (base=0x7, arg=0xb8dfc0, tv=0x20) at epoll.c:215
#2 0x00007ffff4ae2cfc in opal_event_base_loop (base=0x7, flags=12115904) at event.c:838
#3 0x00007ffff4b15938 in opal_progress () at runtime/opal_progress.c:189
#4 0x00007ffff4a3e47e in ompi_request_default_test (rptr=0x7, completed=0xb8dfc0, status=0x20) at request/req_test.c:88
#5 0x00007ffff4a62a83 in PMPI_Test (request=0x7, completed=0xb8dfc0, status=0x20) at ptest.c:61
#6 0x00007ffff53078c8 in _starpu_mpi_progress_thread_func (arg=0x7) at starpu_mpi.c:698
#7 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff355910d in clone () from /lib64/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7fffeaa27950 (LWP 4848)):
#0 0x00007ffff4af5398 in opal_memory_ptmalloc2_int_malloc (av=0x210, bytes=37) at malloc.c:4561
#1 0x00007ffff4af8068 in opal_memory_ptmalloc2_malloc (bytes=528) at malloc.c:3433
#2 0x00007ffff530eae9 in _starpu_mpi_already_sent (data=0x1328eb0, dest=37) at starpu_mpi_cache.c:234
#3 0x00007ffff530c29c in _starpu_mpi_exchange_data_before_execution (data=0x210, mode=37, me=-1596362896, dest=32, do_execute=-1594883912, comm=0x20) at starpu_mpi_insert_task.c:123
#4 0x00007ffff530b48f in starpu_mpi_insert_task (comm=0x210, codelet=0x25) at starpu_mpi_insert_task.c:365
#5 0x000000000054bd5d in starpu_dsysubmit_outgoing_fanin (sopalin_data=0xdb15a0, fcblk=0xd1a270, hcblk=0xd1a0a0) at /home/lacoste/mateo70/pastix/build_nocuda/starpu/starpu_dsubmit.c:276
#6 0x00000000005244cb in sy_trfsp1d_gemm_starpu_common (buffers=0x7fffa03a91a8, _args=0x7fffa03a9090, arch=0) at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:989
#7 0x00000000005258ac in sy_trfsp1d_sparse_gemm_starpu_cpu (buffers=0x7fffa03a91a8, _args=0x7fffa03a9090) at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:1248
#8 0x00007ffff5065ba4 in _starpu_cpu_driver_run_once (cpu_worker=0x210) at drivers/cpu/driver_cpu.c:164
#9 0x00007ffff506568b in _starpu_cpu_worker (arg=0x210) at drivers/cpu/driver_cpu.c:368
#10 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff355910d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fffea226950 (LWP 4847)):
#0 0x00007ffff530e9d6 in _starpu_mpi_already_sent (data=0x1327060, dest=1) at starpu_mpi_cache.c:229
#1 0x00007ffff530c29c in _starpu_mpi_exchange_data_before_execution (data=0x7fffa03a45a0, mode=STARPU_R, me=31, dest=272, do_execute=455171, comm=0x1e6cf) at starpu_mpi_insert_task.c:123
#2 0x00007ffff530b48f in starpu_mpi_insert_task (comm=0x7fffa03a45a0, codelet=0x1) at starpu_mpi_insert_task.c:365
#3 0x000000000054bd5d in starpu_dsysubmit_outgoing_fanin (sopalin_data=0xdb15a0, fcblk=0xd1a230, hcblk=0xd1a0e0) at /home/lacoste/mateo70/pastix/build_nocuda/starpu/starpu_dsubmit.c:276
#4 0x00000000005244cb in sy_trfsp1d_gemm_starpu_common (buffers=0x7fffa03a8ad8, _args=0x7fffa03a89c0, arch=0) at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:989
#5 0x00000000005258ac in sy_trfsp1d_sparse_gemm_starpu_cpu (buffers=0x7fffa03a8ad8, _args=0x7fffa03a89c0) at /home/lacoste/mateo70/pastix/starpu/starpu_kernels_sy.c:1248
#6 0x00007ffff5065ba4 in _starpu_cpu_driver_run_once (cpu_worker=0x7fffa03a45a0) at drivers/cpu/driver_cpu.c:164
#7 0x00007ffff506568b in _starpu_cpu_worker (arg=0x7fffa03a45a0) at drivers/cpu/driver_cpu.c:368
#8 0x00007ffff5bab070 in start_thread () from /lib64/libpthread.so.0
#9 0x00007ffff355910d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fbb710 (LWP 4842)):
#0 0x00007ffff5baed59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ffff5019e96 in _starpu_barrier_counter_wait_for_empty_counter (barrier_c=0x7ffff52eefec) at common/barrier_counter.c:41
#2 0x00007ffff501eb9f in starpu_task_wait_for_all () at core/task.c:743
#3 0x00000000004d02ae in sy_starpu_submit_tasks (sopalin_data=0xdb15a0) at /home/lacoste/mateo70/pastix/starpu/starpu_submit_tasks_sy.c:1492
#4 0x0000000000463333 in sy_sopalin_thread (m=0xcbbd90, sopaparam=0xcbbfa8) at /home/lacoste/mateo70/pastix/sopalin/sopalin3d_sy.c:1198
#5 0x0000000000425dc8 in pastix_task_sopalin (pastix_data=0xcbbd30, pastix_comm=0x8ef600, n=12111, colptr=0xc367e0, row=0x7fffecd46010, avals=0x7fffeccf6010, b=0xc4e270, rhsnbr=1, loc2glob=0x0) at /home/lacoste/mateo70/pastix/sopalin/pastix.c:1588
#6 0x0000000000429677 in pastix (pastix_data=0x7fffffffb1d0, pastix_comm=0x8ef600, n=12111, colptr=0xc367e0, row=0x7fffecd46010, avals=0x7fffeccf6010, perm=0xc8c830, invp=0xca42b0, b=0xc4e270, rhs=1, iparm=0x7fffffffb258, dparm=0x7fffffffb4e0) at /home/lacoste/mateo70/pastix/sopalin/pastix.c:2929
#7 0x000000000042060c in main (argc=24, argv=0x7fffffffb888) at /home/lacoste/mateo70/pastix/example/simple.c:205
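The backtrace shows two worker threads inside _starpu_mpi_already_sent() at once: Thread 4 in the lookup at line 229 and Thread 5 in a malloc at line 234. One common way to avoid this kind of race is to perform the lookup and the insertion under the same lock. The following is only a minimal sketch in C of that idea, not the actual StarPU fix: `cache_already_sent`, `sent_cache`, `sent_cache_mutex`, `NB_NODES`, `NB_BUCKETS` and the fixed-size chained table are all hypothetical stand-ins for `_starpu_mpi_already_sent` and the hash table used in starpu_mpi_cache.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NB_NODES   4   /* hypothetical number of MPI destinations */
#define NB_BUCKETS 64  /* hypothetical fixed bucket count */

/* One entry per data pointer already sent to a given destination. */
struct sent_entry {
    const void *data;
    struct sent_entry *next;
};

/* Hypothetical stand-in for _cache_sent_data[dest]. */
static struct sent_entry *sent_cache[NB_NODES][NB_BUCKETS];

/* A single lock guards every read and write of sent_cache. */
static pthread_mutex_t sent_cache_mutex = PTHREAD_MUTEX_INITIALIZER;

static size_t bucket_of(const void *data)
{
    return (size_t)(((uintptr_t)data >> 4) % NB_BUCKETS);
}

/*
 * Returns 1 if data was already recorded for dest, 0 if this call
 * recorded it. Lookup and insertion happen in one critical section,
 * so no thread can walk a chain while another thread modifies it.
 */
int cache_already_sent(const void *data, int dest)
{
    assert(dest >= 0 && dest < NB_NODES);
    size_t b = bucket_of(data);
    int found = 0;

    pthread_mutex_lock(&sent_cache_mutex);
    for (struct sent_entry *e = sent_cache[dest][b]; e != NULL; e = e->next) {
        if (e->data == data) {
            found = 1;
            break;
        }
    }
    if (!found) {
        struct sent_entry *e = malloc(sizeof(*e));
        if (e == NULL)
            abort();
        e->data = data;
        e->next = sent_cache[dest][b];
        sent_cache[dest][b] = e;
    }
    pthread_mutex_unlock(&sent_cache_mutex);
    return found;
}
```

With this shape, two workers that call `cache_already_sent()` for the same destination are serialized inside the function itself, so the caller no longer needs a mutex around the whole task submission. A single global lock is the simplest choice; one lock per destination would reduce contention at the cost of a little more setup.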
- [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);, Xavier Lacoste, 29/04/2014
- Re: [Starpu-devel] Concurrent access to HASH_FIND_PTR(_cache_sent_data[dest], &data, already_sent);, Nathalie Furmento, 29/04/2014