Subject: Developers list for StarPU
List archives
Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency
- From: Benoît Lizé <benoit.lize@gmail.com>
- To: Marc Sergent <marc.sergent@inria.fr>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency
- Date: Thu, 28 Mar 2013 15:39:39 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hello,
I have the same issue with a much larger code, and was trying to find the root cause before posting a message to this mailing list.
Here is the stack trace I get:
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe3760700 (LWP 19450)]
0x00007ffff4bb5475 in *__GI_raise (sig=<optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff4bb5475 in *__GI_raise (sig=<optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff4bb86f0 in *__GI_abort () at abort.c:92
#2 0x00007ffff5871443 in _starpu_release_data_enforce_sequential_consistency (
task=0x7fffd7381b90, handle=0x7546e70) at core/dependencies/implicit_data_deps.c:355
#3 0x00007ffff5871b28 in _starpu_unlock_post_sync_tasks (handle=0x7546e70)
at core/dependencies/implicit_data_deps.c:523
#4 0x00007ffff589ae6f in starpu_data_release_on_node (handle=0x7546e70, node=0)
at datawizard/user_interactions.c:328
#5 0x00007ffff589ae8e in starpu_data_release (handle=0x7546e70)
at datawizard/user_interactions.c:333
#6 0x00007ffff5643fd3 in _starpu_mpi_handle_request_termination (req=0x7fffd43288f0)
at starpu_mpi.c:666
#7 0x00007ffff56446d8 in _starpu_mpi_test_detached_requests () at starpu_mpi.c:755
#8 0x00007ffff5645207 in _starpu_mpi_progress_thread_func (arg=0x102a800) at starpu_mpi.c:904
#9 0x00007ffff4f13b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007ffff4c5da7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()
Here is some context:
- HMatrix code, using "user" datatypes (custom interface with pack_data()/unpack_data())
- Reproducible, so unlikely to be a race condition
- I have only seen it crashing with more than 2 nodes
- StarPU SVN r9003
- Debian testing x86_64, OpenMPI 1.4.4
Unfortunately, I don't have anything pointing to the root cause...
--
Benoît Lizé
On Thu, Mar 28, 2013 at 3:32 PM, Marc Sergent <marc.sergent@inria.fr> wrote:
Hi!

When executing mpi/tests/user_defined_datatype with the latest version of the trunk and MPICH2 version 1.4.1, I sometimes get the following error (the test is run on my laptop, which runs Debian, and StarPU is configured with FxT and MKL):

[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
Testing with function 0x401b30
Testing with function 0x401870
core/dependencies/implicit_data_deps.c:355 pthread_mutex_lock: Invalid argument
[starpu][abort] core/dependencies/implicit_data_deps.c:355 _starpu_release_data_enforce_sequential_consistency
Program received signal SIGABRT, Aborted.

starpu-mpi is dealing with a detached request, and is calling starpu_data_release on the data, which calls _starpu_unlock_post_sync_tasks, which itself calls _starpu_release_data_enforce_sequential_consistency. This function is able to take the lock sequential_consistency_mutex but fails to unlock it before exiting.

Here is the backtrace and some values which could be interesting. The problem can be reproduced quite easily; I can send any other data if needed.

(gdb) bt
#0 0x00007ffff43d7475 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff43da6f0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff5a07ec4 in _starpu_release_data_enforce_sequential_consistency (
task=0x1fd55d0, handle=0x158f7a0) at core/dependencies/implicit_data_deps.c:437
#3 0x00007ffff5a00515 in _starpu_handle_job_termination (j=j@entry=0x1fd5960)
at core/jobs.c:177
#4 0x00007ffff5a11ae0 in _starpu_push_task (j=j@entry=0x1fd5960)
at core/sched_policy.c:357
#5 0x00007ffff5a00af3 in _starpu_enforce_deps_and_schedule (
j=j@entry=0x1fd5960) at core/jobs.c:381
#6 0x00007ffff5a014ec in _starpu_submit_job (j=j@entry=0x1fd5960)
at core/task.c:249
#7 0x00007ffff5a02984 in starpu_task_submit (task=0x1fd55d0)
at core/task.c:473
#8 0x00007ffff5a02c02 in _starpu_task_submit_internally (task=<optimized out>)
at core/task.c:489
#9 0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0)
at core/dependencies/implicit_data_deps.c:525
#10 0x00007ffff5a2402b in starpu_data_release_on_node (handle=<optimized out>,
node=node@entry=0) at datawizard/user_interactions.c:330
#11 0x00007ffff5a24057 in starpu_data_release (handle=<optimized out>)
at datawizard/user_interactions.c:335
#12 0x00007ffff5c97b1c in _starpu_mpi_handle_request_termination (
req=req@entry=0x1fd5250) at starpu_mpi.c:677
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests ()
at starpu_mpi.c:767
#14 _starpu_mpi_progress_thread_func (arg=0x60c530) at starpu_mpi.c:915
#15 0x00007ffff534ab50 in start_thread ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007ffff447fa7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) f 9
#9 0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0)
at core/dependencies/implicit_data_deps.c:525
525 int ret = _starpu_task_submit_internally(link->task);
(gdb) p *link->task
$3 = {cl = 0x0, buffers = {{handle = 0x0, mode = STARPU_NONE}, {handle = 0x0,
mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0,
mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0,
mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0,
mode = STARPU_NONE}}, handles = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0}, interfaces = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, cl_arg = 0x0,
cl_arg_size = 0, callback_func = 0, callback_arg = 0x0, use_tag = 0,
tag_id = 0, sequential_consistency = 1, synchronous = 0, priority = 0,
execute_on_a_specific_worker = 0, workerid = 0, bundle = 0x0, detach = 1,
destroy = 1, regenerate = 0, status = STARPU_TASK_FINISHED,
profiling_info = 0x0, predicted = nan(0x8000000000000),
predicted_transfer = nan(0x8000000000000), prev = 0x0, next = 0x0,
mf_skip = 0, starpu_private = 0x1fd5960, magic = 42, sched_ctx = 0,
hypervisor_tag = 0, flops = 0, scheduled = 0}
(gdb) f 13
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests () at starpu_mpi.c:767
767 _starpu_mpi_handle_request_termination(req);
(gdb) p *req->data_handle
$6 = {req_list = 0x1591540, refcnt = 0, current_mode = STARPU_R,
header_lock = {lock = 1}, busy_count = 0, busy_waiting = 1, busy_mutex = {
__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1,
__spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>,
__align = 0}, busy_cond = {__data = {__lock = 1, __futex = 0,
__total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0,
__mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>,
__align = 1}, root_handle = 0x158f7a0,
father_handle = 0x0, sibling_index = 0, depth = 1, children = 0x0,
nchildren = 0, per_node = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0,
data_interface = 0x15914e0, memory_node = 0, relaxed_coherency = 0,
initialized = 0, state = STARPU_OWNER, refcnt = 0, allocated = 1 '\001',
automatically_allocated = 0 '\000', mc = 0x0, requested = "",
request = {0x0}}}, per_worker = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0,
data_interface = 0x1591500, memory_node = 0, relaxed_coherency = 1,
initialized = 0, state = STARPU_INVALID, refcnt = 0,
allocated = 0 '\000', automatically_allocated = 0 '\000', mc = 0x0,
requested = "", request = {0x0}}, {_prev = 0x0, _next = 0x0,
handle = 0x158f7a0, data_interface = 0x1591520, memory_node = 0,
relaxed_coherency = 1, initialized = 0, state = STARPU_INVALID,
refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000',
mc = 0x0, requested = "", request = {0x0}}, {_prev = 0x0, _next = 0x0,
handle = 0x0, data_interface = 0x0, memory_node = 0,
relaxed_coherency = 0, initialized = 0, state = STARPU_OWNER,
refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000',
mc = 0x0, requested = "", request = {0x0}} <repeats 78 times>},
ops = 0x6048a0, footprint = 3655735684, home_node = 0, wt_mask = 0,
is_readonly = 0 '\000', is_not_important = 0, sequential_consistency = 1,
sequential_consistency_mutex = {__data = {__lock = 0, __count = 0,
__owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {
__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>,
__align = 0}, last_submitted_mode = STARPU_RW,
last_submitted_writer = 0x0, last_submitted_readers = 0x0,
last_submitted_ghost_writer_id_is_valid = 1,
last_submitted_ghost_writer_id = 29305,
last_submitted_ghost_readers_id = 0x0, post_sync_tasks = 0x0,
post_sync_tasks_cnt = 0, redux_cl = 0x0, init_cl = 0x0,
reduction_refcnt = 0, reduction_req_list = 0x1591560,
reduction_tmp_handles = {0x0 <repeats 80 times>}, lazy_unregister = 0,
rank = 1, tag = 4813, memory_stats = 0x0, mf_node = 6525744}

Thanks for any help!

Marc
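
For readers following the log above: the line "pthread_mutex_lock: Invalid argument" is the text of the EINVAL error code returned by the lock call, printed with the file and line of the caller. The sketch below is only an illustration of that style of return-code checking in plain C with pthreads. The helper report_pthread_error, the CHECKED_MUTEX_LOCK macro, and the demo function are hypothetical names invented for this example and are not StarPU code. Because locking an invalid or destroyed mutex is not something that can be demonstrated portably, the demo uses an error-checking mutex and a double lock, which POSIX defines to return EDEADLK.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK under strict -std modes */
#endif
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of StarPU): print a failing pthread call
 * as "file:line call: reason", then hand the error code back to the caller. */
static int report_pthread_error(const char *file, int line,
                                const char *call, int err)
{
    if (err != 0)
        fprintf(stderr, "%s:%d %s: %s\n", file, line, call, strerror(err));
    return err;
}

/* Hypothetical macro: lock a mutex and report any non-zero return code. */
#define CHECKED_MUTEX_LOCK(m) \
    report_pthread_error(__FILE__, __LINE__, "pthread_mutex_lock", \
                         pthread_mutex_lock(m))

/* Initialise a mutex of type PTHREAD_MUTEX_ERRORCHECK, which returns an
 * error code on misuse instead of deadlocking silently. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err != 0)
        return err;
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}

/* Lock the same error-checking mutex twice from one thread and return the
 * code of the second attempt; the failure is also reported on stderr. */
int demo_relock_is_reported(void)
{
    pthread_mutex_t m;
    if (init_errorcheck_mutex(&m) != 0)
        return -1;
    int first = CHECKED_MUTEX_LOCK(&m);
    int second = CHECKED_MUTEX_LOCK(&m);   /* misuse: already held by us */
    if (first == 0)
        pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

Calling demo_relock_is_reported() from a small main() prints one "file:line pthread_mutex_lock: ..." line on stderr and returns the error code of the second lock attempt. Building with an error-checking mutex in debug configurations is one possible way to turn silent locking mistakes into reported return codes; whether that would help with the crash discussed here is not established.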
_______________________________________________
Starpu-devel mailing list
Starpu-devel@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/starpu-devel