Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency
- From: Benoît Lizé <benoit.lize@gmail.com>
- To: Nathalie Furmento <nathalie.furmento@inria.fr>
- Cc: starpu-devel@lists.gforge.inria.fr, Marc Sergent <marc.sergent@inria.fr>
- Subject: Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency
- Date: Fri, 29 Mar 2013 14:38:55 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Indeed, it works now.
Thank you!
On Fri, Mar 29, 2013 at 11:07 AM, Nathalie Furmento <nathalie.furmento@inria.fr> wrote:
Benoit,
The bug should be fixed by r9056. Could you please try and let us know if it works now?
Marc has not been able to reproduce the bug with the fix.
Thanks,
Nathalie
On 28/03/2013 15:39, Benoît Lizé wrote:
Hello,
I have the same issue with a much larger code, and was trying to find the root cause before posting a message to this mailing list. Here is a stack trace I get:
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe3760700 (LWP 19450)]
0x00007ffff4bb5475 in *__GI_raise (sig=&lt;optimized out&gt;) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff4bb5475 in *__GI_raise (sig=&lt;optimized out&gt;) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff4bb86f0 in *__GI_abort () at abort.c:92
#2  0x00007ffff5871443 in _starpu_release_data_enforce_sequential_consistency (task=0x7fffd7381b90, handle=0x7546e70) at core/dependencies/implicit_data_deps.c:355
#3  0x00007ffff5871b28 in _starpu_unlock_post_sync_tasks (handle=0x7546e70) at core/dependencies/implicit_data_deps.c:523
#4  0x00007ffff589ae6f in starpu_data_release_on_node (handle=0x7546e70, node=0) at datawizard/user_interactions.c:328
#5  0x00007ffff589ae8e in starpu_data_release (handle=0x7546e70) at datawizard/user_interactions.c:333
#6  0x00007ffff5643fd3 in _starpu_mpi_handle_request_termination (req=0x7fffd43288f0) at starpu_mpi.c:666
#7  0x00007ffff56446d8 in _starpu_mpi_test_detached_requests () at starpu_mpi.c:755
#8  0x00007ffff5645207 in _starpu_mpi_progress_thread_func (arg=0x102a800) at starpu_mpi.c:904
#9  0x00007ffff4f13b50 in start_thread (arg=&lt;optimized out&gt;) at pthread_create.c:304
#10 0x00007ffff4c5da7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()
Here is some context:
- HMatrix code, using "user" datatypes (custom interface with pack_data()/unpack_data())
- Reproducible, so not likely to be a race
- I have only seen it crashing with more than 2 nodes
- StarPU SVN r9003
- Debian testing x86_64, OpenMPI 1.4.4
Unfortunately, I don't have anything pointing to the root cause...
--
Benoît Lizé
On Thu, Mar 28, 2013 at 3:32 PM, Marc Sergent <marc.sergent@inria.fr> wrote:
Hi!
When executing mpi/tests/user_defined_datatype with the latest version of the trunk and MPICH2 version 1.4.1, I sometimes get the following error:
(the test is run on my laptop, which runs Debian; StarPU is configured with FxT and MKL)
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
Testing with function 0x401b30
Testing with function 0x401870
core/dependencies/implicit_data_deps.c:355 pthread_mutex_lock: Invalid argument
[starpu][abort] core/dependencies/implicit_data_deps.c:355 _starpu_release_data_enforce_sequential_consistency
Program received signal SIGABRT, Aborted.
starpu-mpi is handling a detached request and calls starpu_data_release on the data, which calls _starpu_unlock_post_sync_tasks, which itself calls _starpu_release_data_enforce_sequential_consistency. This function successfully takes the sequential_consistency_mutex lock but fails to unlock it before exiting.
Here are the backtrace and some values which could be interesting. The problem can be reproduced quite easily; I can send any other data if needed.
(gdb) bt
#0  0x00007ffff43d7475 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff43da6f0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff5a07ec4 in _starpu_release_data_enforce_sequential_consistency (task=0x1fd55d0, handle=0x158f7a0) at core/dependencies/implicit_data_deps.c:437
#3  0x00007ffff5a00515 in _starpu_handle_job_termination (j=j@entry=0x1fd5960) at core/jobs.c:177
#4  0x00007ffff5a11ae0 in _starpu_push_task (j=j@entry=0x1fd5960) at core/sched_policy.c:357
#5  0x00007ffff5a00af3 in _starpu_enforce_deps_and_schedule (j=j@entry=0x1fd5960) at core/jobs.c:381
#6  0x00007ffff5a014ec in _starpu_submit_job (j=j@entry=0x1fd5960) at core/task.c:249
#7  0x00007ffff5a02984 in starpu_task_submit (task=0x1fd55d0) at core/task.c:473
#8  0x00007ffff5a02c02 in _starpu_task_submit_internally (task=&lt;optimized out&gt;) at core/task.c:489
#9  0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0) at core/dependencies/implicit_data_deps.c:525
#10 0x00007ffff5a2402b in starpu_data_release_on_node (handle=&lt;optimized out&gt;, node=node@entry=0) at datawizard/user_interactions.c:330
#11 0x00007ffff5a24057 in starpu_data_release (handle=&lt;optimized out&gt;) at datawizard/user_interactions.c:335
#12 0x00007ffff5c97b1c in _starpu_mpi_handle_request_termination (req=req@entry=0x1fd5250) at starpu_mpi.c:677
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests () at starpu_mpi.c:767
#14 _starpu_mpi_progress_thread_func (arg=0x60c530) at starpu_mpi.c:915
#15 0x00007ffff534ab50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007ffff447fa7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) f 9
#9  0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0) at core/dependencies/implicit_data_deps.c:525
525         int ret = _starpu_task_submit_internally(link->task);
(gdb) p *link->task
$3 = {cl = 0x0,
  buffers = {{handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}},
  handles = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  interfaces = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  cl_arg = 0x0, cl_arg_size = 0, callback_func = 0, callback_arg = 0x0,
  use_tag = 0, tag_id = 0, sequential_consistency = 1, synchronous = 0,
  priority = 0, execute_on_a_specific_worker = 0, workerid = 0, bundle = 0x0,
  detach = 1, destroy = 1, regenerate = 0, status = STARPU_TASK_FINISHED,
  profiling_info = 0x0, predicted = nan(0x8000000000000), predicted_transfer = nan(0x8000000000000),
  prev = 0x0, next = 0x0, mf_skip = 0, starpu_private = 0x1fd5960, magic = 42,
  sched_ctx = 0, hypervisor_tag = 0, flops = 0, scheduled = 0}
(gdb) f 13
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests () at starpu_mpi.c:767
767             _starpu_mpi_handle_request_termination(req);
(gdb) p *req->data_handle
$6 = {req_list = 0x1591540, refcnt = 0, current_mode = STARPU_R,
  header_lock = {lock = 1}, busy_count = 0, busy_waiting = 1,
  busy_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' &lt;repeats 16 times&gt;"\377, \377\377\377", '\000' &lt;repeats 19 times&gt;, __align = 0},
  busy_cond = {__data = {__lock = 1, __futex = 0, __total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' &lt;repeats 31 times&gt;, __align = 1},
  root_handle = 0x158f7a0, father_handle = 0x0, sibling_index = 0, depth = 1,
  children = 0x0, nchildren = 0,
  per_node = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0, data_interface = 0x15914e0, memory_node = 0, relaxed_coherency = 0, initialized = 0, state = STARPU_OWNER, refcnt = 0, allocated = 1 '\001', automatically_allocated = 0 '\000', mc = 0x0, requested = "", request = {0x0}}},
  per_worker = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0, data_interface = 0x1591500, memory_node = 0, relaxed_coherency = 1, initialized = 0, state = STARPU_INVALID, refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000', mc = 0x0, requested = "", request = {0x0}},
    {_prev = 0x0, _next = 0x0, handle = 0x158f7a0, data_interface = 0x1591520, memory_node = 0, relaxed_coherency = 1, initialized = 0, state = STARPU_INVALID, refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000', mc = 0x0, requested = "", request = {0x0}},
    {_prev = 0x0, _next = 0x0, handle = 0x0, data_interface = 0x0, memory_node = 0, relaxed_coherency = 0, initialized = 0, state = STARPU_OWNER, refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000', mc = 0x0, requested = "", request = {0x0}} &lt;repeats 78 times&gt;},
  ops = 0x6048a0, footprint = 3655735684, home_node = 0, wt_mask = 0,
  is_readonly = 0 '\000', is_not_important = 0, sequential_consistency = 1,
  sequential_consistency_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' &lt;repeats 16 times&gt;"\377, \377\377\377", '\000' &lt;repeats 19 times&gt;, __align = 0},
  last_submitted_mode = STARPU_RW, last_submitted_writer = 0x0, last_submitted_readers = 0x0,
  last_submitted_ghost_writer_id_is_valid = 1, last_submitted_ghost_writer_id = 29305,
  last_submitted_ghost_readers_id = 0x0, post_sync_tasks = 0x0, post_sync_tasks_cnt = 0,
  redux_cl = 0x0, init_cl = 0x0, reduction_refcnt = 0, reduction_req_list = 0x1591560,
  reduction_tmp_handles = {0x0 &lt;repeats 80 times&gt;}, lazy_unregister = 0,
  rank = 1, tag = 4813, memory_stats = 0x0, mf_node = 6525744}
Thanks for any help!
Marc
_______________________________________________
Starpu-devel mailing list
Starpu-devel@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/starpu-devel