
starpu-devel - Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency

  • From: Benoît Lizé <benoit.lize@gmail.com>
  • To: Marc Sergent <marc.sergent@inria.fr>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] Error with MPI and _starpu_release_data_enforce_sequential_consistency
  • Date: Thu, 28 Mar 2013 15:39:39 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,

I have the same issue with a much larger code, and was trying to find the root cause before posting a message to this mailing list.
Here is the stack trace I get:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe3760700 (LWP 19450)]
0x00007ffff4bb5475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff4bb5475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff4bb86f0 in *__GI_abort () at abort.c:92
#2  0x00007ffff5871443 in _starpu_release_data_enforce_sequential_consistency (
    task=0x7fffd7381b90, handle=0x7546e70) at core/dependencies/implicit_data_deps.c:355
#3  0x00007ffff5871b28 in _starpu_unlock_post_sync_tasks (handle=0x7546e70)
    at core/dependencies/implicit_data_deps.c:523
#4  0x00007ffff589ae6f in starpu_data_release_on_node (handle=0x7546e70, node=0)
    at datawizard/user_interactions.c:328
#5  0x00007ffff589ae8e in starpu_data_release (handle=0x7546e70)
    at datawizard/user_interactions.c:333
#6  0x00007ffff5643fd3 in _starpu_mpi_handle_request_termination (req=0x7fffd43288f0)
    at starpu_mpi.c:666
#7  0x00007ffff56446d8 in _starpu_mpi_test_detached_requests () at starpu_mpi.c:755
#8  0x00007ffff5645207 in _starpu_mpi_progress_thread_func (arg=0x102a800) at starpu_mpi.c:904
#9  0x00007ffff4f13b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007ffff4c5da7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()


Here is some context:
  • HMatrix code, using "user" datatypes (custom interface with pack_data()/unpack_data())
  • Reproducible, so not likely to be a race
  • I have only seen it crashing with more than 2 nodes
  • StarPU SVN r9003
  • Debian testing x86_64, OpenMPI 1.4.4

Unfortunately, I don't have anything pointing to the root cause...

-- 
Benoît Lizé




On Thu, Mar 28, 2013 at 3:32 PM, Marc Sergent <marc.sergent@inria.fr> wrote:
Hi!

When executing mpi/tests/user_defined_datatype with the latest version of the trunk and MPICH2 version 1.4.1, I sometimes get the following error:

(The test is run on my laptop, which runs Debian; StarPU is configured with FxT and MKL.)

[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
Testing with function 0x401b30
Testing with function 0x401870
core/dependencies/implicit_data_deps.c:355 pthread_mutex_lock: Invalid argument
[starpu][abort] core/dependencies/implicit_data_deps.c:355 _starpu_release_data_enforce_sequential_consistency

Program received signal SIGABRT, Aborted.

starpu-mpi is handling a detached request and calls starpu_data_release on the data, which calls _starpu_unlock_post_sync_tasks, which in turn calls _starpu_release_data_enforce_sequential_consistency. This function is able to take the lock sequential_consistency_mutex but fails to unlock it before exiting.
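
As a side illustration of the kind of checked locking that produces a log line shaped like "file:line pthread_mutex_lock: Invalid argument", here is a minimal C sketch. It is not StarPU's code: the helper names (describe_pthread_error, checked_lock_unlock, unlock_unowned_errorcheck_mutex) are hypothetical and chosen only for this example. To stay within defined behaviour it does not try to reproduce the EINVAL above; it uses an error-checking mutex and triggers EPERM by unlocking a mutex the calling thread does not hold.

```c
#define _XOPEN_SOURCE 700 /* PTHREAD_MUTEX_ERRORCHECK under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not StarPU's real macro: write "file:line call: reason"
 * into buf, mirroring the layout of the log line quoted in this message. */
static void describe_pthread_error(char *buf, size_t len, const char *file,
                                   int line, const char *call, int err)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s:%d %s: %s", file, line, call, strerror(err));
}

/* Lock then unlock m, checking both return codes. Returns 0 on success,
 * otherwise the pthread error code, with a description left in buf. */
static int checked_lock_unlock(pthread_mutex_t *m, char *buf, size_t len)
{
    int err = pthread_mutex_lock(m);
    if (err != 0) {
        describe_pthread_error(buf, len, __FILE__, __LINE__,
                               "pthread_mutex_lock", err);
        return err;
    }
    err = pthread_mutex_unlock(m);
    if (err != 0) {
        describe_pthread_error(buf, len, __FILE__, __LINE__,
                               "pthread_mutex_unlock", err);
        return err;
    }
    buf[0] = '\0';
    return 0;
}

/* Well-defined failure case: unlocking an error-checking mutex that the
 * calling thread never locked returns EPERM instead of silently succeeding. */
static int unlock_unowned_errorcheck_mutex(char *buf, size_t len)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int err;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    err = pthread_mutex_unlock(&m); /* never locked by this thread */
    if (err != 0)
        describe_pthread_error(buf, len, __FILE__, __LINE__,
                               "pthread_mutex_unlock", err);
    pthread_mutex_destroy(&m);
    return err;
}
```

Calling unlock_unowned_errorcheck_mutex(buf, sizeof buf) from a main() and printing buf should show the failing call and its reason; build with -pthread. Checking every return code this way is what turns a silent misuse into a located abort like the one in the trace.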

Here is the backtrace, along with some values that could be interesting. The problem can be reproduced quite easily; I can send any other data if needed.

(gdb) bt
#0  0x00007ffff43d7475 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff43da6f0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff5a07ec4 in _starpu_release_data_enforce_sequential_consistency (
    task=0x1fd55d0, handle=0x158f7a0)
    at core/dependencies/implicit_data_deps.c:437
#3  0x00007ffff5a00515 in _starpu_handle_job_termination (j=j@entry=0x1fd5960)
    at core/jobs.c:177
#4  0x00007ffff5a11ae0 in _starpu_push_task (j=j@entry=0x1fd5960)
    at core/sched_policy.c:357
#5  0x00007ffff5a00af3 in _starpu_enforce_deps_and_schedule (
    j=j@entry=0x1fd5960) at core/jobs.c:381
#6  0x00007ffff5a014ec in _starpu_submit_job (j=j@entry=0x1fd5960)
    at core/task.c:249
#7  0x00007ffff5a02984 in starpu_task_submit (task=0x1fd55d0)
    at core/task.c:473
#8  0x00007ffff5a02c02 in _starpu_task_submit_internally (task=<optimized out>)
    at core/task.c:489
#9  0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0)
    at core/dependencies/implicit_data_deps.c:525
#10 0x00007ffff5a2402b in starpu_data_release_on_node (handle=<optimized out>, 
    node=node@entry=0) at datawizard/user_interactions.c:330
#11 0x00007ffff5a24057 in starpu_data_release (handle=<optimized out>)
    at datawizard/user_interactions.c:335
#12 0x00007ffff5c97b1c in _starpu_mpi_handle_request_termination (
    req=req@entry=0x1fd5250) at starpu_mpi.c:677
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests ()
    at starpu_mpi.c:767
#14 _starpu_mpi_progress_thread_func (arg=0x60c530) at starpu_mpi.c:915
#15 0x00007ffff534ab50 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007ffff447fa7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()

(gdb) f 9
#9  0x00007ffff5a081cf in _starpu_unlock_post_sync_tasks (handle=0x158f7a0)
    at core/dependencies/implicit_data_deps.c:525
525                             int ret = _starpu_task_submit_internally(link->task);
(gdb) p *link->task 
$3 = {cl = 0x0, buffers = {{handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, 
      mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, 
      mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, 
      mode = STARPU_NONE}, {handle = 0x0, mode = STARPU_NONE}, {handle = 0x0, 
      mode = STARPU_NONE}}, handles = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
    0x0}, interfaces = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, cl_arg = 0x0, 
  cl_arg_size = 0, callback_func = 0, callback_arg = 0x0, use_tag = 0, 
  tag_id = 0, sequential_consistency = 1, synchronous = 0, priority = 0, 
  execute_on_a_specific_worker = 0, workerid = 0, bundle = 0x0, detach = 1, 
  destroy = 1, regenerate = 0, status = STARPU_TASK_FINISHED, 
  profiling_info = 0x0, predicted = nan(0x8000000000000), 
  predicted_transfer = nan(0x8000000000000), prev = 0x0, next = 0x0, 
  mf_skip = 0, starpu_private = 0x1fd5960, magic = 42, sched_ctx = 0, 
  hypervisor_tag = 0, flops = 0, scheduled = 0}

(gdb) f 13
#13 0x00007ffff5c97f5e in _starpu_mpi_test_detached_requests ()
    at starpu_mpi.c:767
767                             _starpu_mpi_handle_request_termination(req);

(gdb) p *req->data_handle 
$6 = {req_list = 0x1591540, refcnt = 0, current_mode = STARPU_R, 
  header_lock = {lock = 1}, busy_count = 0, busy_waiting = 1, busy_mutex = {
    __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, 
      __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>, __align = 0}, 
  busy_cond = {__data = {__lock = 1, __futex = 0, 
      __total_seq = 18446744073709551615, __wakeup_seq = 0, __woken_seq = 0, 
      __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, 
    __size = "\001\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377", '\000' <repeats 31 times>, __align = 1}, 
  root_handle = 0x158f7a0, 
  father_handle = 0x0, sibling_index = 0, depth = 1, children = 0x0, 
  nchildren = 0, per_node = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0, 
      data_interface = 0x15914e0, memory_node = 0, relaxed_coherency = 0, 
      initialized = 0, state = STARPU_OWNER, refcnt = 0, allocated = 1 '\001', 
      automatically_allocated = 0 '\000', mc = 0x0, requested = "", request = {
        0x0}}}, per_worker = {{_prev = 0x0, _next = 0x0, handle = 0x158f7a0, 
      data_interface = 0x1591500, memory_node = 0, relaxed_coherency = 1, 
      initialized = 0, state = STARPU_INVALID, refcnt = 0, 
      allocated = 0 '\000', automatically_allocated = 0 '\000', mc = 0x0, 
      requested = "", request = {0x0}}, {_prev = 0x0, _next = 0x0, 
      handle = 0x158f7a0, data_interface = 0x1591520, memory_node = 0, 
      relaxed_coherency = 1, initialized = 0, state = STARPU_INVALID, 
      refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000', 
      mc = 0x0, requested = "", request = {0x0}}, {_prev = 0x0, _next = 0x0, 
      handle = 0x0, data_interface = 0x0, memory_node = 0, 
      relaxed_coherency = 0, initialized = 0, state = STARPU_OWNER, 
      refcnt = 0, allocated = 0 '\000', automatically_allocated = 0 '\000', 
      mc = 0x0, requested = "", request = {0x0}} <repeats 78 times>}, 
  ops = 0x6048a0, footprint = 3655735684, home_node = 0, wt_mask = 0, 
  is_readonly = 0 '\000', is_not_important = 0, sequential_consistency = 1, 
  sequential_consistency_mutex = {__data = {__lock = 0, __count = 0, 
      __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __list = {
        __prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 16 times>"\377, \377\377\377", '\000' <repeats 19 times>, __align = 0}, 
  last_submitted_mode = STARPU_RW, 
  last_submitted_writer = 0x0, last_submitted_readers = 0x0, 
  last_submitted_ghost_writer_id_is_valid = 1, 
  last_submitted_ghost_writer_id = 29305, 
  last_submitted_ghost_readers_id = 0x0, post_sync_tasks = 0x0, 
  post_sync_tasks_cnt = 0, redux_cl = 0x0, init_cl = 0x0, 
  reduction_refcnt = 0, reduction_req_list = 0x1591560, 
  reduction_tmp_handles = {0x0 <repeats 80 times>}, lazy_unregister = 0, 
  rank = 1, tag = 4813, memory_stats = 0x0, mf_node = 6525744}

Thanks for any help!

Marc


