
starpu-devel - Re: [Starpu-devel] starpu-print-all-tasks output

Subject: Developers list for StarPU

  • From: Mirko Myllykoski <mirkom@cs.umu.se>
  • To: Starpu Devel <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] starpu-print-all-tasks output
  • Date: Wed, 02 Jan 2019 12:36:17 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

Here is some information to supplement my previous email (the pointers do not match the earlier printouts because these come from a different run):

I took a closer look at one of the uncompleted tasks:

...

StarPU Task (0x555555da63c0)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff387eba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555555da69c0>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555555da69c0)
task: <0x555555da63c0>
submitted: <1>
terminated: <0>
job_id: <8635>
name: <_starpu_data_acquire_cb_pre>

...

(gdb) print * (struct starpu_task *) 0x555555da63c0
$1 = {name = 0x7ffff3914cd5 "_starpu_data_acquire_cb_pre", cl = 0x0, nbuffers = 0, handles = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, interfaces = {0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, modes = {STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE}, dyn_handles = 0x0,
dyn_interfaces = 0x0, dyn_modes = 0x0, cl_arg = 0x0, cl_arg_size = 0, callback_func = 0x7ffff387eba8 <starpu_data_acquire_cb_pre_sync_callback>,
callback_arg = 0x555555da6310, prologue_callback_func = 0x0, prologue_callback_arg = 0x0, prologue_callback_pop_func = 0x0, prologue_callback_pop_arg = 0x0,
tag_id = 0, cl_arg_free = 0, callback_arg_free = 0, prologue_callback_arg_free = 0, prologue_callback_pop_arg_free = 0, use_tag = 0, sequential_consistency = 1,
synchronous = 0, execute_on_a_specific_worker = 0, detach = 1, destroy = 1, regenerate = 0, mf_skip = 0, scheduled = 0, prefetched = 0, workerid = 0,
workerorder = 0, priority = 0, status = STARPU_TASK_BLOCKED_ON_TASK, magic = 42, type = 2, color = 0, sched_ctx = 0, hypervisor_tag = 0, possibly_parallel = 0,
bundle = 0x0, profiling_info = 0x555555da6d00, flops = 0, predicted = nan(0x8000000000000), predicted_transfer = nan(0x8000000000000), prev = 0x0, next = 0x0,
starpu_private = 0x555555da69c0, omp_task = 0x0}

It appears that this task is created inside the starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids function. The callback_arg field (0x555555da6310) should point to a user_interaction_wrapper struct:

(gdb) print * (struct user_interaction_wrapper *) 0x555555da6310
$2 = {handle = 0x555555d9d5b0, mode = STARPU_W, node = -2, cond = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, __g1_start32 = {
__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>,
__align = 0}, lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, finished = 0, async = 1, prefetch = 0, callback = 0x7ffff388cf6c <_starpu_data_invalidate>,
callback_fetch_data = 0xffffffff, callback_arg = 0x555555d9d5b0, pre_sync_task = 0x555555da63c0, post_sync_task = 0x555555da6560}
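
The chain followed above (the task, then its callback_arg, then the wrapper's handle) can also be read out of saved GDB printouts mechanically. The following is only a minimal Python sketch under stated assumptions: the printouts are available as plain text, and `gdb_field` is a hypothetical helper written for this illustration, not part of GDB or StarPU. The two strings are abbreviated excerpts of the printouts above.

```python
import re
from typing import Optional


def gdb_field(dump: str, name: str) -> Optional[str]:
    """Return the value shown for `name = value` in a GDB struct printout.

    Hypothetical helper: it only handles scalar fields (pointers, integers,
    enum names), not nested structs or arrays.
    """
    # The lookbehind keeps `callback_arg` from matching inside
    # `prologue_callback_arg`; requiring " = " right after the name
    # keeps it from matching `callback_arg_free`.
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r" = ([^,{}\s]+)"
    match = re.search(pattern, dump)
    return match.group(1) if match else None


# Abbreviated excerpts of the two printouts above.
task_dump = (
    "callback_func = 0x7ffff387eba8 <starpu_data_acquire_cb_pre_sync_callback>, "
    "callback_arg = 0x555555da6310, prologue_callback_func = 0x0, "
    "prologue_callback_arg = 0x0, callback_arg_free = 0, "
    "status = STARPU_TASK_BLOCKED_ON_TASK"
)
wrapper_dump = (
    "$2 = {handle = 0x555555d9d5b0, mode = STARPU_W, node = -2, "
    "callback_arg = 0x555555d9d5b0, pre_sync_task = 0x555555da63c0, "
    "post_sync_task = 0x555555da6560}"
)

# Step 1: the task's callback_arg is the address of the wrapper struct.
wrapper_address = gdb_field(task_dump, "callback_arg")
# Step 2: the wrapper's handle is the data handle to inspect next.
handle_address = gdb_field(wrapper_dump, "handle")
print("wrapper:", wrapper_address)
print("handle:", handle_address)
```

The wrapper's pre_sync_task field can be extracted the same way and compared with the task address, as a sanity check that the cast to user_interaction_wrapper was the right one.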

The handle field (0x555555d9d5b0) looks relevant to me:

(gdb) starpu-print-data 0x555555d9d5b0
Data handle 0x555555d9d5b0
Matrix
Home node -1
RWlock refs 1
Busy count 3
Current mode R
Node 0 ( 1): OWNER initialized
Post sync tasks
StarPU Task (0x555555da5d30)
name: <_starpu_data_acquire_cb_post>
codelet: <(nil)>
callback: <(nil)>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_INVALID>
job: <0x555555da5ed0>
ndeps: <0>
ndeps_completed: <0>
nsuccs: <1>
StarPU Job (0x555555da5ed0)
task: <0x555555da5d30>
submitted: <0>
terminated: <0>
job_id: <8633>
name: <_starpu_data_acquire_cb_post>
Requester tasks
Arbitered requester tasks

(gdb) call starpu_mpi_data_get_tag(0x555555d9d5b0)
$5 = 560

(gdb) call starpu_mpi_data_get_rank(0x555555d9d5b0)
$6 = 1

It is noteworthy that this MPI node's rank is 0:

(gdb) call starpu_mpi_world_rank()
$7 = 0

In summary:

Node 0 has a set of uncompleted _starpu_data_acquire_cb_post tasks (all other tasks have apparently completed), and one of these tasks is somehow involved with a data handle that is owned by node 1.

- Mirko

On 2018-12-28 11:44, Mirko Myllykoski wrote:
Hi,

I have a problem where some MPI ranks get stuck on a
starpu_task_wait_for_all function call. The
starpu_mpi_task_wait_for_all function was causing problems earlier so
I am calling the starpu_task_wait_for_all function followed by the
starpu_mpi_barrier function. This code does not use several scheduling
contexts.

I am struggling to interpret the output of the starpu-print-all-tasks
GDB command (please see the end of this email). It appears that those
MPI ranks that get stuck have several uncompleted tasks. All tasks are
of the same type (some form of synchronization mechanism?) and all of
them seem to have one uncompleted dependency. The status field hints
that the dependency is a task.

Is this output inconsistent or should I look elsewhere for the bug?

Best Regards,
Mirko Myllykoski

=====================================================================

(gdb) starpu-print-all-tasks
task 0x555556ced5c0
StarPU Task (0x555556ced5c0)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cedbc0>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cedbc0)
task: <0x555556ced5c0>
submitted: <1>
terminated: <0>
job_id: <10624>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf1170
StarPU Task (0x555556cf1170)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf1770>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf1770)
task: <0x555556cf1170>
submitted: <1>
terminated: <0>
job_id: <10633>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf2960
StarPU Task (0x555556cf2960)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf2f60>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf2f60)
task: <0x555556cf2960>
submitted: <1>
terminated: <0>
job_id: <10637>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf6480
StarPU Task (0x555556cf6480)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf6a80>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf6a80)
task: <0x555556cf6480>
submitted: <1>
terminated: <0>
job_id: <10646>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf7c70
StarPU Task (0x555556cf7c70)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf8270>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf8270)
task: <0x555556cf7c70>
submitted: <1>
terminated: <0>
job_id: <10650>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cfa960
StarPU Task (0x555556cfa960)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cfaf60>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cfaf60)
task: <0x555556cfa960>
submitted: <1>
terminated: <0>
job_id: <10657>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cfe480
StarPU Task (0x555556cfe480)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cfea80>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cfea80)
task: <0x555556cfe480>
submitted: <1>
terminated: <0>
job_id: <10666>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cffc70
StarPU Task (0x555556cffc70)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d00270>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d00270)
task: <0x555556cffc70>
submitted: <1>
terminated: <0>
job_id: <10670>
name: <_starpu_data_acquire_cb_pre>
task 0x555556d03790
StarPU Task (0x555556d03790)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d03d90>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d03d90)
task: <0x555556d03790>
submitted: <1>
terminated: <0>
job_id: <10679>
name: <_starpu_data_acquire_cb_pre>
task 0x555556d04f80
StarPU Task (0x555556d04f80)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d05580>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d05580)
task: <0x555556d04f80>
submitted: <1>
terminated: <0>
job_id: <10683>
name: <_starpu_data_acquire_cb_pre>

=====================================================================
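
A dump like the one above can be tallied mechanically instead of being read entry by entry. The following is only a minimal Python sketch, assuming the starpu-print-all-tasks output was saved as text in the `key: <value>` layout shown above. The names `parse_tasks`, `tally` and `pending_dependencies` are hypothetical helpers for this illustration and are not part of StarPU or its GDB scripts. The sample is an abbreviated copy of the first two entries.

```python
import re
from collections import Counter

HEADER = re.compile(r"^StarPU (Task|Job) \((0x[0-9a-fA-F]+)\)$")
FIELD = re.compile(r"^(\w+): <(.*)>$")


def parse_tasks(dump):
    """Parse the dump into a list of {"task": {...}, "job": {...}} records."""
    records = []
    section = None  # dict currently receiving `key: <value>` lines
    for raw in dump.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            kind, address = header.groups()
            if kind == "Task":
                records.append({"task": {}, "job": {}})
            if records:
                section = records[-1][kind.lower()]
                section["address"] = address
            continue
        field = FIELD.match(line)
        if field and section is not None:
            section[field.group(1)] = field.group(2)
    return records


def tally(records):
    """Count tasks per (name, status) pair."""
    return Counter(
        (r["task"].get("name"), r["task"].get("status")) for r in records
    )


def pending_dependencies(records):
    """Addresses of tasks whose ndeps exceeds ndeps_completed."""
    return [
        r["task"]["address"]
        for r in records
        if int(r["task"].get("ndeps", 0)) > int(r["task"].get("ndeps_completed", 0))
    ]


# Abbreviated copy of the first two entries of the dump above.
sample = """\
task 0x555556ced5c0
StarPU Task (0x555556ced5c0)
name: <_starpu_data_acquire_cb_pre>
status: <STARPU_TASK_BLOCKED_ON_TASK>
ndeps: <1>
ndeps_completed: <0>
StarPU Job (0x555556cedbc0)
submitted: <1>
terminated: <0>
job_id: <10624>
task 0x555556cf1170
StarPU Task (0x555556cf1170)
name: <_starpu_data_acquire_cb_pre>
status: <STARPU_TASK_BLOCKED_ON_TASK>
ndeps: <1>
ndeps_completed: <0>
StarPU Job (0x555556cf1770)
submitted: <1>
terminated: <0>
job_id: <10633>
"""

records = parse_tasks(sample)
for (name, status), count in tally(records).items():
    print(count, name, status)
print("waiting on a dependency:", pending_dependencies(records))
```

Grouping by name and status makes it quick to confirm whether every remaining task really is of the same type and blocked in the same way, which is the question raised in the email.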



