Subject: Developers list for StarPU
List archives
- From: Mirko Myllykoski <mirkom@cs.umu.se>
- To: Starpu Devel <starpu-devel@lists.gforge.inria.fr>
- Subject: Re: [Starpu-devel] starpu-print-all-tasks output
- Date: Mon, 14 Jan 2019 19:04:23 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi,
It appears that I have solved the problem. I forgot to flush certain data handles after pruning a branch from the task graph. I am still not 100% sure why my mistake led to this weird outcome. This also fixes my earlier problem where I had to disable the MPI cache in certain situations.
- Mirko
On 2019-01-02 12:36, Mirko Myllykoski wrote:
Hi,
Here is some supplementary information to my previous email (the
pointers do not match with the previous printouts since these are from
a different run):
I took a closer look at one of the uncompleted tasks:
...
StarPU Task (0x555555da63c0)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff387eba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555555da69c0>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555555da69c0)
task: <0x555555da63c0>
submitted: <1>
terminated: <0>
job_id: <8635>
name: <_starpu_data_acquire_cb_pre>
...
(gdb) print * (struct starpu_task *) 0x555555da63c0
$1 = {name = 0x7ffff3914cd5 "_starpu_data_acquire_cb_pre", cl = 0x0,
nbuffers = 0, handles = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
interfaces = {0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, modes = {STARPU_NONE, STARPU_NONE,
STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE, STARPU_NONE,
STARPU_NONE}, dyn_handles = 0x0,
dyn_interfaces = 0x0, dyn_modes = 0x0, cl_arg = 0x0, cl_arg_size =
0, callback_func = 0x7ffff387eba8
<starpu_data_acquire_cb_pre_sync_callback>,
callback_arg = 0x555555da6310, prologue_callback_func = 0x0,
prologue_callback_arg = 0x0, prologue_callback_pop_func = 0x0,
prologue_callback_pop_arg = 0x0,
tag_id = 0, cl_arg_free = 0, callback_arg_free = 0,
prologue_callback_arg_free = 0, prologue_callback_pop_arg_free = 0,
use_tag = 0, sequential_consistency = 1,
synchronous = 0, execute_on_a_specific_worker = 0, detach = 1,
destroy = 1, regenerate = 0, mf_skip = 0, scheduled = 0, prefetched =
0, workerid = 0,
workerorder = 0, priority = 0, status = STARPU_TASK_BLOCKED_ON_TASK,
magic = 42, type = 2, color = 0, sched_ctx = 0, hypervisor_tag = 0,
possibly_parallel = 0,
bundle = 0x0, profiling_info = 0x555555da6d00, flops = 0, predicted
= nan(0x8000000000000), predicted_transfer = nan(0x8000000000000),
prev = 0x0, next = 0x0,
starpu_private = 0x555555da69c0, omp_task = 0x0}
It appears that this task is created inside the
starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids
function. The callback_arg field (0x555555da6310) should point to a
user_interaction_wrapper struct:
(gdb) print * (struct user_interaction_wrapper *) 0x555555da6310
$2 = {handle = 0x555555d9d5b0, mode = STARPU_W, node = -2, cond =
{__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}},
{__g1_start = 0, __g1_start32 = {
__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0,
0}, __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}}, __size =
'\000' <repeats 47 times>,
__align = 0}, lock = {__data = {__lock = 0, __count = 0, __owner =
0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list =
{__prev = 0x0,
__next = 0x0}}, __size = '\000' <repeats 39 times>, __align =
0}, finished = 0, async = 1, prefetch = 0, callback = 0x7ffff388cf6c
<_starpu_data_invalidate>,
callback_fetch_data = 0xffffffff, callback_arg = 0x555555d9d5b0,
pre_sync_task = 0x555555da63c0, post_sync_task = 0x555555da6560}
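The pointer cross-check between the two gdb dumps above can also be scripted. The following is a minimal Python sketch, not part of StarPU or its gdb helpers: the function name scalar_fields and the regular expression are illustrative assumptions, and the sample strings are abbreviated excerpts of the $1 and $2 printouts. It only picks up simple "field = value" pairs (pointers, integers, enum constants) and skips nested aggregates.

```python
import re

# Matches "field = value" where value is a pointer, an integer or an enum
# constant. Aggregates such as "handles = {...}" are deliberately skipped.
FIELD_RE = re.compile(r"\b(\w+)\s*=\s*(0x[0-9a-fA-F]+|-?\d+|[A-Z][A-Z_]+)")


def scalar_fields(gdb_text):
    """Return the first scalar value seen for each field name in a gdb dump."""
    # Collapse whitespace first so that e-mail line wrapping such as
    # "cl_arg_size =\n0" does not hide a field.
    text = " ".join(gdb_text.split())
    fields = {}
    for name, value in FIELD_RE.findall(text):
        fields.setdefault(name, value)
    return fields


# Abbreviated excerpts of the two printouts (hypothetical shortening).
TASK_ADDRESS = "0x555555da63c0"
TASK_DUMP = """$1 = {name = 0x7ffff3914cd5 "_starpu_data_acquire_cb_pre", cl = 0x0,
callback_arg = 0x555555da6310, detach = 1, destroy = 1,
status = STARPU_TASK_BLOCKED_ON_TASK, starpu_private = 0x555555da69c0}"""
WRAPPER_DUMP = """$2 = {handle = 0x555555d9d5b0, mode = STARPU_W, node = -2,
finished = 0, async = 1, callback_arg = 0x555555d9d5b0,
pre_sync_task = 0x555555da63c0, post_sync_task = 0x555555da6560}"""

task = scalar_fields(TASK_DUMP)
wrapper = scalar_fields(WRAPPER_DUMP)

# The wrapper should point back at the blocked task it was created for.
points_back = wrapper["pre_sync_task"] == TASK_ADDRESS
print("status:", task["status"])
print("handle:", wrapper["handle"], "mode:", wrapper["mode"])
print("wrapper points back at the blocked task:", points_back)
```

Taking the first occurrence of each field name keeps the sketch short, at the cost of ignoring repeated names inside nested structs such as the pthread members.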
The handle field (0x555555d9d5b0) looks relevant to me:
(gdb) starpu-print-data 0x555555d9d5b0
Data handle 0x555555d9d5b0
Matrix
Home node -1
RWlock refs 1
Busy count 3
Current mode R
Node 0 ( 1): OWNER initialized
Post sync tasks
StarPU Task (0x555555da5d30)
name: <_starpu_data_acquire_cb_post>
codelet: <(nil)>
callback: <(nil)>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_INVALID>
job: <0x555555da5ed0>
ndeps: <0>
ndeps_completed: <0>
nsuccs: <1>
StarPU Job (0x555555da5ed0)
task: <0x555555da5d30>
submitted: <0>
terminated: <0>
job_id: <8633>
name: <_starpu_data_acquire_cb_post>
Requester tasks
Arbitered requester tasks
(gdb) call starpu_mpi_data_get_tag(0x555555d9d5b0)
$5 = 560
(gdb) call starpu_mpi_data_get_rank(0x555555d9d5b0)
$6 = 1
It is noteworthy that this MPI node's rank is 0:
(gdb) call starpu_mpi_world_rank()
$7 = 0
In summary:
Node 0 has a set of uncompleted _starpu_data_acquire_cb_post tasks
(all other tasks have apparently completed), and one of these tasks is
somehow involved with a data handle that is owned by node 1.
- Mirko
On 2018-12-28 11:44, Mirko Myllykoski wrote:
Hi,
I have a problem where some MPI ranks get stuck on a
starpu_task_wait_for_all function call. The
starpu_mpi_task_wait_for_all function was causing problems earlier so
I am calling the starpu_task_wait_for_all function followed by the
starpu_mpi_barrier function. This code does not use several scheduling
contexts.
I am struggling to interpret the output of the starpu-print-all-tasks
GDB command (please see the end of this email). It appears that those
MPI ranks that get stuck have several uncompleted tasks. All tasks are
of the same type (some form of synchronization mechanism?) and all of
them seem to have one uncompleted dependency. The status field hints
that the dependency is a task.
Is this output inconsistent or should I look elsewhere for the bug?
Best Regards,
Mirko Myllykoski
=====================================================================
(gdb) starpu-print-all-tasks
task 0x555556ced5c0
StarPU Task (0x555556ced5c0)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cedbc0>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cedbc0)
task: <0x555556ced5c0>
submitted: <1>
terminated: <0>
job_id: <10624>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf1170
StarPU Task (0x555556cf1170)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf1770>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf1770)
task: <0x555556cf1170>
submitted: <1>
terminated: <0>
job_id: <10633>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf2960
StarPU Task (0x555556cf2960)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf2f60>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf2f60)
task: <0x555556cf2960>
submitted: <1>
terminated: <0>
job_id: <10637>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf6480
StarPU Task (0x555556cf6480)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf6a80>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf6a80)
task: <0x555556cf6480>
submitted: <1>
terminated: <0>
job_id: <10646>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf7c70
StarPU Task (0x555556cf7c70)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf8270>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf8270)
task: <0x555556cf7c70>
submitted: <1>
terminated: <0>
job_id: <10650>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cfa960
StarPU Task (0x555556cfa960)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cfaf60>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cfaf60)
task: <0x555556cfa960>
submitted: <1>
terminated: <0>
job_id: <10657>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cfe480
StarPU Task (0x555556cfe480)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cfea80>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cfea80)
task: <0x555556cfe480>
submitted: <1>
terminated: <0>
job_id: <10666>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cffc70
StarPU Task (0x555556cffc70)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d00270>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d00270)
task: <0x555556cffc70>
submitted: <1>
terminated: <0>
job_id: <10670>
name: <_starpu_data_acquire_cb_pre>
task 0x555556d03790
StarPU Task (0x555556d03790)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d03d90>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d03d90)
task: <0x555556d03790>
submitted: <1>
terminated: <0>
job_id: <10679>
name: <_starpu_data_acquire_cb_pre>
task 0x555556d04f80
StarPU Task (0x555556d04f80)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
callback: <0x7ffff3876ba8>
synchronous: <0>
execute_on_a_specific_worker: <0>
workerid: <0>
detach: <1>
destroy: <1>
regenerate: <0>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556d05580>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556d05580)
task: <0x555556d04f80>
submitted: <1>
terminated: <0>
job_id: <10683>
name: <_starpu_data_acquire_cb_pre>
=====================================================================
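A long dump like the one above is easier to read once it is grouped by task name and status. The following is a minimal Python sketch that assumes the text layout shown in this thread ("StarPU Task (address)" followed by "field: <value>" lines). The helper names parse_tasks, summarize and pending are hypothetical and are not part of the StarPU gdb scripts. The sample input is the first two entries of the dump.

```python
import re
from collections import Counter

TASK_HEADER = re.compile(r"^\s*StarPU Task \((0x[0-9a-fA-F]+)\)")
FIELD = re.compile(r"^\s*(\w+):\s*<(.*)>\s*$")


def parse_tasks(dump):
    """Split starpu-print-all-tasks text into one dict per task."""
    tasks = []
    current = None
    for line in dump.splitlines():
        header = TASK_HEADER.match(line)
        if header:
            current = {"address": header.group(1)}
            tasks.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            # The job section repeats "name"; keep the first value seen.
            current.setdefault(field.group(1), field.group(2))
    return tasks


def summarize(tasks):
    """Count tasks per (name, status) pair."""
    return Counter((t.get("name"), t.get("status")) for t in tasks)


def pending(tasks):
    """Return tasks that still wait on at least one dependency."""
    return [t for t in tasks
            if int(t.get("ndeps", 0)) > int(t.get("ndeps_completed", 0))]


SAMPLE_DUMP = """task 0x555556ced5c0
StarPU Task (0x555556ced5c0)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cedbc0>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cedbc0)
task: <0x555556ced5c0>
submitted: <1>
terminated: <0>
job_id: <10624>
name: <_starpu_data_acquire_cb_pre>
task 0x555556cf1170
StarPU Task (0x555556cf1170)
name: <_starpu_data_acquire_cb_pre>
codelet: <(nil)>
status: <STARPU_TASK_BLOCKED_ON_TASK>
job: <0x555556cf1770>
ndeps: <1>
ndeps_completed: <0>
nsuccs: <0>
StarPU Job (0x555556cf1770)
task: <0x555556cf1170>
submitted: <1>
terminated: <0>
job_id: <10633>
name: <_starpu_data_acquire_cb_pre>
"""

tasks = parse_tasks(SAMPLE_DUMP)
for (name, status), count in summarize(tasks).items():
    print(count, name, status)
for t in pending(tasks):
    print("waiting:", t["address"], "job_id", t["job_id"])
```

To use it on a real session, the gdb output could be saved to a file (for example with gdb's logging facility) and passed to parse_tasks as a string.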