starpu-devel - Re: [Starpu-devel] StarPU overhead for MPI

List subject: Developers list for StarPU

  • From: Philippe SWARTVAGHER <philippe.swartvagher@inria.fr>
  • To: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] StarPU overhead for MPI
  • Date: Wed, 12 Feb 2020 09:50:24 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,
> I pushed to master an optimization (which was TODO for something like a
> decade) that avoids this part of the backtrace:

I tested this morning on 4 miriel nodes, with Cholesky/Chameleon. I got the following backtrace, regardless of the communication layer:

/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg+0x452)[0x2ab6954524d2]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg_list+0xf3)[0x2ab6954527b3]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_dependencies+0x24)[0x2ab695452dd4]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x40b)[0x2ab69543c08b]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_repush_task+0x163)[0x2ab695472573]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_enforce_deps_and_schedule+0x1e0)[0x2ab69543cce0]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_submit_job+0x156)[0x2ab69543f8b6]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_task_submit+0x1dc)[0x2ab69544203c]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_task_submit_internally+0x25)[0x2ab6954425a5]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_unlock_post_sync_tasks+0xf8)[0x2ab695454ab8]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(starpu_data_release_on_node+0x21)[0x2ab6954b2e51]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xbd785)[0x2ab6954b7785]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xb6546)[0x2ab6954b0546]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_create_request_to_fetch_data+0xa80)[0x2ab69549b140]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_data_on_node+0xfd)[0x2ab69549b67d]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xb6b76)[0x2ab6954b0b76]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x8c0)[0x2ab69543c540]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_repush_task+0x163)[0x2ab695472573]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_enforce_deps_starting_from_task+0x55)[0x2ab69543cef5]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg+0x379)[0x2ab6954523f9]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg_list+0xf3)[0x2ab6954527b3]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_dependencies+0x24)[0x2ab695452dd4]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x40b)[0x2ab69543c08b]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0x102672)[0x2ab6954fc672]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_driver_run_once+0x241)[0x2ab6954fcd51]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_worker+0x3d)[0x2ab6954fce3d]
/lib64/libpthread.so.0(+0x7dd5)[0x2ab69c87cdd5]
/lib64/libc.so.6(clone+0x6d)[0x2ab69d77cead]
snew-testing: /home/pswartva/dev/starpu/src/core/dependencies/cg.c:266: _starpu_notify_cg: Assertion `job_successors->ndeps >= ndeps_completed' failed.

snew-testing:12885 terminated with signal 6 at PC=2ab69d6b5207 SP=2ab6ab7fd778.  Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x2ab69d6b5207]
/lib64/libc.so.6(abort+0x148)[0x2ab69d6b68f8]
/lib64/libc.so.6(+0x2f026)[0x2ab69d6ae026]
/lib64/libc.so.6(+0x2f0d2)[0x2ab69d6ae0d2]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg+0x48d)[0x2ab69545250d]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg_list+0xf3)[0x2ab6954527b3]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_dependencies+0x24)[0x2ab695452dd4]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x40b)[0x2ab69543c08b]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_repush_task+0x163)[0x2ab695472573]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_enforce_deps_and_schedule+0x1e0)[0x2ab69543cce0]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_submit_job+0x156)[0x2ab69543f8b6]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_task_submit+0x1dc)[0x2ab69544203c]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_task_submit_internally+0x25)[0x2ab6954425a5]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_unlock_post_sync_tasks+0xf8)[0x2ab695454ab8]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(starpu_data_release_on_node+0x21)[0x2ab6954b2e51]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xbd785)[0x2ab6954b7785]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xb6546)[0x2ab6954b0546]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_create_request_to_fetch_data+0xa80)[0x2ab69549b140]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_data_on_node+0xfd)[0x2ab69549b67d]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0xb6b76)[0x2ab6954b0b76]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x8c0)[0x2ab69543c540]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_repush_task+0x163)[0x2ab695472573]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_enforce_deps_starting_from_task+0x55)[0x2ab69543cef5]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg+0x379)[0x2ab6954523f9]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_cg_list+0xf3)[0x2ab6954527b3]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_notify_dependencies+0x24)[0x2ab695452dd4]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_handle_job_termination+0x40b)[0x2ab69543c08b]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(+0x102672)[0x2ab6954fc672]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_driver_run_once+0x241)[0x2ab6954fcd51]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_worker+0x3d)[0x2ab6954fce3d]
/lib64/libpthread.so.0(+0x7dd5)[0x2ab69c87cdd5]
/lib64/libc.so.6(clone+0x6d)[0x2ab69d77cead]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_select_src_node+0x241)[0x2b08b02d7391]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_create_request_to_fetch_data+0xcba)[0x2b08b02d837a]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_data_on_node+0xfd)[0x2b08b02d867d]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_task_input+0x1c0)[0x2b08b02d9830]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_driver_run_once+0xe0)[0x2b08b0339bf0]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_worker+0x3d)[0x2b08b0339e3d]
/lib64/libpthread.so.0(+0x7dd5)[0x2b08b76b9dd5]
/lib64/libc.so.6(clone+0x6d)[0x2b08b85b9ead]

[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x132c830 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?

snew-testing: /home/pswartva/dev/starpu/src/datawizard/coherency.c:70: _starpu_select_src_node: Assertion `src_node_mask != 0' failed.

snew-testing:172195 terminated with signal 6 at PC=2b08b84f2207 SP=2b08c5c373f8.  Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x2b08b84f2207]
/lib64/libc.so.6(abort+0x148)[0x2b08b84f38f8]
/lib64/libc.so.6(+0x2f026)[0x2b08b84eb026]
/lib64/libc.so.6(+0x2f0d2)[0x2b08b84eb0d2]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_select_src_node+0x291)[0x2b08b02d73e1]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_create_request_to_fetch_data+0xcba)[0x2b08b02d837a]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_data_on_node+0xfd)[0x2b08b02d867d]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_fetch_task_input+0x1c0)[0x2b08b02d9830]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_driver_run_once+0xe0)[0x2b08b0339bf0]
/lustre/pswartva/builds/nmad/starpu/install/lib/libstarpu-1.3.so.0(_starpu_cpu_worker+0x3d)[0x2b08b0339e3d]
/lib64/libpthread.so.0(+0x7dd5)[0x2b08b76b9dd5]
/lib64/libc.so.6(clone+0x6d)[0x2b08b85b9ead]
srun: error: miriel010: task 1: Exited with exit code 1



If I use the commit just before your commit 0dbef4444, the problem disappears. I was not able to reproduce the bug on Daltons.
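
To make the two crashes above easier to compare, the frames can be reduced to their symbol names. The following is only a minimal sketch, assuming the frames follow the `library(symbol+offset)[address]` layout seen in the log; the names `FRAME_RE`, `parse_frame` and `symbols` are hypothetical helpers introduced for illustration and are not part of StarPU.

```python
import re

# Assumed frame layouts, taken from the log above:
#   /path/libstarpu-1.3.so.0(_starpu_notify_cg+0x452)[0x2ab6954524d2]
#   /path/libstarpu-1.3.so.0(+0xbd785)[0x2ab6954b7785]   (no symbol name)
FRAME_RE = re.compile(
    r"^(?P<lib>\S+)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_frame(line):
    """Return (library, symbol or None, offset, address), or None if not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return (
        match.group("lib"),
        match.group("symbol") or None,   # empty string means unsymbolized
        int(match.group("offset"), 16),
        int(match.group("addr"), 16),
    )


def symbols(lines):
    """List the symbol names of the frames in `lines`, innermost frame first."""
    names = []
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:            # skip assertion messages, srun output, etc.
            names.append(frame[1] or "??")
    return names
```

Frames printed as `(+0xbd785)` carry only an offset into the library; a tool such as `addr2line -e <library> 0xbd785` may resolve them when the library was built with debug information.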

--
Philippe SWARTVAGHER

PhD student
TADaaM team, Inria Bordeaux Sud-Ouest




