
starpu-devel - Re: [Starpu-devel] Problem with starpu_mpi_wait_for_all and starpu_mpi_barrier functions


Re: [Starpu-devel] Problem with starpu_mpi_wait_for_all and starpu_mpi_barrier functions


  • From: Mirko Myllykoski <mirkom@cs.umu.se>
  • To: "COUTEYEN, Jean-marie" <jean-marie.couteyen-carpaye@airbus.com>
  • Cc: Starpu Devel <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] Problem with starpu_mpi_wait_for_all and starpu_mpi_barrier functions
  • Date: Fri, 09 Nov 2018 13:25:48 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

On 2018-11-09 12:34, COUTEYEN, Jean-marie wrote:

I used to have a problem with starpu_mpi_wait_for_all as well, in the
context of explicit communication. The issue is that this function may
call _starpu_mpi_barrier a different number of times on different
processes, since there is no guarantee that
"_starpu_task_wait_for_all_and_return_nb_waited_tasks" returns non-zero
on every process. (In my case the value returned was non-zero on one
process, and that process then called starpu_mpi_barrier once more than
the other processes, since it entered the "while" loop one more time.)
If you can attach a debugger and it is the same problem I used to have,
you will see one process stuck in starpu_mpi_barrier while the others
have already passed starpu_mpi_wait_for_all.

This seems to be the situation. I also added a breakpoint to the MPI_Barrier() function, and it appears that the nodes sometimes call the function a different number of times.
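
For reference, this is roughly how I picture the loop you are describing
(only my reading of your description, not the actual StarPU source):

    /* Sketch of the starpu_mpi_wait_for_all() loop as described above
     * (NOT the actual StarPU source code): */
    int wait_for_all_sketch(MPI_Comm comm)
    {
        int tasks = 1, mpi = 1;
        while (tasks || mpi)
        {
            /* may return non-zero on some processes and zero on others */
            tasks = _starpu_task_wait_for_all_and_return_nb_waited_tasks();
            /* reaches MPI_Barrier() internally; if one process takes one
             * extra trip around this loop, the barrier counts no longer
             * match across the processes and everything hangs */
            mpi = _starpu_mpi_barrier(comm);
        }
        return 0;
    }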

But you might not need this function: "starpu_mpi_barrier" is supposed
to also wait for all tasks before exiting. When I use it, I do not have
any problems in my case.

Is this really true? The StarPU handbook states only the following: "Block the caller until all group members of the communicator comm have called it."

I have implemented a temporary fix that calls the starpu_mpi_barrier() function before the starpu_mpi_wait_for_all() function.
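
Concretely, the temporary fix is just the following two calls (the
communicator is MPI_COMM_WORLD in this sketch; error checking omitted):

    /* temporary fix: explicit barrier first, so that every rank reaches
     * the same point before starpu_mpi_wait_for_all() starts its loop */
    starpu_mpi_barrier(MPI_COMM_WORLD);
    starpu_mpi_wait_for_all(MPI_COMM_WORLD);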

However... based on your previous mails, it seems you are using
different submitting contexts? Moreover, did you change the current
submitting context (using starpu_sched_ctx_set_context)?
"starpu_task_wait_for_all" only waits for the tasks of the current
submitting context. Since starpu_mpi_barrier relies on this function,
this might be the origin of the deadlock.

Some of my codes use different scheduling contexts, some don't. I will have to check this. Thanks for the tip.

However, it is a bit weird that the code I attached to my original email seems to reproduce the problem even though it does not use different scheduling contexts. The problem seems to be timing-sensitive, so the code does not always trigger it.
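
For completeness, by "different scheduling contexts" I mean the usual
pattern from the handbook, roughly like this (simplified sketch, not my
actual code; workerids/nworkers are placeholders):

    /* simplified sketch of a separate scheduling context (workerids and
     * nworkers are placeholders); tasks submitted while "ctx" is the
     * current submitting context are, as far as I understand the
     * discussion, the only ones starpu_task_wait_for_all() waits for */
    unsigned ctx = starpu_sched_ctx_create(workerids, nworkers, "my_ctx",
                                           STARPU_SCHED_CTX_POLICY_NAME,
                                           "dmda", 0);
    starpu_sched_ctx_set_context(&ctx);

    /* ... starpu_mpi_task_insert() / starpu_task_submit() calls ... */

    starpu_mpi_wait_for_all(MPI_COMM_WORLD);   /* the call that hangs */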

Without modifying StarPU, I don't think it is possible to reset the
current submitting context to none? If it is possible, I would try to
replace the "starpu_mpi_wait_for_all" in your code with a
starpu_mpi_barrier preceded by a reset of the current submitting
context.

I don't quite understand. Do you mean that I should set the scheduling context back to the default scheduling context and then call the starpu_mpi_wait_for_all() function?
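
To make sure I understand, do you mean something along these lines
(default_ctx is hypothetical and stands for whatever context was current
before I created my own)?

    /* my reading of the suggestion: restore the original submitting
     * context, then synchronize (default_ctx is hypothetical) */
    starpu_sched_ctx_set_context(&default_ctx);
    starpu_mpi_barrier(MPI_COMM_WORLD);   /* or starpu_mpi_wait_for_all()? */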

Best Regards,
Mirko Myllykoski




