Subject: Developers list for StarPU
List archives
- From: Xavier Lacoste <xl64100@gmail.com>
- To: Samuel Thibault <samuel.thibault@ens-lyon.org>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] Assert : Number of copy requests left is not zero
- Date: Tue, 4 Nov 2014 16:24:49 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hello again,
I added a timer to measure the time spent submitting tasks. It varies a lot
(from 1.5 s to 6 s on a case that took 15 s to factorize when it worked...).
This is not negligible. So, as far as I understand, it can explain the
memory and time benefit of flushing as soon as possible.
The deadlock seems to be reproducible on this case (Fan-out, audi, 8 MPI
processes, reception tasks submitted at start).
I printed the GEMM(a,b) and incoming GEMM(a,b) submissions and wrote a Python
program to check that they correspond, and this seems to be the case.
For each GEMM involving a reception from Pj on Pi, I have a sending GEMM on Pj
to Pi, but the program still hangs. I did not look at POTRF/TRSM as they are
always local, and there are no ADD tasks in this version.
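A minimal sketch of this kind of cross-check, assuming a hypothetical log format in which each process prints one line per remote GEMM: `<rank> SEND <peer> <a> <b>` for a GEMM(a,b) whose data is sent to `<peer>`, and `<rank> RECV <peer> <a> <b>` for an incoming GEMM(a,b) expecting data from `<peer>`. The format and the function names are illustrative only, not the actual output of the solver:

```python
from collections import Counter

def parse_log(lines):
    """Split log lines into multisets of sends and expected receptions.

    Hypothetical line format: "<rank> SEND|RECV <peer> <a> <b>".
    Both multisets are keyed by (sender, receiver, a, b).
    """
    sends, recvs = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or fields[1] not in ("SEND", "RECV"):
            continue  # ignore unrelated output
        rank, peer, a, b = (int(fields[i]) for i in (0, 2, 3, 4))
        if fields[1] == "SEND":
            sends[(rank, peer, a, b)] += 1
        else:
            recvs[(peer, rank, a, b)] += 1
    return sends, recvs

def unmatched(lines):
    """Return (sends without a reception, receptions without a send)."""
    sends, recvs = parse_log(lines)
    return sorted((sends - recvs).elements()), sorted((recvs - sends).elements())

log = [
    "1 SEND 0 3 5",  # P1 sends for GEMM(3,5) to P0
    "0 RECV 1 3 5",  # P0 expects GEMM(3,5) data from P1
    "2 SEND 0 4 7",  # P2 sends for GEMM(4,7) to P0, which never expects it
]
print(unmatched(log))
```

Two empty lists mean that every submitted send has a matching reception and vice versa. This only shows that the submissions pair up; it says nothing about the order in which they are submitted or processed.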
(I also noticed that I was flushing the wrong buffer in GEMM(a,b): I was
flushing B, which was not local, whereas it was A that could be sent.
Anyway, changing this had no effect on the deadlock.
In the incoming GEMM case it is OK: I am flushing A, which is received.)
Regards,
XL.
On 4 Nov 2014, at 11:18, Xavier Lacoste <xl64100@gmail.com> wrote:
>
> On 4 Nov 2014, at 11:11, Samuel Thibault <samuel.thibault@ens-lyon.org>
> wrote:
>
>> Xavier Lacoste, on Tue 04 Nov 2014 10:49:52 +0100, wrote:
>>>>> [starpu][_starpu_mpi_early_data_check_termination][assert failure]
>>>>> Number of copy requests left is not zero
>>>>>
>>>>> Do you have an idea of what could cause this assert?
>>>>
>>>> I have added a note in the message: did you forget to post a receive
>>>> corresponding to a send?
>>>
>>> Hmm, I'll have a look at it.
>>> Could it be that I flush a piece of data earlier than I should?
>>
>> Ah, I now realize: you are not using starpu_mpi_send, starpu_mpi_recv
>> and the like explicitly, and always rely on the communications implicitly
>> generated by starpu_mpi_task_insert?
> Yes, indeed.
>>
>> Normally, if all MPI nodes are doing exactly the same submission loop,
>> flushing data earlier is not a problem: the node needing it later will
>> receive it again, and the node owning it will know that the other node
>> will need it later, and will thus send it again.
> I'm not doing the same submission loop on all MPI nodes.
> I'm inserting only tasks using local data (as I tried to explain in the
> algorithms in my previous mail).
> So a mistake on my side is possible here.
>> Perhaps there is a
>> bug in StarPU-MPI, but we already test this scenario, so I'd rather
>> first make sure that the application is really running the same
>> submission loop (perhaps you have made mistakes while pruning the
>> submission, and thus the node owning the data doesn't know it has to
>> send it again, or conversely).
> Yes, I'll check my submission loops.
>>
>> Samuel
>
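The point about identical submission loops can be illustrated with a toy model. This is only a sketch under stated assumptions, not StarPU-MPI's implementation: the function `implied_comms` and the data names are hypothetical, each task is assumed to run on the node owning the data it writes, and no communication cache is modeled, so every remote read triggers a transfer (as if the data were flushed after each use).

```python
def implied_comms(rank, tasks, owner):
    """Communications that `rank` infers from the tasks it submits.

    tasks: list of (written, [read, ...]) data names, in submission order.
    owner: dict mapping each data name to the rank that owns it.
    Returns (sends, recvs), two lists of (src, dst, data) tuples.
    """
    sends, recvs = [], []
    for written, reads in tasks:
        executor = owner[written]  # the task runs where its output lives
        for data in reads:
            src = owner[data]
            if src == executor:
                continue  # local data, no communication
            if rank == src:
                sends.append((src, executor, data))
            if rank == executor:
                recvs.append((src, executor, data))
    return sends, recvs

owner = {"A": 0, "C": 1}
tasks = [("C", ["A"]), ("C", ["A"])]  # two GEMM-like updates of C reading A

# Same loop on both nodes: every send on node 0 has a receive on node 1.
s0, _ = implied_comms(0, tasks, owner)
_, r1 = implied_comms(1, tasks, owner)
print(s0 == r1)

# Node 0 prunes the second task by mistake: node 1 posts a receive
# for which node 0 never posts the send.
s0_pruned, _ = implied_comms(0, tasks[:1], owner)
print(len(r1) - len(s0_pruned))
```

In this model, a pruned loop is safe only if both the owner of the data and the node using it keep every task that involves a transfer between them. Dropping such a task on one side alone leaves a send or a receive without its counterpart.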