
starpu-devel - Re: [Starpu-devel] Performance decreasing by adding empty tasks

  • From: Xavier Lacoste <xavier.lacoste@inria.fr>
  • To: Samuel Thibault <samuel.thibault@ens-lyon.org>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] Performance decreasing by adding empty tasks
  • Date: Fri, 24 Feb 2012 12:09:49 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>


On 24 Feb 2012, at 08:03, Xavier Lacoste wrote:

>
> On 23 Feb 2012, at 18:31, Samuel Thibault wrote:
>
>>> Xavier Lacoste, on Tue 21 Feb 2012 10:25:42 +0100, wrote:
>>> I can understand that most of the time is spent with data management,
>>> but that's all I can say...
>>
>> Most of these are actually idle handlers, so it's "normal" that they
>> take most of the time. The processors are just not enough fed with tasks
>> to process.
>>
>>> I found a workaround by managing this buffer myself: copying it to all
>>> CUDA devices at the beginning of the run and selecting the right copy in the
>>> CUDA kernel (I give it an array d_blocktab of ndevices CUDA pointers, and
>>> choose the right one in the kernel with cudaGetDevice).
>>>
>>> With this workaround I lose no time adding my global read-only buffer.
>>
>> Ok, so it's really an issue in StarPU. I can see several potential
>> explanations:
>>
>> * The read-only buffer gets evicted from GPU memory when StarPU has to
>> make room for new data. And thus it has to be brought back in. I guess
>> the data of Pastix don't fit entirely in the GPU memory, so it has to do
>> eviction.
> I don't think that's the case, the problem occurs even with NCUDA=0.
>>
>> A way to avoid this is to use the write-through mask: call
>> starpu_data_set_wt_mask() on the data, and the data will be replicated
>> on all memory nodes, and kept there and prevented from eviction. I
>> have just committed the last property (which is the one we need here)
>> in revision r5790, so it's not available in earlier revisions, but the
>> change is trivial to backport to earlier revisions.
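
A minimal sketch of the write-through approach suggested above, under the
assumption that the buffer is registered as a vector handle (the helper name
register_blocktab and its parameters are illustrative, not taken from PaStiX;
each bit of the mask passed to starpu_data_set_wt_mask selects a memory node,
so ~0u requests a replica on every node):

    #include <stdint.h>
    #include <starpu.h>

    /* Register the read-only blocktab buffer and ask StarPU to keep a
     * write-through replica on every memory node, which (with r5790 or
     * later) also prevents it from being evicted. */
    static starpu_data_handle_t register_blocktab(int *blocktab, uint32_t nelems)
    {
        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, 0 /* main RAM */,
                                    (uintptr_t)blocktab, nelems, sizeof(int));
        starpu_data_set_wt_mask(handle, ~0u);  /* replicate to all nodes */
        return handle;
    }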
>>
>> * Since all tasks access the same data in read mode, there is
>> contention on the spinlock managing the access mode of the data. I however
>> guess that your tasks are long enough for that contention to remain
>> low.
>>
>
> I can have many small tasks at the beginning of the factorization; I'll
> compute the average size of my sparse GEMMs to see.
>
>> * All tasks register themselves as readers for
>> the implicit dependencies on the data. This makes
>> _starpu_release_data_enforce_sequential_consistency have to browse a
>> very big list while everybody waits for it. Please try to call
>> starpu_data_set_sequential_consistency_flag(handle, 0) on that data, to
>> disable implicit dependency computation.
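
A minimal sketch of that suggestion, assuming the read-only buffer has already
been registered as a data handle (the helper name disable_implicit_deps is
illustrative):

    #include <starpu.h>

    /* Turn off implicit (sequential-consistency) dependencies on this
     * handle, so tasks no longer register themselves as readers on it;
     * any ordering involving this data must then be expressed explicitly,
     * e.g. with tags. */
    static void disable_implicit_deps(starpu_data_handle_t handle)
    {
        starpu_data_set_sequential_consistency_flag(handle, 0);
    }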
>>
> Ok, I will try that! Thanks.

Using this I get performance (with STARPU_NCUDA=0 or 3) close to what I get
when I manually copy the data to the GPUs (which I do on all GPUs regardless
of STARPU_NCUDA).
It is still a bit slower, but only by about 10-25%, with both NCUDA=0 and
NCUDA=3.
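
For comparison, a sketch of the kind of manual replication workaround described
earlier in the thread, assuming a plain CUDA runtime setup (the names
replicate_blocktab, MAX_DEVICES and the codelet-side usage are illustrative,
not taken from PaStiX):

    #include <stddef.h>
    #include <cuda_runtime.h>

    #define MAX_DEVICES 16

    /* One device copy of the read-only buffer per GPU, filled once at startup. */
    static int *d_blocktab[MAX_DEVICES];

    static void replicate_blocktab(const int *h_blocktab, size_t nelems)
    {
        int ndevices;
        cudaGetDeviceCount(&ndevices);
        for (int dev = 0; dev < ndevices && dev < MAX_DEVICES; dev++)
        {
            cudaSetDevice(dev);
            cudaMalloc((void **)&d_blocktab[dev], nelems * sizeof(int));
            cudaMemcpy(d_blocktab[dev], h_blocktab, nelems * sizeof(int),
                       cudaMemcpyHostToDevice);
        }
    }

    /* In the CUDA codelet, pick the copy belonging to the current device
     * and pass it to the kernel as an extra argument. */
    static void cuda_task_func(void *buffers[], void *cl_arg)
    {
        int dev;
        cudaGetDevice(&dev);
        /* launch_kernel(..., d_blocktab[dev]); */
    }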

>> Samuel
>
>





