

Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU


  • From: David Pereira <david_sape@hotmail.com>
  • To: Samuel Thibault <samuel.thibault@ens-lyon.org>
  • Cc: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU
  • Date: Sat, 20 Sep 2014 23:30:27 +0100
  • Importance: Normal
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

Thank you Samuel for answering my questions.

I found the reason for the unspecified launch failure: it turned out to be caused by destroying a cusparse handle twice.
Is there a way to initialize cusparse across all GPU devices the same way we do with cublas (starpu_cublas_init)?
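
For reference, I would expect a hand-rolled equivalent to look roughly like the sketch below, presumably mirroring what starpu_cublas_init does for cuBLAS by running an init function once on each CUDA worker. The handle array and the my_cusparse_* wrapper names are made up for illustration:

/* Sketch only: one cuSPARSE handle per CUDA worker, created from within
 * each worker so it is bound to that worker's device. */
#include <starpu.h>
#include <cusparse.h>

static cusparseHandle_t cusparse_handles[STARPU_NMAXWORKERS];

static void init_cusparse_func(void *arg)
{
    (void)arg;
    cusparseCreate(&cusparse_handles[starpu_worker_get_id()]);
}

static void shutdown_cusparse_func(void *arg)
{
    (void)arg;
    cusparseDestroy(cusparse_handles[starpu_worker_get_id()]);
}

void my_cusparse_init(void)
{
    /* Run the init function once on every CUDA worker. */
    starpu_execute_on_each_worker(init_cusparse_func, NULL, STARPU_CUDA);
}

void my_cusparse_shutdown(void)
{
    /* Destroy each handle exactly once, on the worker that owns it. */
    starpu_execute_on_each_worker(shutdown_cusparse_func, NULL, STARPU_CUDA);
}

Codelets would then pick the handle for their worker via starpu_worker_get_id().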

As for this question:

> > Also, is it possible that a CUDA program runs faster when using StarPU
> > (using one GPU) than running the program without the framework? The
> > StarPU version has parallel tasks unlike the version without the
> > framework (which does not use streams).

> StarPU implements a lot of optimisations such as overlapping data
> transfers with computations, which are often not easy to code directly
> in the application. The parallel implementation of tasks can also of
> course considerably reduce the critical path in the task graph.

How can overlapping data transfers with computations improve the execution time of the StarPU version (using only one GPU)
over running directly on the GPU, given that my algorithm runs entirely on the GPU without transfers to the CPU?
Also, even with a parallel implementation of tasks, if I only execute on a single GPU (without concurrent kernels), I would not expect better results with StarPU, yet that is what I observe. I can't seem to find a logical reason to explain this...
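
For context, the kind of overlap being described is typically the double-buffering pattern sketched below (plain CUDA, with a made-up kernel and buffer names; results are not copied back, to keep it short). As noted, it would not help a workload that performs no transfers:

/* Generic illustration only (not StarPU code): upload the next chunk on one
 * stream while the current chunk is processed on another, so the PCIe
 * transfer time is hidden behind computation. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;
}

int main(void)
{
    const int nchunks = 8;
    const int chunk = 1 << 20;
    float *h_buf, *d_buf[2];

    /* Pinned host memory is needed for copies to be truly asynchronous. */
    cudaMallocHost((void **)&h_buf, (size_t)nchunks * chunk * sizeof(float));
    cudaMalloc((void **)&d_buf[0], chunk * sizeof(float));
    cudaMalloc((void **)&d_buf[1], chunk * sizeof(float));

    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);

    /* Preload chunk 0, then keep the copy of chunk i+1 in flight while
     * chunk i is being computed in the loop below. */
    cudaMemcpyAsync(d_buf[0], h_buf, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, copy_s);
    cudaStreamSynchronize(copy_s);

    for (int i = 0; i < nchunks; i++)
    {
        if (i + 1 < nchunks)
            cudaMemcpyAsync(d_buf[(i + 1) % 2], h_buf + (size_t)(i + 1) * chunk,
                            chunk * sizeof(float), cudaMemcpyHostToDevice, copy_s);

        scale<<<(chunk + 255) / 256, 256, 0, comp_s>>>(d_buf[i % 2], chunk);

        /* Wait for both the prefetch and the kernel of this iteration
         * before the buffers are reused. */
        cudaStreamSynchronize(copy_s);
        cudaStreamSynchronize(comp_s);
    }

    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}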


Thank you once again for your help!

Best regards,
--
David Pereira


> Date: Mon, 15 Sep 2014 13:00:10 +0200
> From: samuel.thibault@ens-lyon.org
> To: david_sape@hotmail.com
> CC: starpu-devel@lists.gforge.inria.fr
> Subject: Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU

> Hello,

> David Pereira, le Sat 13 Sep 2014 20:25:37 +0000, a écrit :
> > I have a problem which I cannot identify. When using the "pheft" or "dmda"
> > schedulers with only one GPU (STARPU_NCUDA=1), the program runs smoothly,
> > but when I use the two GPUs that I have, an unspecified launch failure
> > occurs from a kernel launched by a task. This error happens sometimes in
> > one kernel but may also happen in other kernels.
> > 
> > The "dm", "eager" and "peager" schedulers don't have this problem.

> Hum. dm and dmda are very similar; the difference essentially lies in
> the scheduling decision: dmda takes data transfer time into account,
> while dm does not, so dmda tends to reduce the number of data transfers.

> > Also, is it possible that a CUDA program runs faster when using StarPU
> > (using one GPU) than running the program without the framework? The
> > StarPU version has parallel tasks unlike the version without the
> > framework (which does not use streams).

> StarPU implements a lot of optimisations such as overlapping data
> transfers with computations, which are often not easy to code directly
> in the application. The parallel implementation of tasks can also of
> course considerably reduce the critical path in the task graph.

> > I'm thinking that StarPU may be launching concurrent kernels... Is
> > that right?

> It does not do that unless you explicitly set STARPU_NWORKER_PER_CUDA,
> and that is implemented only in the trunk, not 1.1.x.

> > Last question, how can I know if GPUDirect is being used? I was analyzing
> > information from Vite and I think the transfers go from one GPU's
> > global memory to RAM first before going to the other GPU.

> It's not simple. Seeing the data go through the RAM in Vite
> clearly means GPUDirect was not used. Which data interface are you
> using (vector/matrix/something else)? The interface has to have a
> cuda_to_cuda_async or an any_to_any method for StarPU to be able to make
> GPU-GPU transfers. Then you will see transfers go from one GPU to another
> in the Vite trace. This, however, does not necessarily mean GPUDirect
> is enabled, notably when the two GPUs are not on the same NUMA node. One
> can see whether GPU-Direct is enabled when the bus gets calibrated: after
> "CUDA 0 -> 1", one can also see "GPU-Direct 0 -> 1" when it was possible
> to enable it. When it's not displayed, it means CUDA does not allow
> GPU-Direct, and thus even if StarPU submits GPU-GPU transfers, and Vite
> shows them as such, the CUDA driver actually makes the data go through
> the host (but more efficiently than if StarPU submitted separate GPU-CPU
> + CPU-GPU transfers).

> Samuel
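
Regarding the cuda_to_cuda_async / any_to_any point above: if I read the custom data interface API correctly, it is the copy-methods table that lets StarPU submit GPU-GPU transfers. A minimal sketch, assuming a made-up interface struct (only the any_to_any entry is the point):

/* Sketch: struct my_interface and its fields are hypothetical; the
 * starpu_data_copy_methods table and starpu_interface_copy() are the
 * actual StarPU custom-interface API as I understand it. */
#include <starpu.h>

struct my_interface
{
    uintptr_t ptr;   /* pointer of this replicate (host or device) */
    size_t size;     /* data size in bytes */
};

/* Generic copy method: starpu_interface_copy() picks the right driver path
 * (including CUDA-CUDA when the driver allows it) from src_node/dst_node. */
static int copy_any_to_any(void *src_interface, unsigned src_node,
                           void *dst_interface, unsigned dst_node,
                           void *async_data)
{
    struct my_interface *src = (struct my_interface *) src_interface;
    struct my_interface *dst = (struct my_interface *) dst_interface;

    return starpu_interface_copy(src->ptr, 0, src_node,
                                 dst->ptr, 0, dst_node,
                                 src->size, async_data);
}

/* This table is referenced from the interface's struct starpu_data_interface_ops
 * through its copy_methods field. */
static struct starpu_data_copy_methods my_copy_methods =
{
    .any_to_any = copy_any_to_any,
};

If I am not mistaken, the built-in vector and matrix interfaces already provide such methods, so with those the transfers should at least be submitted as GPU-GPU.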


