Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU
- From: David Pereira <david_sape@hotmail.com>
- To: Samuel Thibault <samuel.thibault@ens-lyon.org>
- Cc: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
- Subject: Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU
- Date: Sat, 20 Sep 2014 23:30:27 +0100
Hi,
Thank you, Samuel, for answering my questions.
I found the reason for the unspecified launch failure: it turned out to come from destroying a cuSPARSE handle twice.
Is there a way to initialize cuSPARSE across all GPU devices, the same way we do with cuBLAS (starpu_cublas_init)?
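In the meantime I am initializing the handles with something like the sketch below (the helper names are my own; it assumes starpu_execute_on_each_worker is the right tool, mirroring what starpu_cublas_init does for cuBLAS):

#include <starpu.h>
#include <cusparse.h>

/* One cuSPARSE handle per StarPU worker. */
static cusparseHandle_t cusparse_handles[STARPU_NMAXWORKERS];

static void init_cusparse_func(void *arg)
{
    (void)arg;
    /* Runs on the worker's thread, which is already bound to its CUDA device. */
    cusparseCreate(&cusparse_handles[starpu_worker_get_id()]);
}

static void shutdown_cusparse_func(void *arg)
{
    (void)arg;
    cusparseDestroy(cusparse_handles[starpu_worker_get_id()]);
}

void my_cusparse_init(void)   /* hypothetical helper */
{
    starpu_execute_on_each_worker(init_cusparse_func, NULL, STARPU_CUDA);
}

void my_cusparse_shutdown(void)   /* hypothetical helper */
{
    /* Each handle is destroyed exactly once, which avoids the
       double-destroy that caused my unspecified launch failure. */
    starpu_execute_on_each_worker(shutdown_cusparse_func, NULL, STARPU_CUDA);
}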
As for this question:
> > Also, is it possible that a CUDA program runs faster when using StarPU
> > (using one GPU) than running the program without the framework? The
> > StarPU version has parallel tasks unlike the version without the
> > framework (which does not use streams).
>
> StarPU implements a lot of optimisations such as overlapping data
> transfers with computations, which are often not easy to code directly
> in the application. The parallel implementation of tasks can also of
> course considerably reduce the critical path in the task graph.
How can overlapping data transfers with computations improve the execution time of the StarPU version (using only one GPU) over the plain CUDA version, given that my algorithm runs entirely on the GPU, without transfers to the CPU?
Also, even with a parallel implementation of tasks, if I only execute on a single GPU (without concurrent kernels), I would not expect better results with StarPU, yet that is what I get. I can't seem to find a logical reason to explain this...
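For reference, here is a minimal sketch of the difference I suspect (the kernel is a hypothetical stand-in; the streamed loop only approximates what StarPU's CUDA driver does when tasks are submitted asynchronously):

#include <cuda_runtime.h>

__global__ void scale(float *buf, int n)   /* hypothetical kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;
}

void run_synchronous(float *d_buf, int n, int iters)
{
    for (int i = 0; i < iters; i++) {
        scale<<<(n + 255) / 256, 256>>>(d_buf, n);
        /* Blocking: the host waits for every kernel before launching
           the next one, leaving gaps between kernels on the GPU. */
        cudaDeviceSynchronize();
    }
}

void run_streamed(float *d_buf, int n, int iters)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < iters; i++)
        /* Non-blocking: launches queue up back to back, so the GPU
           does not idle between kernels. */
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}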
Thank you once again for your help!
Best regards,
--
David Pereira
> Date: Mon, 15 Sep 2014 13:00:10 +0200
> From: samuel.thibault@ens-lyon.org
> To: david_sape@hotmail.com
> CC: starpu-devel@lists.gforge.inria.fr
> Subject: Re: [Starpu-devel] StarPU fails when using dmda/pheft with more than 1 GPU
>
> Hello,
>
> David Pereira, on Sat 13 Sep 2014 20:25:37 +0000, wrote:
> > I have a problem which I cannot identify. When using the "pheft" or "dmda"
> > schedulers with only one GPU (STARPU_NCUDA=1), the program runs smoothly,
> > but when I use the two GPUs that I have, an unspecified launch failure
> > occurs from a kernel launched by a task. This error happens sometimes in
> > one kernel but may also happen in other kernels.
> >
> > The "dm", "eager" and "peager" schedulers don't have this problem.
>
> Hum. dm and dmda are very similar; the difference essentially lies in
> the scheduling decision: dmda takes data transfer time into account,
> while dm does not, so dmda tends to reduce the number of data transfers.
>
> > Also, is it possible that a CUDA program runs faster when using StarPU
> > (using one GPU) than running the program without the framework? The
> > StarPU version has parallel tasks unlike the version without the
> > framework (which does not use streams).
>
> StarPU implements a lot of optimisations such as overlapping data
> transfers with computations, which are often not easy to code directly
> in the application. The parallel implementation of tasks can also of
> course considerably reduce the critical path in the task graph.
>
> > I'm thinking that StarPU may be launching concurrent kernels... Is
> > that right?
>
> It does not do that unless you explicitly set STARPU_NWORKER_PER_CUDA,
> and that is implemented only in the trunk, not 1.1.x.
>
> > Last question, how can I know if GPUDirect is being used? I was analyzing
> > information from ViTE and I think the transfers go from one GPU's
> > global memory to RAM first before going to the other GPU.
>
> It's not simple. Seeing the data go through the RAM in ViTE
> clearly means GPUDirect was not used. Which data interface are you
> using (vector/matrix/something else)? The interface has to have a
> cuda_to_cuda_async or any_to_any method for StarPU to be able to make
> GPU-GPU transfers. Then you will see transfers go from one GPU to the
> other in the ViTE trace. This however does not necessarily mean GPUDirect
> is enabled, notably when the two GPUs are not on the same NUMA node. One
> can see whether GPUDirect is enabled when the bus gets calibrated: after
> CUDA 0 -> 1, one can see GPU-Direct 0 -> 1 when it was possible to
> enable it. When it is not displayed, it means CUDA does not allow
> GPUDirect, and thus even if StarPU submits GPU-GPU transfers, and ViTE
> shows them as such, the CUDA driver actually makes the data go through
> the host (though more efficiently than if StarPU submitted separate
> GPU-to-CPU + CPU-to-GPU transfers).
>
> Samuel
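Following up on the interface remark: if I understand correctly, a custom interface declares the asynchronous GPU-GPU copy roughly as in the sketch below (the field name comes from struct starpu_data_copy_methods; the copy function body is hypothetical):

#include <errno.h>
#include <cuda_runtime.h>
#include <starpu.h>

static int my_cuda_to_cuda_async(void *src_interface, unsigned src_node,
                                 void *dst_interface, unsigned dst_node,
                                 cudaStream_t stream)
{
    (void)src_interface; (void)src_node;
    (void)dst_interface; (void)dst_node; (void)stream;

    /* Issue the device-to-device copy here, e.g. with
       cudaMemcpyPeerAsync(dst_ptr, dst_dev, src_ptr, src_dev, size, stream);
       pointers and device numbers come from the interface structures. */

    /* Returning -EAGAIN tells StarPU the transfer is still ongoing
       and that it should wait on the stream. */
    return -EAGAIN;
}

static const struct starpu_data_copy_methods my_copy_methods = {
    .cuda_to_cuda_async = my_cuda_to_cuda_async,
};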