starpu-devel - Re: [Starpu-devel] Overlapping communication with computation on GPU

Subject: Developers list for StarPU

From: Samuel Thibault <samuel.thibault@ens-lyon.org>
To: Nabeel Alsaber <nabeel.alsaber1@gmail.com>
Cc: starpu-devel@lists.gforge.inria.fr
Subject: Re: [Starpu-devel] Overlapping communication with computation on GPU
Date: Mon, 19 Jan 2015 17:35:49 +0100
List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,

Nabeel Alsaber wrote on Fri 16 Jan 2015 12:19:32 -0500:
> If I remove the stream synchronizations
> cudaStreamSynchronize(starpu_cuda_get_local_stream()), I get overlapping
> with any scheduler.

Err, no, without synchronization the behavior becomes undefined. It
may happen to produce the correct result, but only by luck. Without
synchronization, there is no guarantee that the computation kernel has
finished before StarPU tries to bring the content back to main memory.
It probably gives the right result in the mult case only because all
tasks are independent and the data is brought back to main memory at
data unregistration.

It also happens that not synchronizing lets the StarPU driver regain
control and start data transfers earlier, but there is no need for
this to get overlapping: with a scheduler that schedules in advance,
such as dmda or ws, StarPU will already start a dozen transfers ahead
of time so as to overlap transfers with computation.

> But when I use the synchronizations (even with the dmda scheduler) I
> don't get overlapping.

What makes you think that you do not get overlapping? Are you reading
a trace? Did you make sure to have the performance model calibrated so
that dmda can actually work properly? Which parameters are you providing
to mult? For GPUs to be really worthwhile you need big tiles; we have
recently made mult use -x 3840 -y 3840 -z 3840 by default. Which
version of StarPU are you using?

> Is there a different way to synchronize the CUDA kernels which allows
> overlapping?

Not in StarPU 1.1.x, and there is no real need for it: as documented,
all you need to do to get overlapping is to allocate the host buffers
with starpu_malloc and synchronize with streams. The mult example does
both.

The trunk has an async mode, which makes it possible to avoid the
synchronization and thus lets StarPU be more reactive about data
transfers, but this is only a small improvement. This mode is *not* a
requirement to get data transfer overlap: the mult example of StarPU 1.1
is fully able to achieve it as it is, when run with the dmda scheduler
and properly calibrated performance models.

Samuel

Re: [Starpu-devel] Overlapping communication with computation on GPU, Nabeel Alsaber, 16/01/2015
- Re: [Starpu-devel] Overlapping communication with computation on GPU, Samuel Thibault, 19/01/2015
