starpu-devel - Re: [Starpu-devel] Performance decline from StarPU 1.2.3 to 1.2.4

  • From: Samuel Thibault <samuel.thibault@inria.fr>
  • To: Mirko Myllykoski <mirkom@cs.umu.se>, Starpu Devel <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] Performance decline from StarPU 1.2.3 to 1.2.4
  • Date: Thu, 27 Sep 2018 18:32:50 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
  • Organization: I am not organized

Samuel Thibault, on Thu. 27 Sep. 2018 18:08:54 +0200, wrote:
> Mmmm, could you post your config.log?

Also you could try configuring with --disable-cuda-memcpy-peer.

What does your data look like? I guess the allocations have varying sizes?
How do you allocate them, with starpu_malloc? Do you use temporary
buffers?

I'm afraid the issue you are hitting is that registering main memory
for CUDA DMA transfers is quite costly, and actually not worth the
cost for the benefit of asynchronous transfers. Setting
--disable-cuda-memcpy-peer by hand disables the asynchronism
optimization and thus the data registration. Ideally registration would
be cheap, or cached, to avoid the issue. If your temporary data pieces
are small (less than 4MB), you could try the attached patch, which makes
StarPU use a suballocator not only for CUDA memory but also for main
memory. We didn't think that would be useful, but perhaps it is.
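
To illustrate why a suballocator can help here, the following is a minimal sketch, not StarPU's actual code. It assumes a simple bump allocator over one large chunk; all `sketch_*` names, the chunk and block sizes, and the `registrations` counter are hypothetical. The point is only that the costly step (simulated by the counter) is paid once per chunk rather than once per small allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sizes, chosen for illustration only. */
#define SKETCH_CHUNK_SIZE ((size_t)1024 * 1024) /* one large chunk */
#define SKETCH_BLOCK_SIZE ((size_t)4096)        /* allocation granularity */

/* Counts simulated "expensive registration" events. */
static unsigned registrations;

struct sketch_chunk {
    char  *base; /* start of the chunk */
    size_t used; /* bytes handed out so far */
};

/* Stand-in for "allocate a big buffer and register it once". */
static int sketch_chunk_init(struct sketch_chunk *c)
{
    assert(c != NULL);
    c->base = malloc(SKETCH_CHUNK_SIZE);
    c->used = 0;
    if (c->base == NULL)
        return -1;
    registrations++; /* the costly step happens once per chunk */
    return 0;
}

/* Carve a small piece out of the chunk; no new registration is needed. */
static void *sketch_suballoc(struct sketch_chunk *c, size_t size)
{
    if (size == 0 || size > SKETCH_CHUNK_SIZE)
        return NULL;
    /* Round the request up to a whole number of blocks. */
    size_t rounded = (size + SKETCH_BLOCK_SIZE - 1)
                     / SKETCH_BLOCK_SIZE * SKETCH_BLOCK_SIZE;
    if (rounded > SKETCH_CHUNK_SIZE - c->used)
        return NULL; /* chunk exhausted */
    void *p = c->base + c->used;
    c->used += rounded;
    return p;
}

/* Release the whole chunk at once. */
static void sketch_chunk_destroy(struct sketch_chunk *c)
{
    free(c->base);
    c->base = NULL;
    c->used = 0;
}
```

A real suballocator also tracks freed blocks so they can be reused; this sketch leaves that out to keep the cost argument visible.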

Samuel
diff --git a/src/datawizard/malloc.c b/src/datawizard/malloc.c
index 632e291a0..b63b0913c 100644
--- a/src/datawizard/malloc.c
+++ b/src/datawizard/malloc.c
@@ -887,7 +887,9 @@ uintptr_t
 starpu_malloc_on_node_flags(unsigned dst_node, size_t size, int flags)
 {
 	/* Big allocation, allocate normally */
-	if (size > CHUNK_ALLOC_MAX || starpu_node_get_kind(dst_node) != STARPU_CUDA_RAM)
+	if (size > CHUNK_ALLOC_MAX ||
+		( starpu_node_get_kind(dst_node) != STARPU_CUDA_RAM
+		  && starpu_node_get_kind(dst_node) != STARPU_CPU_RAM))
 		return _starpu_malloc_on_node(dst_node, size, flags);
 
 	/* Round up allocation to block size */
@@ -985,7 +987,9 @@ void
 starpu_free_on_node_flags(unsigned dst_node, uintptr_t addr, size_t size, int flags)
 {
 	/* Big allocation, deallocate normally */
-	if (size > CHUNK_ALLOC_MAX || starpu_node_get_kind(dst_node) != STARPU_CUDA_RAM)
+	if (size > CHUNK_ALLOC_MAX ||
+		( starpu_node_get_kind(dst_node) != STARPU_CUDA_RAM
+		  && starpu_node_get_kind(dst_node) != STARPU_CPU_RAM))
 	{
 		_starpu_free_on_node_flags(dst_node, addr, size, flags);
 		return;
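
The condition changed in both hunks can be read as a predicate that decides when the suballocator is bypassed. The sketch below restates it in standalone C so the before/after behaviour can be checked. The enum, the function names, and the 4MB threshold are assumptions made for illustration (the threshold is taken from the "less than 4MB" remark in the message, not from the StarPU sources).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for StarPU's node kinds. */
enum sketch_node_kind { SKETCH_CPU_RAM, SKETCH_CUDA_RAM, SKETCH_OTHER_RAM };

/* Assumed threshold, based on the "less than 4MB" remark in the message. */
#define SKETCH_ALLOC_MAX ((size_t)4 * 1024 * 1024)

/* Before the patch: only small CUDA allocations use the suballocator. */
static bool sketch_bypass_before(enum sketch_node_kind kind, size_t size)
{
    return size > SKETCH_ALLOC_MAX || kind != SKETCH_CUDA_RAM;
}

/* After the patch: small main-memory allocations use it as well. */
static bool sketch_bypass_after(enum sketch_node_kind kind, size_t size)
{
    return size > SKETCH_ALLOC_MAX
        || (kind != SKETCH_CUDA_RAM && kind != SKETCH_CPU_RAM);
}
```

Under these assumptions, the only case whose outcome changes is a small allocation on a main-memory node: it was allocated normally before the patch and goes through the suballocator after it.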


