starpu-devel - Re: [Starpu-devel] Performance decline from StarPU 1.2.3 to 1.2.4

Subject: Developers list for StarPU

  • From: Mirko Myllykoski <mirkom@cs.umu.se>
  • To: samuel.thibault@inria.fr
  • Cc: Starpu Devel <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] Performance decline from StarPU 1.2.3 to 1.2.4
  • Date: Tue, 02 Oct 2018 22:05:42 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

On 2018-10-01 15:59, Samuel Thibault wrote:
Mirko Myllykoski, on Thu. 27 Sept. 2018 20:21:14 +0200, wrote:
On 2018-09-27 18:32, Samuel Thibault wrote:
> I'm afraid the issue you are getting is that main memory registration
> for CUDA DMA transfers is quite costly, and actually not worth the
> cost for the asynchronism benefit. Setting --disable-cuda-memcpy-peer
> by hand allows to disable asynchronism optimization and thus the data
> registration. Ideally registration would be cheap, or cached, to avoid
> the issue. If your temporary data pieces are small (less than 4MB), you
> could try the attached patch which will make StarPU use a suballocator
> not only for the CUDA memory but also for the main memory, which we
> didn't feel would be useful, but perhaps would be.

Compiling StarPU 1.2.4 with this patch (without --disable-cuda-memcpy-peer)
fixes the problem.

Ok, that's the best option: you keep memcpy-peer, which avoids spurious
synchronization, but avoid paying the corresponding cost all the
time. I have committed a more elaborate version of the patch to all
branches.

Thanks!
Samuel

Hi,

I believe that the patch did not completely fix the problem. I am now experiencing a similar issue with a different code. Again, performance declines when I move from StarPU 1.2.3 to 1.2.4. Compiling StarPU with --disable-cuda-memcpy-peer fixes the problem, but this time the patch seems to have no effect (I tried both the one you sent me earlier and commit 8c35781b0ac517304ac2f2461b243b4447c38ab3).

The code uses two scheduling contexts:
A) Uses peager or pheft. In this run, pheft is used and the context contains a single GPU (the context can also contain some CPU workers).
B) Uses prio or dmdasd. In this run, dmdasd is used and the context contains five CPU workers.

The code has three task types: process_panel, update_trail and update_right. The first two are inserted into A and the last one into B.
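For reference, the context setup roughly follows this pattern. This is a minimal sketch, not the actual code: the codelet names and data handles are placeholders, error checking is omitted, and the worker counts are taken from the description above.

```c
#include <starpu.h>

/* Hypothetical codelets standing in for the real ones. */
extern struct starpu_codelet process_panel_cl, update_trail_cl, update_right_cl;

void setup_and_insert(starpu_data_handle_t panel, starpu_data_handle_t trail,
                      starpu_data_handle_t right)
{
    /* Context A: one GPU worker (optionally plus some CPU workers), pheft scheduler. */
    int gpu_workers[1];
    starpu_worker_get_ids_by_type(STARPU_CUDA_WORKER, gpu_workers, 1);
    unsigned ctx_a = starpu_sched_ctx_create(gpu_workers, 1, "ctx_a",
                                             STARPU_SCHED_CTX_POLICY_NAME, "pheft",
                                             0);

    /* Context B: five CPU workers, dmdasd scheduler. */
    int cpu_workers[5];
    starpu_worker_get_ids_by_type(STARPU_CPU_WORKER, cpu_workers, 5);
    unsigned ctx_b = starpu_sched_ctx_create(cpu_workers, 5, "ctx_b",
                                             STARPU_SCHED_CTX_POLICY_NAME, "dmdasd",
                                             0);

    /* process_panel and update_trail go to context A, update_right to B. */
    starpu_task_insert(&process_panel_cl, STARPU_SCHED_CTX, ctx_a, STARPU_RW, panel, 0);
    starpu_task_insert(&update_trail_cl,  STARPU_SCHED_CTX, ctx_a, STARPU_RW, trail, 0);
    starpu_task_insert(&update_right_cl,  STARPU_SCHED_CTX, ctx_b, STARPU_RW, right, 0);
}
```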

When compiled without --disable-cuda-memcpy-peer, the workers that belong to B report a lot more overhead. This does not happen when compiled with --disable-cuda-memcpy-peer. The code also runs slower when compiled without --disable-cuda-memcpy-peer.
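For completeness, the two builds being compared were configured along these lines (install prefixes are placeholders; --with-fxt is needed for the traces linked below):

```shell
# Build showing the regression (asynchronous CUDA transfers enabled by default):
./configure --prefix=$HOME/starpu-1.2.4 --with-fxt
make -j && make install

# Build that behaves like 1.2.3 again:
./configure --prefix=$HOME/starpu-1.2.4-nopeer --with-fxt --disable-cuda-memcpy-peer
make -j && make install
```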

Fxt trace with --disable-cuda-memcpy-peer: https://drive.google.com/open?id=1DuzEvPVSzB9lc1IjLlN44gjcw-mc7oN5

Fxt trace without --disable-cuda-memcpy-peer: https://drive.google.com/open?id=1oMTqXb4kYLQifbFSF2w5BU0Ivuz8kzHF

- Mirko



