
starpu-devel - Re: [Starpu-devel] Performance with SOCL on multiple devices

Subject: Developers list for StarPU

  • From: Malcolm Roberts <malcolm.i.w.roberts@gmail.com>
  • To: Samuel Thibault <samuel.thibault@inria.fr>, starpu-devel@lists.gforge.inria.fr, "helluy@math.unistra.fr" <helluy@math.unistra.fr>, Bruno Weber <bruno.weber@axessim.fr>, Thomas Strub <thomas.strub@axessim.fr>
  • Subject: Re: [Starpu-devel] Performance with SOCL on multiple devices
  • Date: Wed, 13 Jan 2016 15:55:58 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello.

I've spent some more time getting things to run in parallel, and I've had some interesting results. In particular, I can now launch waves of DGMacroCellInterface kernels in parallel. However, this seems to do nothing but slow things down. I used nvvp to visualize the scheduling, and it seems that launching kernels in parallel (with lots of OpenCL events in each kernel's wait list) adds a lot of overhead: the inter-kernel time was around 50 times larger than when I launched the kernels in sequence. I looked at the same run under SOCL as well, and it appears to be the same problem: the overhead involved in launching kernels delays the start of execution, and this is what causes the performance degradation.
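To make the inter-kernel comparison concrete, here is a minimal sketch of how such gaps could be computed from per-kernel start/end timestamps, for example those reported by OpenCL event profiling (`CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`). The function name and both timelines are hypothetical: the numbers are made up to mirror the roughly 50x ratio described above and are not measured data.

```python
# Illustrative sketch only: the timelines below are invented, not measurements.

def inter_kernel_gaps(intervals):
    """Return the idle time before each kernel after the first on one device.

    `intervals` is a list of (start_ns, end_ns) pairs, such as values read
    from CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END.
    """
    ordered = sorted(intervals)
    if not ordered:
        return []
    gaps = []
    busy_until = ordered[0][1]
    for start, end in ordered[1:]:
        # A kernel that starts while another is still running adds no idle gap.
        gaps.append(max(0, start - busy_until))
        busy_until = max(busy_until, end)
    return gaps


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical timelines in nanoseconds for the two launch strategies.
sequential = [(0, 100_000), (102_000, 202_000), (204_000, 304_000)]
with_wait_lists = [(0, 100_000), (200_000, 300_000), (400_000, 500_000)]

seq_gap = mean(inter_kernel_gaps(sequential))
par_gap = mean(inter_kernel_gaps(with_wait_lists))
ratio = par_gap / seq_gap
print(f"mean gap: {seq_gap:.0f} ns vs {par_gap:.0f} ns (ratio {ratio:.0f}x)")
```

Tracking `busy_until` rather than only the previous kernel's end keeps the gap at zero when kernels genuinely overlap, so only time during which the device sits idle is counted.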

I'm not sure what the best strategy is right now, but I think that I'll spend some time on improving the individual kernels while I think about what to do next.

Thanks for your help!

Best,

~Malcolm

On 06/01/2016 17:16, Samuel Thibault wrote:
> Samuel Thibault, on Wed 06 Jan 2016 17:03:42 +0100, wrote:
>> With these changes, I do see kernels running on various GPUs. It does
>> not seem faster, but that is probably due to data transfers.
>
> And also serialized kernels
>
>> You will probably want to use FxT to dump traces and read them with
>> Vite.
>
> It notably seems that the DGMacroCellInterface kernels are completely
> serialized, thus awfully bad performance due to no parallelism and
> data transfers :) DGFlux does get parallelized, on the other hand. The
> duration is however quite tiny (60µs), so it's really not efficient
> compared with the runtime overhead.
>
> Samuel
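
The point about the 60µs DGFlux kernels can be put as simple arithmetic: if each task costs a fixed runtime overhead on top of its kernel time, the useful fraction of wall time is kernel / (kernel + overhead). The sketch below assumes that simple additive model. The overhead values are hypothetical placeholders, since the thread does not give a measured overhead; only the 60µs kernel duration comes from the message above.

```python
# Illustrative sketch: additive overhead model with hypothetical overheads.

def useful_fraction(kernel_us, overhead_us):
    """Fraction of per-task wall time spent inside the kernel itself."""
    return kernel_us / (kernel_us + overhead_us)


KERNEL_US = 60.0  # DGFlux kernel duration quoted in the thread

# Hypothetical per-task runtime overheads, in microseconds.
for overhead_us in (10.0, 60.0, 600.0):
    fraction = useful_fraction(KERNEL_US, overhead_us)
    print(f"overhead {overhead_us:6.1f} us -> {fraction:.1%} useful work")
```

Under this model, an overhead equal to the kernel duration already halves the useful fraction, which is why very short kernels gain little from being parallelized.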

--
http://malcolmiwroberts.com




