- From: Xavier Lacoste <xavier.lacoste@inria.fr>
- To: starpu-devel@lists.gforge.inria.fr
- Subject: [Starpu-devel] MPI scaling
- Date: Mon, 23 Jun 2014 09:19:19 +0200
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Good morning,
I'm currently trying to evaluate the MPI version of PaStiX with StarPU.
I'm using only CPUs for the moment.
All these results were obtained on fourmi nodes from PlaFRIM.
I have two variants of the MPI version:
Fan-out: when there is an update from a column block A on proc 0 onto another column block B on proc 1, column block A is sent to proc 1, which performs the update there.
Fan-in: corresponds to what is done in native PaStiX. A temporary column block is created on proc 0 to accumulate all the local updates destined for B on proc 1. Once all updates have been performed, the temporary column block is sent to proc 1 and added to B. Native PaStiX does the same, except that the temporary buffer is divided into blocks that can have different widths (in fact, when I looked into it, the widths do not change much and the number of coefficients exchanged is about the same). These blocks are gathered, and the communication is sent once the volume of data is large enough (less than one column block in the test cases I looked at).
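To make the trade-off concrete, here is a minimal sketch (plain Python, all names hypothetical and not part of PaStiX or StarPU) of the communication cost of the two schemes, assuming each local update touches a column block of `width` columns and `height` rows:

```python
# Toy model of the two communication schemes described above.
# Assumption (for illustration only): each of n_updates local updates
# from proc 0 targets the same remote column block B of size width * height.

def fanout_cost(n_updates, width, height):
    # Fan-out: each updating column block A is sent to proc 1,
    # which performs the update itself -> one message per update.
    n_messages = n_updates
    n_coeffs = n_updates * width * height
    return n_messages, n_coeffs

def fanin_cost(n_updates, width, height):
    # Fan-in: updates are accumulated locally into a temporary column
    # block; only the accumulated block is sent -> a single message.
    n_messages = 1
    n_coeffs = width * height
    return n_messages, n_coeffs

if __name__ == "__main__":
    print(fanout_cost(10, 4, 100))  # (10, 4000)
    print(fanin_cost(10, 4, 100))   # (1, 400)
```

The sketch only counts messages and coefficients; it ignores the extra memory and the final addition on the receiver that fan-in pays for its reduced traffic.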
I'm using the eager scheduler since I have no GPUs, and even when I add some, as I use static scheduling, eager will be enough. (DMDA gave worse results with MPI.)
Within one node, with shared memory only, the StarPU version is competitive with PaStiX (+7.5% time).
StarPU with fan-in:
audi_1x8_fanin_20140618_163334: Time to factorize 129 s
With fan-out:
audi_1x8_fanou_20140618_163334: Time to factorize 129 s
Native PaStiX:
audi_1x8_pastix_20140618_163334: Time to factorize 120 s
With 2 nodes, as expected, the fan-out version takes more time to execute, and scaling with StarPU is not as good as with native PaStiX:
audi_2x8_fanin_20140618_163334: Time to factorize 87.7 s (+24.5%)
audi_2x8_fanou_20140618_163334: Time to factorize 84.8 s (+20.4%)
audi_2x8_pastix_20140618_163334: Time to factorize 70.4 s
With 4 nodes:
audi_4x8_fanin_20140618_163334: Time to factorize 58 s (+54%)
audi_4x8_fanou_20140618_163334: Time to factorize 67.2 s (+79%)
audi_4x8_pastix_20140618_163334: Time to factorize 37.5 s
Using only 7 threads per node gives better results; the StarPU communication thread may wait less before sending/receiving contributions:
audi_4x7_fanin_20140619_162457: Time to factorize 57.6 s (+26%)
audi_4x7_fanou_20140619_162457: Time to factorize 63.5 s (+40%)
audi_4x7_pastix_20140619_162457: Time to factorize 45.2 s
In native PaStiX, communications are done in MPI_THREAD_MULTIPLE mode. I could activate the FUNNELED mode to get closer to StarPU's model (one communication thread performing sends, receives, and the addition of received data), but it would not take 26% more time...
With 8 nodes:
audi_8x7_fanin_20140619_144537: Time to factorize 44.8 s (+37%)
audi_8x7_fanou_20140619_144537: Time to factorize 56.8 s (+74%)
audi_8x7_pastix_20140619_144537: Time to factorize 32.6 s
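For reference, the overheads quoted in parentheses above can be recomputed from the reported factorization times; a small Python helper (the times are copied verbatim from this message, and rounding may differ slightly from the figures in the text):

```python
# Overhead of the two StarPU variants relative to native PaStiX,
# recomputed from the factorization times reported above.
times = {  # config "nodes x threads": (fan-in, fan-out, native PaStiX) in s
    "1x8": (129.0, 129.0, 120.0),
    "2x8": (87.7, 84.8, 70.4),
    "4x8": (58.0, 67.2, 37.5),
    "4x7": (57.6, 63.5, 45.2),
    "8x7": (44.8, 56.8, 32.6),
}

def overhead(t, t_ref):
    """Relative overhead of time t versus reference time t_ref, in percent."""
    return 100.0 * (t / t_ref - 1.0)

for cfg, (fanin, fanout, pastix) in times.items():
    print(f"{cfg}: fan-in +{overhead(fanin, pastix):.1f}%, "
          f"fan-out +{overhead(fanout, pastix):.1f}%")
```

The recomputation makes the trend visible: the overhead of both StarPU variants relative to native PaStiX grows with the node count, and fan-out degrades faster than fan-in beyond 2 nodes.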
I attach the 4x8 execution trace. I would like to know if there is a way to see whether a communication is triggered as soon as the last local update has been performed.
Do you have any advice for improving MPI scaling? In my case, using all 8 cores of a node is worse than using 7 threads per node... Have you already experienced that?
Regards,
XL.
----------------------------------------
Xavier Lacoste
INRIA Bordeaux Sud-Ouest
200, avenue de la Vieille Tour
33405 Talence Cedex
Tél : +33 (0)5 24 57 40 69
Attachment:
signature.asc
Description: Message signed with OpenPGP using GPGMail