starpu-devel - Re: [Starpu-devel] Worker Binding Problem

Subject: Developers list for StarPU

  • From: Andra Hugo <andra.hugo@inria.fr>
  • To: Berenger Bramas <berenger.bramas@inria.fr>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] Worker Binding Problem
  • Date: Wed, 9 Sep 2015 14:45:03 +0200 (CEST)
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

I think StarPU reuses the OpenMP threads (which were already bound)... Is that
what you want? If not, you could probably call omp_set_num_threads(1) after
your OpenMP code so that StarPU creates new ones.

Andra
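
A minimal sketch of that suggestion, assuming the test calls openmp_test() and then starpu_test() as in the quoted message below. The helper name shrink_openmp_team is hypothetical, and this thread does not confirm whether the call changes how the StarPU workers behave afterwards.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: ask the OpenMP runtime to use a single thread for
 * later parallel regions. Returns the team size a later parallel region
 * would get by default (always 1 when built without OpenMP support). */
static int shrink_openmp_team(void)
{
#ifdef _OPENMP
    omp_set_num_threads(1);        /* affects subsequent parallel regions only */
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

The call would go between openmp_test() and starpu_test(). It only changes the setting for later parallel regions; what the runtime does with its existing idle threads is implementation-specific.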

----- Original Message -----
> From: "Berenger Bramas" <berenger.bramas@inria.fr>
> To: "Samuel Thibault" <samuel.thibault@inria.fr>
> Cc: starpu-devel@lists.gforge.inria.fr
> Sent: Wednesday, 9 September 2015 14:32:55
> Subject: Re: [Starpu-devel] Worker Binding Problem
>
> OK, it looks like StarPU is not the cause.
> In the test file, I do something like:
> ==========================
> openmp_test();
>
> starpu_test();
> ==========================
>
> If I comment out the OpenMP test, the StarPU test is perfect...
> More precisely, each OpenMP thread (excluding the master) that is added and
> bound makes the StarPU threads created later slower.
>
> If I run with 5 bound OpenMP threads and 3 StarPU workers (5 OMP BIND 3 STARPU):
> ======================================================
> OMP_DYNAMIC=false OMP_NUM_THREADS=5 OMP_PROC_BIND=TRUE STARPU_NCPUS=3
> numactl -l ./testStarPUOpenMPv2.exe
> Test with OpenMP:
> [1]Done = 0.721296s
> [2]Done = 0.721693s
> [4]Done = 0.721859s
> [3]Done = 0.721989s
> [0]Done = 0.72314s
> Starpu:
> [0]Done = 0.722181s
> [1]Done = 1.44202s
> [2]Done = 1.45352s
> ======================================================
>
> Now the same run, but without binding for OpenMP (5 OMP NO-BIND 3 STARPU):
> ======================================================
> OMP_DYNAMIC=false OMP_NUM_THREADS=5 OMP_PROC_BIND=FALSE STARPU_NCPUS=3
> numactl -l ./testStarPUOpenMPv2.exe
> OpenMP:
> [2]Done = 0.721822s
> [3]Done = 0.722332s
> [4]Done = 0.72412s
> [0]Done = 0.729278s
> [1]Done = 0.72942s
> Starpu:
> [2]Done = 0.721866s
> [1]Done = 0.722248s
> [0]Done = 0.729274s
> ======================================================
>
> What if I use only one OpenMP thread (just the master) and bind it (1 OMP
> BIND 3 STARPU):
> ======================================================
> OMP_DYNAMIC=false OMP_NUM_THREADS=1 OMP_PROC_BIND=FALSE STARPU_NCPUS=3
> numactl -l ./testStarPUOpenMPv2.exe
> Test with OpenMP:
> [0]Done = 0.726867s
> Test with StarPU:
> [2]Done = 0.721073s
> [1]Done = 0.722398s
> [0]Done = 0.725718s
> ======================================================
>
> I set OMP_DYNAMIC=false to ensure that the OpenMP threads do not spin while
> waiting (even though I am outside the OpenMP parallel section when I run the
> StarPU test, so I suppose the GCC implementation puts the threads to sleep
> anyway). So I clearly do not understand why this happens.
> In my real application I have something similar: an OpenMP precomputation
> stage followed by a StarPU execution (and I suppose I have the same problem
> there, although I still need to check in detail).
>
>
> Bérenger Bramas
>
> HiePACS Project
>
> Tel (05 24 57) 40 76
> INRIA BORDEAUX Sud Ouest
>
>
> ----- Original Message -----
> | From: "Berenger Bramas" <berenger.bramas@inria.fr>
> | To: "Samuel Thibault" <samuel.thibault@inria.fr>
> | Cc: starpu-devel@lists.gforge.inria.fr
> | Sent: Wednesday, 9 September 2015 13:39:27
> | Subject: Re: [Starpu-devel] Worker Binding Problem
> |
> | I checked, and it looks like OpenMP is also using the default system
> | values. Only OpenMP thread 0 differs slightly.
> |
> | - Here is the output for 3 threads:
> | ===========================================
> | Test with OpenMP:
> | [0] stackaddr 0x7fff6a08d000
> | [0] stacksize 8380416
> | [0] guard_size 0
> | [1] stackaddr 0x7faa06236000
> | [1] stacksize 8392704
> | [1] guard_size 4096
> | [2] stackaddr 0x7faa05a35000
> | [2] stacksize 8392704
> | [2] guard_size 4096
> | [2]Done = 0.81137s
> | [0]Done = 0.812685s
> | [1]Done = 0.827188s
> |
> | Test with StarPU:
> | [0] stackaddr 0x7faa05234000
> | [0] stacksize 8392704
> | [0] guard_size 4096
> | [2] stackaddr 0x7faa04232000
> | [2] stacksize 8392704
> | [2] guard_size 4096
> | [1] stackaddr 0x7faa04a33000
> | [1] stacksize 8392704
> | [1] guard_size 4096
> | [0]Done = 0.810964s
> | [2]Done = 0.827862s
> | [1]Done = 1.61425s
> | ===========================================
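
Output of the shape shown above can be produced from inside each thread with pthread_attr_getstack and pthread_attr_getguardsize. The following is a sketch only: print_stack_info is a hypothetical helper, pthread_getattr_np is a GNU extension (glibc), and the calls the original test program actually used are not shown in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* pthread_getattr_np is a GNU extension */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Declared explicitly so the sketch still compiles if _GNU_SOURCE was not
 * in effect when the system headers were first included. */
int pthread_getattr_np(pthread_t thread, pthread_attr_t *attr);

/* Hypothetical helper: print the stack address, stack size and guard size
 * of the calling thread. Returns 0 on success or a pthread error code. */
static int print_stack_info(int id, size_t *stacksize_out)
{
    pthread_attr_t attr;
    void *stackaddr = NULL;
    size_t stacksize = 0, guardsize = 0;

    int err = pthread_getattr_np(pthread_self(), &attr);
    if (err != 0)
        return err;
    err = pthread_attr_getstack(&attr, &stackaddr, &stacksize);
    if (err == 0)
        err = pthread_attr_getguardsize(&attr, &guardsize);
    pthread_attr_destroy(&attr);
    if (err != 0)
        return err;

    printf("[%d] stackaddr %p\n", id, stackaddr);
    printf("[%d] stacksize %zu\n", id, stacksize);
    printf("[%d] guard_size %zu\n", id, guardsize);
    if (stacksize_out != NULL)
        *stacksize_out = stacksize;
    return 0;
}
```

Calling print_stack_info(id, NULL) at the start of each OpenMP thread and of each StarPU task would allow the two sets of stacks to be compared side by side.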
> |
> |
> | If I call alloca(16384) and then alloca again to store my counter, I get:
> | ===========================================
> | Test with OpenMP:
> | [1] stackaddr 0x7f9cf64f0000
> | [1] stacksize 8392704
> | [1] guard_size 4096
> | [0] stackaddr 0x7ffe53726000
> | [0] stacksize 8384512
> | [0] guard_size 0
> | [2] stackaddr 0x7f9cf5cef000
> | [2] stacksize 8392704
> | [2] guard_size 4096
> | [1]Done = 0.721168s
> | [0]Done = 0.721216s
> | [2]Done = 0.722121s
> |
> | Test with StarPU:
> | [0] stackaddr 0x7f9cf54ee000
> | [0] stacksize 8392704
> | [0] guard_size 4096
> | [starpu][starpu_task_wait_for_all] Waiting for tasks submitted to context > 0
> | [2] stackaddr 0x7f9cf44ec000
> | [2] stacksize 8392704
> | [2] guard_size 4096
> | [1] stackaddr 0x7f9cf4ced000
> | [1] stacksize 8392704
> | [1] guard_size 4096
> | [0]Done = 0.743813s
> | [2]Done = 0.744685s
> | [1]Done = 1.52117s
> | ===========================================
> |
> | This improves the performance of all the threads, but there is still one
> | thread that is slow with StarPU.
> |
> | If I change the stack size to about 700 MB (ulimit -s 699999) and use alloca:
> | ===========================================
> | Test with OpenMP:
> | [2] stackaddr 0x7f3b441f8000
> | [2] stacksize 716800000
> | [2] guard_size 4096
> | [1] stackaddr 0x7f3b6ed90000
> | [1] stacksize 716800000
> | [1] guard_size 4096
> | [0] stackaddr 0x7ffec0149000
> | [0] stacksize 716787712
> | [0] guard_size 0
> | [2]Done = 0.721143s
> | [1]Done = 0.72139s
> | [0]Done = 0.754094s
> |
> | Test with StarPU:
> | [0] stackaddr 0x7f3b09467000
> | [0] stacksize 716800000
> | [0] guard_size 4096
> | [2] stackaddr 0x7f3aad468000
> | [2] stacksize 716800000
> | [2] guard_size 4096
> | [1] stackaddr 0x7f3ade8cf000
> | [1] stacksize 716800000
> | [1] guard_size 4096
> | [2]Done = 0.750809s
> | [0]Done = 0.75761s
> | [1]Done = 1.51807s
> | ===========================================
> |
> | I also tried allocating the counter on the heap and asking for a local
> | allocation (numactl -l), but the results are similar.
> |
> | So I still cannot figure it out. I was also thinking that it might come
> | from the binding or the stack, but both look OK.
> |
> |
> | Bérenger Bramas
> |
> | HiePACS Project
> |
> | Tel (05 24 57) 40 76
> | INRIA BORDEAUX Sud Ouest
> |
> |
> | ----- Original Message -----
> | | From: "Samuel Thibault" <samuel.thibault@inria.fr>
> | | To: "Berenger Bramas" <berenger.bramas@inria.fr>
> | | Cc: starpu-devel@lists.gforge.inria.fr
> | | Sent: Wednesday, 9 September 2015 12:00:59
> | | Subject: Re: [Starpu-devel] Worker Binding Problem
> | |
> | | Berenger Bramas, on Wed 09 Sep 2015 11:42:01 +0200, wrote:
> | | > With two threads it is OK:
> | | > Test with OpenMP:
> | | > [0]Done = 0.826635s
> | | > [1]Done = 0.828059s
> | | > Starpu:
> | | > [0]Done = 0.813087s
> | | > [1]Done = 0.826654s
> | | >
> | | > With three it starts to be strange:
> | | > Test with OpenMP:
> | | > [0]Done = 0.825707s
> | | > [2]Done = 0.826624s
> | | > [1]Done = 0.827749s
> | | > Starpu:
> | | > [2]Done = 0.826262s
> | | > [0]Done = 0.826653s
> | | > [1]Done = 1.64255s
> | |
> | | That's really odd indeed; there is no fundamental reason why threads
> | | started by OpenMP and threads started by StarPU should behave
> | | differently. I suspect it could be related to stack allocation,
> | | which StarPU currently just leaves to the OS. Could you check with
> | | pthread_attr_getstack what the addresses and sizes look like in both
> | | OpenMP and StarPU on your target machine? You could also try calling
> | | alloca(16384) before calling alloca again to allocate the variable you
> | | are working on, to make sure it really gets allocated locally. There
> | | may also be cache associativity conflicts, which is why the addresses
> | | should be checked too: the default pthread stack size may happen to
> | | always produce conflicts, while OpenMP perhaps uses smaller stacks by
> | | default. In that case, making the size passed to alloca(16384) vary
> | | with the worker may avoid the conflict.
> | |
> | | Samuel
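
The per-worker padding idea can be sketched as follows. This is an illustration under assumptions: the function name, the 64-byte step per worker and the counter loop are hypothetical, since the actual test kernel is not shown in this thread.

```c
#include <alloca.h>
#include <stddef.h>

/* Hypothetical worker body: pad the stack by a worker-dependent amount,
 * then place the counter on the stack and increment it `iterations` times. */
static long count_with_padded_stack(int worker_id, long iterations)
{
    /* Assumed scheme: 16384 bytes plus 64 extra bytes per worker id, so
     * that the counters of different workers land at different offsets. */
    size_t pad_size = 16384 + (size_t)worker_id * 64;
    volatile char *pad = alloca(pad_size);
    pad[0] = 0;                    /* touch the padding so it is kept */

    volatile long *counter = alloca(sizeof *counter);
    *counter = 0;
    for (long i = 0; i < iterations; ++i)
        *counter += 1;
    return *counter;
}
```

Each OpenMP thread or StarPU task would call this with its own worker id. Whether a 64-byte step is enough to move the counters into different cache sets depends on the machine, so the step may need tuning.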
> | |
> | _______________________________________________
> | Starpu-devel mailing list
> | Starpu-devel@lists.gforge.inria.fr
> | http://lists.gforge.inria.fr/mailman/listinfo/starpu-devel
> |
>
