
starpu-devel - Re: [Starpu-devel] OpenMP timing with StarPU pheft

  • From: Samuel Thibault <samuel.thibault@ens-lyon.org>
  • To: Usman Dastgeer <usman.dastgeer@liu.se>
  • Cc: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
  • Subject: Re: [Starpu-devel] OpenMP timing with StarPU pheft
  • Date: Fri, 2 Mar 2012 13:30:31 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Usman Dastgeer, on Thu 01 Mar 2012 18:52:20 +0100, wrote:
> I am trying StarPU OpenMP support with pheft on vector_scale example given. I
> have some problem with scaling behavior as it shows really bad scaling from 1
> to 8 OpenMP threads on our server (fermi).

But AIUI you said in private mails that the exact same example works
fine on another machine?

> starpu_perfmodel_display -s vector_scale_parallel.fermi
> note: loading history from vector_scale_parallel instead of
> vector_scale_parallel.fermi
> performance model for cpu_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.689401e+05 3.754193e+01 30
> performance model for cpu_2_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.703561e+05 1.484666e+03 10
> performance model for cpu_3_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.705295e+05 2.117665e+03 10
> performance model for cpu_4_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.708724e+05 3.354376e+03 10
> performance model for cpu_5_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.712075e+05 4.227176e+03 10
> performance model for cpu_6_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.726255e+05 4.926117e+03 10
> performance model for cpu_7_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.743637e+05 4.627217e+03 10
> performance model for cpu_8_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.727368e+05 1.467062e+03 10
>
> As you can see above, using 1 CPU is better than 8 CPUs; at least
> that's how it shows, but it shouldn't be.

So there must be something fishy. Could you check with top that the
program is really making use of more than 1 core?
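To put a number on "fishy", here is a minimal sketch that turns two means from the quoted model into a speedup and a parallel efficiency. The function names are hypothetical and not part of StarPU; the only inputs are the means printed by starpu_perfmodel_display above.

```c
/* Hypothetical helpers (not StarPU API): compare the mean of an
 * n-worker implementation against the single-worker mean. */

/* Speedup of the n-worker run relative to the 1-worker run.
 * A value below 1.0 means the parallel run is slower. */
double speedup(double mean_1, double mean_n)
{
    return mean_1 / mean_n;
}

/* Fraction of the ideal n-fold speedup that was actually achieved. */
double parallel_efficiency(double mean_1, double mean_n, int n)
{
    return speedup(mean_1, mean_n) / (double)n;
}
```

With the quoted means for cpu_impl_0 (2.689401e+05) and cpu_8_impl_0 (2.727368e+05), the speedup is about 0.99 and the efficiency about 0.12, i.e. eight threads buy nothing at all.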

> To investigate I have added some custom timers in "void
> scal_cpu_func(void *buffers[], void *_args)", using
> ...
> clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &rtime1);
> ... // see attached vector_scale.c for full source code

That only measures CPU time, not wallclock time: CLOCK_THREAD_CPUTIME_ID
returns how much CPU time the thread has consumed. But if there are
binding issues (which I really believe is the problem here), the threads
are not slowed down in terms of CPU time; each one just gets a share of
the single core they all run on, and overall the time stays the same.

Could you check in _starpu_bind_thread_on_cpus whether the hwloc binding
function really succeeds? Actually, do you have hwloc support enabled?
That'd explain everything: we don't currently have code for rebinding
FORKJOIN parallel tasks without hwloc. Actually, using combined workers
without hwloc support is generally a bad idea, since combined workers
are then built without any topology information, and thus turn out good
or bad depending on the OS numbering.
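One way to see from inside the program where a thread is allowed to run is to read its affinity mask. The sketch below is Linux-specific and uses sched_getaffinity from glibc; it is neither StarPU nor hwloc code, and the helper names are hypothetical. Called from each OpenMP thread of the codelet, it would show whether all threads are confined to the same CPU.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Number of CPUs the calling thread may run on, or -1 on error.
 * Assumes the machine has no more CPUs than cpu_set_t can hold. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    /* A pid of 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* 1 if the calling thread may run on the given CPU index,
 * 0 if it may not, -1 on error or an out-of-range index. */
int allowed_on_cpu(int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_ISSET(cpu, &set) ? 1 : 0;
}
```

If every thread of an 8-way task reports a count of 1 and the same CPU index, the threads are time-sharing one core, which would match the flat means in the model above.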

Samuel