Subject: Developers list for StarPU
List archives
- From: Samuel Thibault <samuel.thibault@ens-lyon.org>
- To: Usman Dastgeer <usman.dastgeer@liu.se>
- Cc: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
- Subject: Re: [Starpu-devel] OpenMP timing with StarPU pheft
- Date: Fri, 2 Mar 2012 13:30:31 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Usman Dastgeer, on Thu 01 Mar 2012 18:52:20 +0100, wrote:
> I am trying StarPU OpenMP support with pheft on the given vector_scale
> example. I have a problem with the scaling behavior: it shows really bad
> scaling from 1 to 8 OpenMP threads on our server (fermi).
But AIUI you said in private mails that the exact same example works
fine on another machine?
> starpu_perfmodel_display -s vector_scale_parallel.fermi
> note: loading history from vector_scale_parallel instead of
> vector_scale_parallel.fermi
> performance model for cpu_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.689401e+05 3.754193e+01 30
> performance model for cpu_2_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.703561e+05 1.484666e+03 10
> performance model for cpu_3_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.705295e+05 2.117665e+03 10
> performance model for cpu_4_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.708724e+05 3.354376e+03 10
> performance model for cpu_5_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.712075e+05 4.227176e+03 10
> performance model for cpu_6_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.726255e+05 4.926117e+03 10
> performance model for cpu_7_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.743637e+05 4.627217e+03 10
> performance model for cpu_8_impl_0
> # hash size mean dev n
> 6530e077 8192000 2.727368e+05 1.467062e+03 10
>
> As you can see from the above, using 1 CPU is better than 8 CPUs; at
> least that's what it shows, but it shouldn't be.
So there must be something fishy. Could you check with top that the
program is really making use of more than 1 core?
> To investigate I have added some custom timers in "void
> scal_cpu_func(void *buffers[], void *_args)", using
> ...
> clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &rtime1);
> ... // see attached vector_scale.c for full source code
That only measures CPU time, not wallclock time: CLOCK_THREAD_CPUTIME_ID
returns how much CPU time the thread has consumed. But if there are
binding issues (which I really believe is the problem here), the threads
do not appear slowed down at all: each one simply gets a share of the CPU
time of the single core all the threads are packed on, so per-thread CPU
time stays the same while the overall wallclock time grows.
Could you check in _starpu_bind_thread_on_cpus whether the hwloc binding
function really succeeds? Actually, do you have hwloc support enabled?
That'd explain everything: we don't currently have code for rebinding
FORKJOIN parallel tasks without hwloc. Actually, using combined workers
without hwloc support is generally a bad idea: the combined workers are
then built without any topology information, and thus end up good or bad
depending on how the OS numbers the cores.
Samuel
- [Starpu-devel] OpenMP timing with StarPU pheft, Usman Dastgeer, 01/03/2012
- Re: [Starpu-devel] OpenMP timing with StarPU pheft, Usman Dastgeer, 01/03/2012
- <Possible follow-up(s)>
- Re: [Starpu-devel] OpenMP timing with StarPU pheft, Samuel Thibault, 02/03/2012
- Re: [Starpu-devel] OpenMP timing with StarPU pheft, Usman Dastgeer, 02/03/2012
Archives managed by MHonArc 2.6.19+.