Subject: Developers list for StarPU
List archives
- From: Olivier Aumage <olivier.aumage@inria.fr>
- To: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
- Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
- Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
- Date: Fri, 15 Sep 2017 12:19:25 +0200
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi Mawussi,
A quick answer regarding KStar: Here is a TPDS paper involving the compiler.
KStar is indeed a source-to-source compiler. It generates calls to the
functions defined in STARPU/src/util/openmp_runtime_support.c. These
functions either directly map to native StarPU calls, or implement additional
semantic requirements for compliance with the OpenMP specification. Some of
these requirements differ in important ways from the native StarPU semantics,
as described in my previous email.
I will try to get some more information here about the KNL issue.
Best regards,
--
Olivier
Attachment:
07912335.pdf
Description: Adobe PDF document
> On 15 Sep 2017, at 11:49, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
> wrote:
>
> Dear Olivier,
>
> Thanks for pointing out the difference between KSTAR and native StarPU.
> I thought KSTAR performed a source-to-source compilation
> that completely replaces all OpenMP features with StarPU equivalents.
> But from your explanation, I have the impression that some OpenMP features
> remain in the executable produced by KSTAR, and that this affects the
> behaviour of StarPU.
> Could you please provide us with a relevant reference on KSTAR for a
> better understanding
> of how it works?
>
> The behaviour of the main thread seems a reasonable avenue to investigate
> to reduce the performance penalty
> of the native StarPU code for small matrices.
>
> On KNL, when using the MCDRAM, we have observed a performance drop
> even for MKL
> beyond certain matrix sizes. We have some potential explanations but we need
> further experiments for confirmation.
> However, even for reasonably small matrices both the native StarPU and
> the KSTAR-generated code fail
> to exploit the 68 cores efficiently, and their performance can even be
> worse than LAPACK.
> I think we should pay close attention to the behaviour of StarPU on the
> Intel self-hosted KNL.
>
> Regarding the question on the choice of the block size, I reported only the
> auto-tuned results.
> For each executable (OMP, STARPU, KSTAR), and each matrix size, we perform
> a sweep over a large "block size" space. In general, for a given matrix
> size,
> OMP, STARPU, and KSTAR achieve the highest performance for almost the same
> "block size".
> But for some routines, larger "block sizes" (fewer tasks) seem to benefit
> the native StarPU.
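
To make the link between block size and task count concrete, here is a minimal sketch. It assumes the usual tiled formulation (one gemm task per tile triple for dgemm; potrf, trsm, syrk and gemm tasks for Cholesky), which may not match every detail of the PLASMA routines measured in this thread. The function names and the `measure` callback are hypothetical and only serve the illustration.

```python
def tiles_per_dim(n, nb):
    """Number of tiles along one dimension of an n x n matrix with tile size nb."""
    return (n + nb - 1) // nb  # integer ceiling division


def dgemm_tasks(n, nb):
    """Tiled dgemm: one gemm task per (i, j, k) tile triple (assumed formulation)."""
    nt = tiles_per_dim(n, nb)
    return nt ** 3


def dpotrf_tasks(n, nb):
    """Tiled Cholesky: nt potrf + nt(nt-1)/2 trsm + nt(nt-1)/2 syrk
    + nt(nt-1)(nt-2)/6 gemm tasks, which sums to nt(nt+1)(nt+2)/6."""
    nt = tiles_per_dim(n, nb)
    return nt * (nt + 1) * (nt + 2) // 6


def best_block_size(candidates, measure):
    """Sweep over candidate block sizes and keep the one with the highest
    measured performance. `measure` is a hypothetical callback that runs
    the benchmark for one block size and returns, for example, GFLOPS."""
    return max(candidates, key=measure)


if __name__ == "__main__":
    n = 4000
    for nb in (250, 500, 1000, 2000):
        print(f"nb={nb}: dgemm tasks={dgemm_tasks(n, nb)}, "
              f"dpotrf tasks={dpotrf_tasks(n, nb)}")
```

Doubling the block size roughly divides the task count by eight for both routines, so any per-task runtime overhead weighs much less on larger blocks.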
>
> In case it helps, please find attached some results on an Intel Broadwell, a 28-core
> NUMA node (14 cores per socket).
> These results are very similar to those obtained on the 20-core Haswell.
> OMP, KSTAR and STARPU
> have similar asymptotic performance, while the native StarPU is penalized
> for small matrices.
> To some extent, this confirms that the results on KNL have some serious
> issues that are worth
> investigating.
>
> Best regards,
> --Mawussi
>
>
>
>
> ________________________________________
> From: Olivier Aumage [olivier.aumage@inria.fr]
> Sent: Thursday, September 14, 2017 2:51 PM
> To: Mawussi Zounon
> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek;
> Negin Bagherpour
> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>
> Hi Mawussi,
>
> The main difference between StarPU used alone and StarPU used through the
> OpenMP compatibility layer concerns the 'main' thread of the application:
>
> - When StarPU is used alone on an N-core machine, it launches N worker
> threads bound to the N cores. (Thus there is a total of N+1 threads: 1 main
> application thread + N StarPU workers.) The main application thread is only
> used for submitting tasks to StarPU; it does not execute tasks and it is
> not bound to a core unless the application binds it explicitly.
>
> - When StarPU is used through the OpenMP compatibility layer on an N-core
> machine, it launches N-1 worker threads bound to N-1 cores. (Thus there is a total
> of N threads: 1 main application thread + N-1 StarPU workers.) The main
> application thread is used for submitting tasks _and_ for participating in
> task execution while blocked on some barrier (e.g. omp taskwait, implicit
> barriers at the end of parallel regions, ...). This behaviour is required for
> compliance with the OpenMP execution model.
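
The two thread layouts described above can be summarized with a small model. This is only a sketch of the counting argument and does not call StarPU. The `ThreadLayout` type and the function names are hypothetical. The value 68 is the KNL core count mentioned elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadLayout:
    workers: int               # StarPU worker threads bound to cores
    total_threads: int         # workers + the main application thread
    main_executes_tasks: bool  # does the main thread help run tasks?


def thread_layout(n_cores, openmp_layer):
    """Thread layout on an n_cores machine, as described in the email."""
    if openmp_layer:
        # N-1 workers; the main thread also executes tasks at barriers.
        return ThreadLayout(n_cores - 1, n_cores, True)
    # N workers; the main thread only submits tasks.
    return ThreadLayout(n_cores, n_cores + 1, False)


def task_executing_threads(layout):
    """Threads that may execute tasks under this layout."""
    return layout.workers + (1 if layout.main_executes_tasks else 0)


if __name__ == "__main__":
    for openmp_layer in (False, True):
        layout = thread_layout(68, openmp_layer)
        print(openmp_layer, layout, task_executing_threads(layout))
```

In both modes up to N threads can execute tasks. The difference is that the native mode has N+1 threads competing for N cores, while the OpenMP layer keeps exactly N.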
>
> I am not sure whether this difference is the sole cause of the
> performance mismatch you observed, but it probably accounts for a
> significant part at least, generally in favor of StarPU+OpenMP. This
> may be the main factor for the difference on small matrices, where the main
> thread of the StarPU+OpenMP version can quickly lend a hand to the
> computation, while the main thread of the native StarPU version must first
> be de-scheduled by the OS kernel to leave its core to a worker thread.
>
> On the other hand, the management of tasks for StarPU+OpenMP is more
> expensive than the management of StarPU native tasks, due to the fact that
> StarPU+OpenMP tasks may block while native StarPU tasks never block. This
> management difference is therefore in favor of the native StarPU version.
> This additional management cost is perhaps higher on the KNL, where
> the cores are much less advanced than the regular Intel Xeon cores.
>
> I do not know why the GFLOPS drop sharply for dgemm.KNL for matrix size >=
> 14000. I can only think of some NUMA issue, but this should not be the
> case since you say that the matrix is allocated in MCDRAM.
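
As a rough sanity check on the memory side, the footprint of the dgemm operands can be estimated as below. This is a sketch under stated assumptions: three square double-precision matrices (A, B, C), no extra workspace, and an MCDRAM capacity of 16 GiB. The capacity is an assumption about the machine, not a figure given in this thread.

```python
BYTES_PER_DOUBLE = 8

# Assumed capacity, not stated in the thread: 16 GiB of MCDRAM.
ASSUMED_MCDRAM_BYTES = 16 * 2**30


def dgemm_footprint_bytes(n, matrices=3):
    """Bytes for the A, B and C operands of an n x n double-precision dgemm."""
    return matrices * n * n * BYTES_PER_DOUBLE


def fits_in(n, capacity_bytes):
    """True if the three operands fit in the given capacity."""
    return dgemm_footprint_bytes(n) <= capacity_bytes


if __name__ == "__main__":
    for n in (10000, 14000, 20000):
        gib = dgemm_footprint_bytes(n) / 2**30
        print(f"n={n}: {gib:.2f} GiB, "
              f"fits={fits_in(n, ASSUMED_MCDRAM_BYTES)}")
```

At n = 14000 the three operands take about 4.4 GiB, well under the assumed capacity, so under these assumptions the drop is unlikely to be explained by the operands alone overflowing MCDRAM.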
>
> I do not know either why the StarPU+OpenMP plot and the native StarPU plot
> cross for the dpotrf.KNL test case.
>
> How do you choose the block size for each test sample? Is it fixed for all
> matrix sizes or is it computed from the matrix size? Do you observe very
> different behaviour for other block sizes (e.g. fewer tasks on large
> blocks, more tasks on small blocks, ...)?
>
> Best regards,
> --
> Olivier
>
>> On 14 Sep 2017, at 11:00, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
>> wrote:
>>
>>
>> Dear all,
>>
>> Recently we developed a new version of the PLASMA library fully based on
>> the OpenMP task-based
>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell and
>> Broadwell in the experiment) and Intel KNL showed that the new OpenMP
>> PLASMA has performance comparable to the old version based on QUARK.
>> This motivated us to extend the experiment to StarPU.
>> To this end, on the one hand we used KSTAR to generate a StarPU version of
>> PLASMA. On the other hand we developed another version of PLASMA (restricted
>> to a few routines) based on StarPU.
>> It is important to note that the algorithms are the same; we simply
>> replaced the task-based runtime system. Below are our findings:
>>
>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
>> comparable performance, except for very small matrices, where our
>> hand-written StarPU version of PLASMA is outperformed by the generic KSTAR.
>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
>> version and KSTAR are significantly slower than OMP. But again the
>> KSTAR version and our StarPU version exhibited different performance behaviour.
>>
>> I am wondering whether you can provide us with some hints or guidance to
>> improve the performance of StarPU on the Intel KNL architecture. There
>> might be some configuration options I missed. In addition, I would be happy
>> if you could help us understand why our StarPU version seems more
>> penalized for small matrices while KSTAR seems to be doing relatively
>> better.
>>
>> Below are some performance charts of dgemm and Cholesky (dpotrf) to illustrate
>> our observations:
>> <dgemm_haswell_rutimes.png>
>>
>>
>> <dpotrf_haswell_rutimes.png>
>>
>> <dgemm_knl_rutimes.png>
>>
>> <dpotrf_knl_rutimes.png>
>>
>> For the experiments on KNL, the matrices have been allocated in the MCDRAM.
>>
>> I am looking forward to hearing from you.
>>
>> Best regards,
>>
>> --Mawussi
>
> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>