Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL


  • From: Olivier Aumage <olivier.aumage@inria.fr>
  • To: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
  • Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, starpu-devel@lists.gforge.inria.fr, Jakub Sistek <jakub.sistek@manchester.ac.uk>
  • Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
  • Date: Mon, 18 Sep 2017 17:43:23 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Dear Mawussi,

Thanks for the tarball and the Google sheet. I will run it to try to
understand what is going on.

Today I managed to run a dpotrf test from Chameleon with native StarPU. I
modified Chameleon to use hbw_malloc() for the matrix. The test was run on
the Frioul machine at CINES
(https://www.cines.fr/le-supercalculateur-frioul/), with the following specs:
- Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (mode quad, flat)
- Intel icc/ifort 17 + mkl 17
- StarPU scheduler: lws
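The allocation change boils down to something like the following sketch
(for illustration only; the names and the DDR fallback are mine, not the
actual Chameleon patch):

#include <stdlib.h>
#include <hbwmalloc.h>  /* memkind's MCDRAM allocator on KNL */

/* Allocate an n x n double matrix in MCDRAM when available,
 * falling back to DDR otherwise. Records which allocator was
 * used so the matching free can be called later. */
static double *alloc_matrix(size_t n, int *in_hbw)
{
    *in_hbw = (hbw_check_available() == 0);  /* 0 means HBM is usable */
    if (*in_hbw)
        return hbw_malloc(n * n * sizeof(double));
    return malloc(n * n * sizeof(double));
}

static void free_matrix(double *a, int in_hbw)
{
    if (in_hbw)
        hbw_free(a);
    else
        free(a);
}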

The best result I got is the following one, using a block size of 424,
reaching about 1.3 TFlop/s for a 20000x20000 matrix:
#---------------
#
# CHAMELEON 0.9.1,
/home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
# Nb threads: 68
# Nb GPUs: 0
# Nb mpi: 1
# PxQ: 1x1
# NB: 424
# IB: 32
# eps: 1.110223e-16
#
#      M      N  K/NRHS  seconds  Gflop/s  Deviation
    2000   2000       1    0.045    59.15       4.64
    4000   4000       1    0.093   229.03       3.86
    6000   6000       1    0.152   472.40       5.69
    8000   8000       1    0.230   742.36       9.10
   10000  10000       1    0.351   950.89       5.99
   12000  12000       1    0.537  1072.95      11.30
   14000  14000       1    0.783  1167.68      10.11
   16000  16000       1    1.108  1232.57       6.83
   18000  18000       1    1.543  1259.95       7.27
   20000  20000       1    2.037  1309.28       6.66
#---------------

The results are highly sensitive to the block size. I also attach a plot
showing the performance for various block sizes. It seems I used smaller
blocks than in your tests. I do not know at this time whether this is the
main explanation for the performance difference we see or not.

I will study the data you sent me and come back to you asap.

Thanks again.
Best regards,
--
Olivier


Attachment: potrf_chameleon_starpu_knl.pdf
Description: Adobe PDF document


> On 18 Sep 2017, at 17:15, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
> wrote:
>
> Dear Olivier,
> Please find attached the tarball of the StarPU version of PLASMA.
> The compilation should be straightforward, and the make.inc can be
> customized.
> The tests are in the directory "test".
> To test dgemm, for example, you can run:
> STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size --nb=block_size
> --test=n
>
> "numactl -m 1" to specify to allocate the memory in the MCDRAM
> "--dim" to specify the size of the problem
> "--nb" to specify the block size
> "--test=n" to disable testing, to save the benchmark time.
>
> In general ./test "routine_name" --help will give you more details.
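> For instance, with hypothetical values (lws scheduler, 10000x10000
> matrix, block size 336):
> STARPU_SCHED=lws numactl -m 1 ./test dgemm --dim=10000 --nb=336 --test=n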
>
> I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
>
> I also shared a Google sheet with you to give you an idea of the optimal
> NBs.
>
> Best regards,
>
> --Mawussi
>
> ________________________________________
> From: Olivier Aumage [olivier.aumage@inria.fr]
> Sent: Monday, September 18, 2017 9:55 AM
> To: Mawussi Zounon
> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>
> Hi Mawussi,
>
> I would like to try to reproduce your results with native StarPU and with
> LAPACK on KNL, to hopefully narrow the search space for possible
> explanations. Is it possible to have a tar of your current native StarPU
> port, with the 'configure' options you use, the environment variables (if
> any), and the testing program?
>
> Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use the
> current version from Netlib or a modified version? Could you give me the
> Makefile settings and environment variables, if any?
>
> For allocating the matrix in MCDRAM, do you use hbw_malloc() from the
> MemKind library, or some other means? Do you get similar, better or worse
> results when the MCDRAM is in 'cache' mode?
>
> Thanks in advance.
> Best regards,
> --
> Olivier
>
>> On 15 Sep 2017, at 11:49, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
>> wrote:
>>
>> Dear Olivier,
>>
>> Thanks for pointing out the difference between KSTAR and native StarPU.
>> I thought KSTAR performed a source-to-source compilation,
>> completely replacing all OpenMP features with StarPU equivalents.
>> But from your explanation, I have the impression that some OpenMP features
>> remain in the executable produced by KSTAR, and that this impacts the
>> behaviour of StarPU.
>> Please, can you provide us with a relevant reference on KSTAR for a
>> better understanding of how it works?
>>
>> The behaviour of the main thread seems a reasonable angle to investigate
>> in order to reduce the performance penalty of the native StarPU code for
>> small matrices.
>>
>> On KNL, when using the MCDRAM, we used to observe some performance drop,
>> even for MKL, beyond certain matrix sizes. We have some potential
>> explanations, but we need further experiments for confirmation.
>> However, even for reasonably small matrices, both the native StarPU and
>> the KSTAR-generated code fail to exploit the 68 cores efficiently, and
>> their performance can even be worse than LAPACK's.
>> I think we should pay close attention to the behaviour of StarPU on the
>> Intel self-hosted KNL.
>>
>> Regarding the question on the choice of the block size, I reported only
>> the auto-tuned results.
>> For each executable (OMP, STARPU, KSTAR) and each matrix size, we perform
>> a sweep over a large "block size" space. In general, for a given matrix
>> size, OMP, STARPU, and KSTAR achieve the highest performance for almost
>> the same "block size".
>> But for some routines, larger "block sizes" (fewer tasks) seem to benefit
>> the native StarPU.
>>
>> If it can help, find attached some results on an Intel Broadwell, a
>> 28-core NUMA node (14 cores per socket).
>> These results are very similar to the ones obtained on the 20-core
>> Haswell: OMP, KSTAR and STARPU have similar asymptotic performance, while
>> the native StarPU is penalized for small matrices.
>> To some extent, this confirms that the KNL results have some serious
>> issues worth investigating.
>>
>> Best regards,
>> --Mawussi
>>
>>
>>
>>
>> ________________________________________
>> From: Olivier Aumage [olivier.aumage@inria.fr]
>> Sent: Thursday, September 14, 2017 2:51 PM
>> To: Mawussi Zounon
>> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek;
>> Negin Bagherpour
>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>
>> Hi Mawussi,
>>
>> The main difference between StarPU used alone and StarPU used through the
>> OpenMP compatibility layer concerns the 'main' thread of the application:
>>
>> - When StarPU is used alone on an N-core machine, it launches N worker
>> threads bound to the N cores. (Thus there is a total of N+1 threads: 1
>> main application thread + N StarPU workers.) The main application thread
>> is only used for submitting tasks to StarPU; it does not execute tasks and
>> it is not bound to a core unless the application binds it explicitly.
>>
>> - When StarPU is used through the OpenMP compatibility layer on an
>> N-core machine, it launches N-1 worker threads bound to N-1 cores. (Thus
>> there is a total of N threads: 1 main application thread + N-1 StarPU
>> workers.) The main application thread is used for submitting tasks _and_
>> for participating in task execution while blocked on some barrier (e.g.
>> omp taskwait, implicit barriers at the end of parallel regions, ...). This
>> behaviour is required for compliance with the OpenMP execution model.
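>>
>> As a minimal sketch of the native model (illustration only, not PLASMA's
>> actual code): the main thread below merely creates and submits tasks,
>> while the workers started by starpu_init() execute them.
>>
>> #include <starpu.h>
>>
>> /* Task body: runs on one of the N StarPU worker threads. */
>> static void task_body(void *buffers[], void *cl_arg)
>> {
>>     (void)buffers; (void)cl_arg;
>> }
>>
>> static struct starpu_codelet cl = {
>>     .cpu_funcs = { task_body },
>>     .nbuffers = 0,
>> };
>>
>> int main(void)
>> {
>>     if (starpu_init(NULL) != 0)       /* starts N workers, one per core */
>>         return 1;
>>     for (int i = 0; i < 1000; i++) {  /* main thread only submits */
>>         struct starpu_task *t = starpu_task_create();
>>         t->cl = &cl;
>>         starpu_task_submit(t);
>>     }
>>     starpu_task_wait_for_all();       /* blocks; does not execute tasks */
>>     starpu_shutdown();
>>     return 0;
>> }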
>>
>> I am not sure whether this difference is the unique cause of the
>> performance mismatch you observed, but it probably accounts for a
>> significant part of it at least, generally in favor of StarPU+OpenMP. This
>> may be the main factor for the difference on small matrices, where the
>> main thread of the StarPU+OpenMP version can quickly lend a hand to the
>> computation, while the main thread of the native StarPU version must
>> first be de-scheduled by the OS kernel to leave its core to a worker
>> thread.
>>
>> On the other hand, the management of tasks for StarPU+OpenMP is more
>> expensive than the management of native StarPU tasks, due to the fact that
>> StarPU+OpenMP tasks may block while native StarPU tasks never block. This
>> management difference is therefore in favor of the native StarPU version.
>> This additional management cost is perhaps more expensive on the KNL,
>> where the cores are much less advanced than regular Intel Xeon cores.
>>
>> I do not know why the GFLOPS drop sharply for dgemm.KNL for matrix sizes
>> of 14000 and above. I could only think of some NUMA issue, but this
>> should not be the case since you say that the matrix is allocated in
>> MCDRAM.
>>
>> I do not know either why the StarPU+OpenMP plot and the native StarPU
>> plot cross for the dpotrf.KNL test case.
>>
>> How do you choose the block size for each test sample? Is it fixed for
>> all matrix sizes or is it computed from the matrix size? Do you observe
>> very different behaviour for other block sizes (e.g. fewer tasks with
>> large blocks, more tasks with small blocks, ...)?
>>
>> Best regards,
>> --
>> Olivier
>>
>>> On 14 Sep 2017, at 11:00, Mawussi Zounon
>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>
>>>
>>> Dear all,
>>>
>>> Recently we developed a new version of the PLASMA library fully based
>>> on the OpenMP task-based runtime system. Our benchmarks on both regular
>>> Intel Xeon (Haswell and Broadwell in the experiments) and Intel KNL
>>> showed that the new OpenMP PLASMA has performance comparable to the old
>>> version based on QUARK.
>>> This motivated us to extend the experiment to StarPU.
>>> To this end, on one hand we used KSTAR to generate a StarPU version of
>>> PLASMA. On the other hand, we developed another version of PLASMA
>>> (restricted to a few routines) based on StarPU.
>>> It is important to note that the algorithms are the same; we simply
>>> replaced the task-based runtime system. Below are our findings:
>>>
>>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
>>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
>>> comparable performance, except for very small matrices, where our
>>> hand-written StarPU version of PLASMA is outperformed by the generic
>>> KSTAR.
>>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
>>> version and KSTAR are significantly slower than OMP. But again, our
>>> KSTAR and StarPU versions exhibited different performance behaviour.
>>> I am wondering whether you can provide us with some hints or guidance
>>> to improve the performance of StarPU on the Intel KNL architecture.
>>> There might be some configuration options I missed. In addition, I would
>>> be happy if you could help us understand why our StarPU version seems
>>> more penalized for small matrices while KSTAR seems to be doing
>>> relatively better.
>>>
>>> Below are some performance charts of dgemm and Cholesky (dpotrf) to
>>> illustrate our observations:
>>> <dgemm_haswell_rutimes.png>
>>>
>>>
>>> <dpotrf_haswell_rutimes.png>
>>>
>>> <dgemm_knl_rutimes.png>
>>>
>>> <dpotrf_knl_rutimes.png>
>>>
>>> For the experiments on KNL, the matrices have been allocated in the
>>> MCDRAM.
>>>
>>> I am looking forward to hearing from you.
>>>
>>> Best regards,
>>>
>>> --Mawussi
>>
>> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
>
> <plasma_starpu.tar>



