
starpu-devel - Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

Subject: Developers list for StarPU

List archives

Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL


  • From: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
  • To: Olivier Aumage <olivier.aumage@inria.fr>
  • Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
  • Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
  • Date: Thu, 21 Sep 2017 20:31:52 +0000
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Thanks Olivier,

Just one last question, out of curiosity :-).
Why is the initialization phase of StarPU on KNL so expensive while it
seems negligible on Broadwell and Haswell?

Best regards,
--Mawussi


The two performance results I showed for dgemm allow one to estimate that the
StarPU initialization+shutdown time takes about 2 seconds on the KNL, while the
dgemm computation for 16000x16000 matrices takes about 5 seconds. Thus, having
the starpu_init() + starpu_shutdown() inside the timed portion of the test
increases the measured time by nearly 50%.

A StarPU session is started by a starpu_init() and closed by a
starpu_shutdown():
- If you call starpu_init() several times in a row _before_ calling
starpu_shutdown(), only the first one actually performs the initialization,
while the others are no-ops.
- If you call starpu_init() _after_ starpu_shutdown(), you get a full
initialization again, because all the StarPU data structures and worker
threads have been destroyed by starpu_shutdown(). In fact, the purpose of
starpu_shutdown() is precisely to destroy these structures.

In most cases, you want a StarPU session to last for the whole duration of the
execution. This is why you should have only one starpu_init() call early in the
PLASMA library initialization, and only one starpu_shutdown() call when
PLASMA is being finalized.
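
For illustration only (this sketch is not from the original message), here is
roughly what a single StarPU session wrapped in the PLASMA entry points
discussed above could look like; the PLASMA-side context setup is elided and
starpu_init(NULL) simply uses the default configuration:

%--------
#include <starpu.h>

/* One StarPU session for the whole PLASMA run: workers are created once,
 * in plasma_init(), and destroyed once, in plasma_finalize(). */
int plasma_init(void)
{
    /* NULL: default configuration (honours STARPU_NCPU, STARPU_SCHED, ...). */
    int ret = starpu_init(NULL);
    if (ret != 0)
        return ret;
    /* ... remaining PLASMA context initialization ... */
    return 0;
}

void plasma_finalize(void)
{
    /* ... remaining PLASMA context cleanup ... */
    /* Make sure no task is still pending, then destroy the worker threads
     * and StarPU's internal data structures. */
    starpu_task_wait_for_all();
    starpu_shutdown();
}
%--------

Called this way, the compute routines only submit tasks, and the ~2-second KNL
start-up cost stays outside the timed portion of the benchmark.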

Best regards,
--
Olivier

> On 21 Sep 2017, at 18:05, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
> wrote:
>
> Hi Olivier,
>
> Thanks for spotting the issue.
> After the modification, i.e., moving the starpu_init() and starpu_shutdown()
> calls into plasma_init() and plasma_finalize() respectively, I can observe
> performance comparable to the OpenMP version (even better) on all the
> architectures (Haswell, Broadwell, and KNL) in a quick benchmark.
>
> The problem is solved, but it is still confusing, since in our first
> version the starpu_init() and starpu_shutdown() functions are called once
> in each high-level function. In fact, the compute routines do not call
> starpu_init() and starpu_shutdown(). Only the highest-level functions do,
> and this is done once per session.
> For example, a user is more likely to make a call to plasma_zgemm(), which
> contains the StarPU initialization, and this initialization will be done
> once during the whole execution. At least in the case of DGEMM and DPOTRF
> there is only one StarPU initialization per session, unless I misunderstood
> the concept of session here.
>
> I will inform you once all the results are updated.
>
> Thanks
>
> --Mawussi
> ________________________________________
> From: Olivier Aumage [olivier.aumage@inria.fr]
> Sent: Thursday, September 21, 2017 1:20 PM
> To: Mawussi Zounon
> Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
> Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel
> self-hosted KNL
>
> Hi Mawussi,
>
> I finally found the source of the performance mismatch between
> PLASMA/StarPU and Chameleon/StarPU.
>
> I realized that you have put the calls to starpu_init() and
> starpu_shutdown() in all the 'compute_*' routines. These calls should be
> performed only once per session, for instance as part of
> plasma_init()/plasma_finalize(), especially on the KNL where a lot of
> workers have to be prepared for computation.
>
>
> Here is the best result I get with your version of PLASMA/StarPU's DGEMM:
> Status  Error  Time    Gflop/s    transA  transB  m      n      k      nb   alpha         beta          padA  padB  padC
>   --     --    7.0201  1166.9288  n       n       16000  16000  16000  860  1.23 + 2.35i  6.79 + 7.89i  0     0     0
>
> Here is the best result I get with PLASMA/StarPU's DGEMM, when doing
> starpu_init()/starpu_shutdown() once, as part of
> plasma_init()/plasma_finalize():
> Status  Error  Time    Gflop/s    transA  transB  m      n      k      nb   alpha         beta          padA  padB  padC
>   --     --    5.0897  1609.5128  n       n       16000  16000  16000  880  1.23 + 2.35i  6.79 + 7.89i  0     0     0
>
> The second result is similar to what I get with Chameleon/StarPU's DGEMM.
>
> Best regards,
> --
> Olivier
>
>
>> On 21 Sep 2017, at 11:38, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
>> wrote:
>>
>> Thank you all for the discussion and the investigations.
>> It has been very helpful.
>>
>> We will turn to the Chameleon team to learn more from them.
>> In the meantime, it would also be great for the KSTAR team to perform
>> some benchmarks on KNL to help improve the library.
>>
>> Best regards,
>> --Mawussi
>> ________________________________________
>> From: Olivier Aumage [olivier.aumage@inria.fr]
>> Sent: Wednesday, September 20, 2017 11:14 AM
>> To: Mawussi Zounon
>> Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
>> Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel
>> self-hosted KNL
>>
>> Hi Mawussi,
>>
>> Here are the results I get with Chameleon/StarPU using time_dpotrf instead
>> of time_dpotrf_tile:
>> %-------------------
>> # CHAMELEON 0.9.1,
>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf
>> # Nb threads: 68
>> # Nb GPUs: 0
>> # NB: 480
>> # IB: 32
>> # eps: 1.110223e-16
>> #
>> # M N K/NRHS seconds Gflop/s Deviation
>> 2000 2000 1 0.058 45.64 +- 0.47
>> 4000 4000 1 0.119 179.89 +- 2.83
>> 6000 6000 1 0.194 370.70 +- 5.30
>> 8000 8000 1 0.273 624.49 +- 3.36
>> 10000 10000 1 0.397 839.48 +- 11.78
>> 12000 12000 1 0.570 1011.82 +- 17.20
>> 14000 14000 1 0.805 1137.05 +- 8.43
>> 16000 16000 1 1.147 1190.62 +- 26.26
>> 18000 18000 1 1.571 1237.49 +- 13.79
>> 20000 20000 1 2.089 1276.88 +- 2.41
>> %-------------------
>>
>> The performance models do not apply here, because they are only used on
>> heterogeneous platforms to decide whether to map tasks on the main CPU or
>> on some GPU. Here, the KNL is seen as a set of homogeneous nodes.
>>
>> You should perhaps discuss with the Chameleon developers to see if they have
>> an idea about what might differ between your Plasma implementation and the
>> Chameleon implementation. I do not know the Chameleon implementation
>> details.
>>
>> Otherwise, you can try to generate FXT traces from the native StarPU
>> execution and output the Gantt diagram of the execution to see what the
>> work/sleep pattern of workers looks like.
>>
>> Is there any significant difference in the task submission order or pattern
>> between Plasma and Chameleon? Do you submit all tasks at once, or do you
>> have several submit/task_wait phases?
>>
>> Best regards,
>> --
>> Olivier
>>
>>> On 20 Sep 2017, at 11:13, Mawussi Zounon
>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>
>>>
>>> Hi Olivier,
>>>
>>> Thanks for the additional details.
>>> I also reproduced your numbers with Chameleon,
>>> which confirms that the problem is not related to
>>> the KNL configuration but rather to our PLASMA+StarPU code.
>>>
>>> I should say that so far we are not using any performance model,
>>> while Chameleon seems to provide some guidance.
>>> While the priority option may be a potential explanation,
>>> I am still confused, because I observed the same low-performance
>>> issue for DGEMM, and DGEMM relies on a single kernel and does not need any
>>> priorities.
>>>
>>> KSTAR has the same issue on KNL as well, which suggests that both our
>>> native StarPU code
>>> and KSTAR have a similar bottleneck. It is even more curious to realize
>>> that on a 20-core Intel Haswell and on a 28-core Intel Broadwell,
>>> KSTAR, our native StarPU code and Chameleon have comparable performance.
>>> In addition, our native StarPU code and KSTAR slightly outperform Chameleon
>>> for large matrices on the 28-core Intel Broadwell.
>>>
>>>
>>> To sum up, so far our finding is that on regular Intel architectures, all
>>> the libraries have similar performance. On Intel KNL, PLASMA+OpenMP and
>>> Chameleon (StarPU) provide satisfactory results, while KSTAR and
>>> our native PLASMA+StarPU suffer from a serious performance penalty. Your
>>> further investigation revealed that the penalty might be due to the fact
>>> that the native PLASMA+StarPU version spends about twice as much time
>>> sleeping as Chameleon (based on StarPU).
>>> Knowing that DGEMM, which does not require any priorities, suffers from the
>>> same issue, do you have in mind other potential factors that may cause such
>>> an effect?
>>>
>>> PS: Your Chameleon experiments are based on "time_dpotrf_tile", which is
>>> a tile Cholesky kernel that does not include the time for layout
>>> conversion (LAPACK layout to tile layout, and conversion back). That is
>>> why you can observe a performance higher than the OpenMP-based version.
>>> For the sake of comparison, "time_dpotrf" should be preferred. But the
>>> performance difference is reasonably small.
>>> Thanks for the support.
>>>
>>> Best regards,
>>> --Mawussi
>>>
>>>
>>>
>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>> Sent: Tuesday, September 19, 2017 3:31 PM
>>> To: Mawussi Zounon
>>> Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
>>> Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel
>>> self-hosted KNL
>>>
>>> Hi Mawussi,
>>>
>>> I have been able to test your port of Plasma over StarPU on the same KNL
>>> machine I used with Chameleon+StarPU. I used the numactl method of MCDRAM
>>> allocation. The KNL is in flat,quad mode. The configure and environment
>>> settings for StarPU were the same as with Chameleon. The StarPU scheduler
>>> is 'lws'.
>>>
>>> Out of the box, I get the results in the attached 'plasma_starpu.txt' file,
>>> which are similar to what you obtained, with a maximum of ~740
>>> GFlop/s (bs=660).
>>> I obtained slightly better results (827 GFlop/s, bs=480) by compiling
>>> your Plasma library without the '-fopenmp' flag. However, this is still
>>> much below what we should obtain.
>>>
>>> Thus, I ran a test with StarPU workers' statistics enabled, with the
>>> following environment variables, for N=20000 and BS=420:
>>> export STARPU_PROFILING=1
>>> export STARPU_WORKER_STATS=1
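
(For reference, and not part of the original test: the same per-worker numbers
can also be queried from within the application through StarPU's profiling
interface; the sketch below assumes the starpu_profiling_worker_get_info() API
and field names as documented for the StarPU 1.2/1.3 series, with profiling
enabled beforehand, e.g. via STARPU_PROFILING=1.)

%--------
#include <stdio.h>
#include <starpu.h>
#include <starpu_profiling.h>

/* Print per-worker activity: number of executed tasks, executing time and
 * sleeping time, i.e. the counters behind STARPU_WORKER_STATS=1. */
static void print_worker_activity(void)
{
    unsigned nworkers = starpu_worker_get_count();
    unsigned w;
    for (w = 0; w < nworkers; w++)
    {
        struct starpu_profiling_worker_info info;
        if (starpu_profiling_worker_get_info((int)w, &info) != 0)
            continue; /* profiling not enabled */
        double exec_ms  = starpu_timing_timespec_to_us(&info.executing_time) / 1000.0;
        double sleep_ms = starpu_timing_timespec_to_us(&info.sleeping_time) / 1000.0;
        printf("worker %u: %d tasks, executing %.0f ms, sleeping %.0f ms\n",
               w, info.executed_tasks, exec_ms, sleep_ms);
    }
}
%--------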
>>>
>>> The results for Plasma/StarPU and Chameleon/StarPU are in the
>>> worker_activity_*.txt files attached. You will see that for both libs:
>>> - the workers execute roughly the same number of kernels: ~320 tasks;
>>> - the worker time spent executing is roughly the same: ~ 1700 ms;
>>> - the worker time spent sleeping for the Plasma+StarPU execution (~950
>>> ms) is slightly more than 2x the time spent sleeping for the
>>> Chameleon+StarPU execution (~380ms).
>>>
>>> Thus, this strongly suggests that the Plasma+StarPU execution suffers
>>> from lack of parallelism. This lack of parallelism is likely due to the
>>> lack of priorities to guide the execution over the critical path.
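
(Editorial illustration of the priority hint mentioned above, not code from
this thread: with StarPU's task-insertion helper, a task on the critical path
can be given a high priority via the STARPU_PRIORITY argument; the codelet
below is a minimal placeholder for a tile Cholesky panel kernel.)

%--------
#include <starpu.h>

/* Placeholder CPU implementation of the panel factorization kernel. */
static void dpotrf_cpu_func(void *buffers[], void *cl_arg)
{
    (void)buffers; (void)cl_arg;
    /* ... call the LAPACK dpotrf kernel on the tile here ... */
}

static struct starpu_codelet cl_dpotrf =
{
    .cpu_funcs = { dpotrf_cpu_func },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

/* Submit the diagonal-tile factorization with maximum priority, so that
 * priority-aware schedulers start critical-path tasks as early as possible. */
static void submit_panel_task(starpu_data_handle_t tile_kk)
{
    starpu_task_insert(&cl_dpotrf,
                       STARPU_RW, tile_kk,
                       STARPU_PRIORITY, STARPU_MAX_PRIO,
                       0);
}
%--------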
>>>
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>
>>>
>>>
>>>> On 19 Sep 2017, at 10:16, Olivier Aumage <olivier.aumage@inria.fr>
>>>> wrote:
>>>>
>>>> Dear Mawussi,
>>>>
>>>> I was curious to check the impact of "numactl -m 1" versus hbw_malloc()
>>>> for StarPU. I used hbw_malloc() only for allocating the matrix, while
>>>> "numactl -m 1" puts every data structure (even StarPU's task queues,
>>>> data handles and synchronization objects) into the MCDRAM. Since the
>>>> MCDRAM has a higher bandwidth but also a higher latency, I did not know
>>>> whether the benefit of the higher bandwidth would be offset by
>>>> higher latency costs on synchronization objects.
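
(For readers unfamiliar with the two approaches being compared, here is an
editorial sketch, not taken from the Chameleon patch, of the matrix-only
allocation with hbw_malloc() from memkind's hbwmalloc interface:)

%--------
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */

int main(void)
{
    size_t n = 20000;
    size_t bytes = n * n * sizeof(double);

    /* hbw_check_available() returns 0 when high-bandwidth memory (MCDRAM)
     * can actually be allocated from. */
    if (hbw_check_available() != 0)
    {
        fprintf(stderr, "no high-bandwidth memory available\n");
        return 1;
    }

    /* Only the matrix goes to MCDRAM; task queues, data handles and
     * synchronization objects stay in DDR, unlike the global
     * "numactl -m 1" approach, which moves everything to MCDRAM. */
    double *a = hbw_malloc(bytes);
    if (a == NULL)
        return 1;

    /* ... register the matrix with the runtime and run the benchmark ... */

    hbw_free(a);
    return 0;
}
%--------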
>>>>
>>>> It turns out that the global "numactl -m 1" approach gives better
>>>> results than the matrix-only hbw_malloc() approach. The best numactl
>>>> result I obtained (with block size 448) is almost 200 GFlop/s higher
>>>> than the best result with the matrix-only hbw_malloc():
>>>>
>>>> %----------------
>>>> # CHAMELEON 0.9.1,
>>>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>>>> # Nb threads: 68
>>>> # Nb GPUs: 0
>>>> # Nb mpi: 1
>>>> # PxQ: 1x1
>>>> # NB: 448
>>>> # IB: 32
>>>> # eps: 1.110223e-16
>>>> #
>>>> # M N K/NRHS seconds Gflop/s Deviation
>>>> 2000 2000 1 0.047 56.44 +- 2.11
>>>> 4000 4000 1 0.101 211.59 +- 2.88
>>>> 6000 6000 1 0.158 455.82 +- 5.43
>>>> 8000 8000 1 0.222 770.25 +- 7.23
>>>> 10000 10000 1 0.317 1052.57 +- 21.74
>>>> 12000 12000 1 0.460 1251.44 +- 17.22
>>>> 14000 14000 1 0.676 1353.75 +- 20.59
>>>> 16000 16000 1 0.962 1420.32 +- 16.61
>>>> 18000 18000 1 1.329 1462.70 +- 21.28
>>>> 20000 20000 1 1.789 1490.79 +- 9.21
>>>> %----------------
>>>>
>>>> Here are the test settings:
>>>>
>>>> - libhwloc:
>>>> . version 1.11.7
>>>> . no specific settings
>>>>
>>>> - StarPU:
>>>> . Version: Subversion repository, branch trunk/, revision r22030
>>>> . Compiler: Intel 17
>>>> . Configure flags (I give a snippet from my GNU Bash script):
>>>> %--------
>>>> declare -a cfg
>>>> cfg+=("--enable-shared")
>>>> cfg+=("--disable-cuda")
>>>> cfg+=("--disable-opencl")
>>>> cfg+=("--disable-socl")
>>>> cfg+=("--without-fxt")
>>>> cfg+=("--disable-debug")
>>>> cfg+=("--enable-fast")
>>>> cfg+=("--disable-verbose")
>>>> cfg+=("--disable-gcc-extensions")
>>>> cfg+=("--disable-mpi-check")
>>>> cfg+=("--disable-starpu-top")
>>>> cfg+=("--disable-starpufft")
>>>> cfg+=("--disable-build-doc")
>>>> cfg+=("--disable-openmp")
>>>> cfg+=("--disable-fortran")
>>>> cfg+=("--disable-build-tests")
>>>> cfg+=("--disable-build-examples")
>>>> cfg+=("--enable-mpi")
>>>> cfg+=("--enable-blas-lib=none")
>>>> cfg+=("--disable-mlr")
>>>> cfg+=("--enable-maxcpus=72")
>>>> $STARPU_SRC_DIR/configure --prefix=$STARPU_INSTALL_DIR "${cfg[@]}"
>>>> %--------
>>>>
>>>> - Chameleon settings:
>>>> . compiler / mkl: Intel 17
>>>> . cmake flags:
>>>> cmake \
>>>> -DCHAMELEON_ENABLE_EXAMPLE=OFF \
>>>> -DBLAS_VERBOSE=ON \
>>>> -DCHAMELEON_USE_CUDA=OFF \
>>>> -DCHAMELEON_USE_MPI=ON \
>>>> -DCHAMELEON_SIMULATION=OFF \
>>>> -DCHAMELEON_SCHED_STARPU=ON \
>>>> -DCMAKE_INSTALL_PREFIX=$HOME/Linalg/install-ch \
>>>> -DCMAKE_C_COMPILER=icc \
>>>> -DCMAKE_CXX_COMPILER=icpc \
>>>> -DCMAKE_Fortran_COMPILER=ifort \
>>>> -DCMAKE_BUILD_TYPE=Release \
>>>> ../chameleon.git
>>>>
>>>> - Launch settings:
>>>> STARPU_NCPU=68 STARPU_SCHED=lws numactl -m 1
>>>> ./install-ch/lib/chameleon/timing/time_dpotrf_tile -N 2000:20000:2000
>>>> -b 448 --niter 10
>>>>
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>>> On 18 Sep 2017, at 23:09, Mawussi Zounon
>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>
>>>>> Dear Olivier,
>>>>>
>>>>> Thanks for taking the time to run the experiments.
>>>>> It is quite comforting to see your results; they are quite
>>>>> close to the numbers I got when using OpenMP.
>>>>>
>>>>> I re-ran the tests using the same nb as you,
>>>>> but my performance did not improve.
>>>>>
>>>>> I have noticed two main differences:
>>>>>
>>>>> • I am using "numactl -m 1" while you are allocating the memory
>>>>> via hbw_malloc(). From my experiments, both ways are equivalent in
>>>>> terms of performance.
>>>>> • I have just noticed that my KNL is currently configured in hybrid
>>>>> mode: 8GB allocatable and 8GB in cache mode. I will advocate setting the
>>>>> machine in flat mode and then performing the experiment again.
>>>>> Did you use any specific installation flags when installing StarPU on
>>>>> KNL?
>>>>>
>>>>> Best regards,
>>>>> --Mawussi
>>>>>
>>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>>> Sent: Monday, September 18, 2017 4:43 PM
>>>>> To: Mawussi Zounon
>>>>> Cc: Samuel Thibault; Jakub Sistek; Negin Bagherpour;
>>>>> starpu-devel@lists.gforge.inria.fr
>>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>>
>>>>> Dear Mawussi,
>>>>>
>>>>> Thanks for the tarball and the Google sheet. I will run it to try to
>>>>> understand what is going on.
>>>>>
>>>>> Today I managed to run a dpotrf test from Chameleon with native StarPU.
>>>>> I modified Chameleon to use hbw_malloc() for the matrix. The test was
>>>>> run on the machine Frioul from CINES
>>>>> (https://www.cines.fr/le-supercalculateur-frioul/), with the following
>>>>> specs:
>>>>> - Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (mode quad, flat)
>>>>> - Intel icc/ifort 17 + mkl 17
>>>>> - StarPU scheduler: lws
>>>>>
>>>>> The best result I got is the following one, using a block size of 424,
>>>>> reaching about 1.3 TFlop/s for 20000x20000:
>>>>> #---------------
>>>>> #
>>>>> # CHAMELEON 0.9.1,
>>>>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>>>>> # Nb threads: 68
>>>>> # Nb GPUs: 0
>>>>> # Nb mpi: 1
>>>>> # PxQ: 1x1
>>>>> # NB: 424
>>>>> # IB: 32
>>>>> # eps: 1.110223e-16
>>>>> #
>>>>> # M N K/NRHS seconds Gflop/s Deviation
>>>>> 2000 2000 1 0.045 59.15 4.64
>>>>> 4000 4000 1 0.093 229.03 3.86
>>>>> 6000 6000 1 0.152 472.40 5.69
>>>>> 8000 8000 1 0.230 742.36 9.10
>>>>> 10000 10000 1 0.351 950.89 5.99
>>>>> 12000 12000 1 0.537 1072.95 11.30
>>>>> 14000 14000 1 0.783 1167.68 10.11
>>>>> 16000 16000 1 1.108 1232.57 6.83
>>>>> 18000 18000 1 1.543 1259.95 7.27
>>>>> 20000 20000 1 2.037 1309.28 6.66
>>>>> #---------------
>>>>>
>>>>> The results are highly sensitive to the block size. I also attach a
>>>>> plot showing the performance for various block sizes. It seems I used
>>>>> smaller blocks than in your tests. I do not know at this time whether
>>>>> this is the main explanation for the performance difference we see or
>>>>> not.
>>>>>
>>>>> I will study the data you sent me and come back to you asap.
>>>>>
>>>>> Thanks again.
>>>>> Best regards,
>>>>> --
>>>>> Olivier
>>>>>
>>>>>
>>>>>
>>>>>> On 18 Sep 2017, at 17:15, Mawussi Zounon
>>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>>
>>>>>> Dear Olivier,
>>>>>> Please find attached the tarball of the StarPU version of PLASMA.
>>>>>> The compilation should be straightforward, but the make.inc can be
>>>>>> customized.
>>>>>> The tests are in the directory "test".
>>>>>> To test dgemm, for example, you can run:
>>>>>> STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size
>>>>>> --nb=block_size --test=n
>>>>>>
>>>>>> "numactl -m 1" to specify to allocate the memory in the MCDRAM
>>>>>> "--dim" to specify the size of the problem
>>>>>> "--nb" to specify the block size
>>>>>> "--test=n" to disable testing, to save the benchmark time.
>>>>>>
>>>>>> In general ./test "routine_name" --help will give you more details.
>>>>>>
>>>>>> I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
>>>>>>
>>>>>> I also shared a Google sheet with you to give you an idea of the optimal
>>>>>> NBs.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> --Mawussi
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>>>> Sent: Monday, September 18, 2017 9:55 AM
>>>>>> To: Mawussi Zounon
>>>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>>>
>>>>>> Hi Mawussi,
>>>>>>
>>>>>> I would like to try to reproduce your results with native StarPU and
>>>>>> with LAPACK on KNL, to hopefully reduce a little bit the search space
>>>>>> for possible explanations. Is it possible to have a tar of your
>>>>>> current native StarPU port, with the 'configure' options you use, the
>>>>>> environment variables (if any), and the testing program?
>>>>>>
>>>>>> Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use
>>>>>> the current version on Netlib or is it a modified version? Could you
>>>>>> give me the Makefile settings and environment variables if any?
>>>>>>
>>>>>> For allocating the matrix in MCDRAM, do you use hbw_malloc() from the
>>>>>> MemKind library, or do you use some other means? Do you get similar,
>>>>>> better or worse results when the MCDRAM is in 'cache' mode?
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Best regards,
>>>>>> --
>>>>>> Olivier
>>>>>>
>>>>>>> On 15 Sep 2017, at 11:49, Mawussi Zounon
>>>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>>>
>>>>>>> Dear Olivier,
>>>>>>>
>>>>>>> Thanks for pointing out the difference between KSTAR and native StarPU.
>>>>>>> I thought KSTAR performed a source-to-source compilation,
>>>>>>> completely replacing all OpenMP features with StarPU equivalents.
>>>>>>> But from your explanation, I have the impression that some OpenMP
>>>>>>> features
>>>>>>> remain in the executable produced by KSTAR, and that this impacts the
>>>>>>> behaviour of StarPU.
>>>>>>> Could you please provide us with a relevant reference on KSTAR for
>>>>>>> a better understanding
>>>>>>> of how it works?
>>>>>>>
>>>>>>> The behaviour of the main thread seems a reasonable avenue to
>>>>>>> investigate in order to reduce the performance penalty
>>>>>>> of the native StarPU code for small matrices.
>>>>>>>
>>>>>>> On KNL, when using the MCDRAM, we used to observe some performance
>>>>>>> drop, even for MKL,
>>>>>>> beyond certain matrix sizes. We have some potential explanations but we
>>>>>>> need further experiments for confirmation.
>>>>>>> However, even for reasonably small matrices, both the native
>>>>>>> StarPU code and the KSTAR-generated code fail
>>>>>>> to exploit the 68 cores efficiently, and their performance can even
>>>>>>> be worse than LAPACK's.
>>>>>>> I think we should pay close attention to the behaviour of StarPU on
>>>>>>> the Intel self-hosted KNL.
>>>>>>>
>>>>>>> Regarding the question on the choice of the block size, I reported
>>>>>>> only the auto-tuned results.
>>>>>>> For each executable (OMP, STARPU, KSTAR) and each matrix size, we
>>>>>>> perform
>>>>>>> a sweep over a large "block size" space. In general, for a given
>>>>>>> matrix size,
>>>>>>> OMP, STARPU, and KSTAR achieve their highest performance for almost the
>>>>>>> same "block size".
>>>>>>> But for some routines, larger "block sizes" (fewer tasks) seem to
>>>>>>> benefit the native StarPU version.
>>>>>>>
>>>>>>> If it can help, find attached some results on an Intel Broadwell, a
>>>>>>> 28-core NUMA node (14 cores per socket).
>>>>>>> These results are very similar to the ones obtained on the 20-core
>>>>>>> Haswell. OMP, KSTAR and STARPU
>>>>>>> have similar asymptotic performance, while the native StarPU version is
>>>>>>> penalized for small matrices.
>>>>>>> To some extent, this confirms that the results on KNL have some
>>>>>>> serious issues and that they are worth
>>>>>>> investigating.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> --Mawussi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>>>>> Sent: Thursday, September 14, 2017 2:51 PM
>>>>>>> To: Mawussi Zounon
>>>>>>> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub
>>>>>>> Sistek; Negin Bagherpour
>>>>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>>>>
>>>>>>> Hi Mawussi,
>>>>>>>
>>>>>>> The main difference between StarPU used alone and StarPU used through
>>>>>>> the OpenMP compatibility layer concerns the 'main' thread of the
>>>>>>> application:
>>>>>>>
>>>>>>> - When StarPU is used alone on an N-core machine, it launches N worker
>>>>>>> threads bound to the N cores. (Thus there is a total of N+1 threads:
>>>>>>> 1 main application thread + N StarPU workers.) The main application
>>>>>>> thread is only used for submitting tasks to StarPU; it does not
>>>>>>> execute tasks and it is not bound to a core unless the application
>>>>>>> binds it explicitly.
>>>>>>>
>>>>>>> - When StarPU is used through the OpenMP compatibility layer on an
>>>>>>> N-core machine, it launches N-1 worker threads bound to N-1 cores.
>>>>>>> (Thus there is a total of N threads: 1 main application thread + N-1
>>>>>>> StarPU workers.) The main application thread is used for submitting
>>>>>>> tasks _and_ for participating in task execution while blocked on some
>>>>>>> barrier (e.g. omp taskwait, implicit barriers at the end of parallel
>>>>>>> regions, ...). This behaviour is required for compliance with the
>>>>>>> OpenMP execution model.
>>>>>>>
>>>>>>> I am not sure whether this difference is the unique cause of the
>>>>>>> performance mismatch you observed, but it probably accounts for a
>>>>>>> significant part of it at least, generally in favor of StarPU+OpenMP.
>>>>>>> This may be the main factor for the difference on small matrices,
>>>>>>> where the main thread of the StarPU+OpenMP version can quickly lend
>>>>>>> a hand to the computation, while the main thread of the native
>>>>>>> StarPU version must first be de-scheduled by the OS kernel to leave
>>>>>>> its core to a worker thread.
>>>>>>>
>>>>>>> On the other hand, the management of tasks for StarPU+OpenMP is more
>>>>>>> expensive than the management of StarPU native tasks, due to the fact
>>>>>>> that StarPU+OpenMP tasks may block while native StarPU tasks never
>>>>>>> block. This management difference is therefore in favor of the native
>>>>>>> StarPU version. This additional management cost is perhaps more
>>>>>>> expensive on the KNL where the cores are much less advanced than the
>>>>>>> regular Intel Xeon cores.
>>>>>>>
>>>>>>> I do not know why the GFLOPS drop sharply for dgemm.KNL for matrix
>>>>>>> sizes >= 14000. I could only think of some NUMA issue, but this
>>>>>>> should not be the case since you say that the matrix is allocated in
>>>>>>> MCDRAM.
>>>>>>>
>>>>>>> I do not know either why the StarPU+OpenMP plot and the native StarPU
>>>>>>> plot cross for the dpotrf.KNL test case.
>>>>>>>
>>>>>>> How do you choose the block size for each test sample? Is it fixed
>>>>>>> for all matrix sizes or is it computed from the matrix size? Do you
>>>>>>> observe very different behaviour for other block sizes (e.g. fewer
>>>>>>> tasks on large blocks, more tasks on small blocks, ...)?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> --
>>>>>>> Olivier
>>>>>>>
>>>>>>>> On 14 Sep 2017, at 11:00, Mawussi Zounon
>>>>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> Recently we developed a new version of the PLASMA library fully
>>>>>>>> based on the OpenMP task-based
>>>>>>>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell
>>>>>>>> and Broadwell in this experiment) and Intel KNL showed that the new
>>>>>>>> OpenMP PLASMA has performance comparable to the old version based
>>>>>>>> on QUARK. This motivated us to extend the experiment to StarPU.
>>>>>>>> To this end, on the one hand we used KSTAR to generate a StarPU
>>>>>>>> version of PLASMA; on the other hand, we developed another version of
>>>>>>>> PLASMA (restricted to a few routines) based on StarPU.
>>>>>>>> It is important to note that the algorithms are the same; we simply
>>>>>>>> replaced the task-based runtime system. Below are our findings:
>>>>>>>>
>>>>>>>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
>>>>>>>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
>>>>>>>> comparable performance, except for very small matrices, where
>>>>>>>> our hand-written StarPU version of PLASMA is outperformed by the
>>>>>>>> generic KSTAR.
>>>>>>>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
>>>>>>>> version and KSTAR are significantly slower than OMP. But again,
>>>>>>>> our KSTAR and StarPU versions exhibited different performance
>>>>>>>> behaviour.
>>>>>>>> I am wondering whether you can provide us with some hints or
>>>>>>>> guidance to improve the performance of StarPU on the Intel KNL
>>>>>>>> architecture. There might be some configuration options I missed. In
>>>>>>>> addition, I would be happy if you could help us understand why our
>>>>>>>> StarPU version seems more penalized for small matrices while
>>>>>>>> KSTAR seems to do relatively better.
>>>>>>>>
>>>>>>>> Below are some performance charts of dgemm and Cholesky (dpotrf) to
>>>>>>>> illustrate our observations:
>>>>>>>> <dgemm_haswell_rutimes.png>
>>>>>>>>
>>>>>>>>
>>>>>>>> <dpotrf_haswell_rutimes.png>
>>>>>>>>
>>>>>>>> <dgemm_knl_rutimes.png>
>>>>>>>>
>>>>>>>> <dpotrf_knl_rutimes.png>
>>>>>>>>
>>>>>>>> For the experiments on KNL, the matrices have been allocated in the
>>>>>>>> MCDRAM.
>>>>>>>>
>>>>>>>> I am looking forward to hearing from you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> --Mawussi
>>>>>>>
>>>>>>> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
>>>>>>
>>>>>> <plasma_starpu.tar>
>>>>
>>>> _______________________________________________
>>>> Starpu-devel mailing list
>>>> Starpu-devel@lists.gforge.inria.fr
>>>> https://lists.gforge.inria.fr/mailman/listinfo/starpu-devel
>>
>




