
starpu-devel - Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

Purpose: Developers list for StarPU

List archives

Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL


  • From: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
  • To: Olivier Aumage <olivier.aumage@inria.fr>
  • Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
  • Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
  • Date: Wed, 20 Sep 2017 09:13:17 +0000
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>


Hi Olivier,

Thanks for the additional details.
I also reproduced your numbers with Chameleon,
which confirms that the problem is not related to
the KNL configuration but rather to our PLASMA+StarPU code.

I should say that so far we are not using any performance model,
while Chameleon seems to provide some guidance.
While the priority option is a potential explanation,
I am still confused, because I observed the same low performance
for DGEMM, and DGEMM relies on a single kernel and does not need any priorities.
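
In case it helps the discussion, this is the kind of performance model I would try on our side; a minimal sketch with illustrative names only (not actual PLASMA code):
%--------
#include <starpu.h>

/* Sketch: attach a history-based performance model to a dgemm codelet
 * so the runtime can calibrate and predict kernel durations.
 * "plasma_dgemm" and dgemm_cpu_func are placeholder names. */
void dgemm_cpu_func(void *buffers[], void *cl_arg);

static struct starpu_perfmodel dgemm_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "plasma_dgemm",
};

static struct starpu_codelet dgemm_cl =
{
    .cpu_funcs = { dgemm_cpu_func },
    .nbuffers = 3,
    .modes = { STARPU_R, STARPU_R, STARPU_RW },
    .model = &dgemm_model,
};
%--------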

KSTAR has the same issue on KNL as well, which suggests that both our native StarPU code
and the KSTAR-generated code hit a similar bottleneck. It is all the more curious that on a 20-core Intel Haswell and on a 28-core Intel Broadwell, KSTAR, our native StarPU code, and Chameleon have comparable performance. In addition, our native StarPU code and KSTAR slightly outperform Chameleon for large matrices on the 28-core Intel Broadwell.

 
To sum up, so far our finding is that on regular Intel architectures, all the libraries have similar performance. On Intel KNL, PLASMA+OpenMP and Chameleon (StarPU) provide satisfactory results, while KSTAR and our native PLASMA+StarPU suffer a serious performance penalty. Your further investigation revealed that the penalty might be due to the native PLASMA+StarPU spending twice as much time sleeping as Chameleon (based on StarPU).
Knowing that DGEMM, which does not require any priorities, suffers from the same issue, do you have in mind other potential factors that could cause such an effect?

PS: Your Chameleon experiments are based on "time_dpotrf_tile", a tile Cholesky benchmark that does not include the time for layout conversion (LAPACK layout to tile layout, and conversion back). That is why you can observe higher performance than the OpenMP-based version. For the sake of comparison, "time_dpotrf" should be preferred, but the performance difference is reasonably small.
Thanks for the support.

Best regards,
--Mawussi
 
  


From: Olivier Aumage [olivier.aumage@inria.fr]
Sent: Tuesday, September 19, 2017 3:31 PM
To: Mawussi Zounon
Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

Hi Mawussi,

I have been able to test your port of Plasma over StarPU on the same KNL machine I used with Chameleon+StarPU. I used the numactl method of MCDRAM allocation. The KNL is in flat,quad mode. The configure and environment settings for StarPU were the same as with Chameleon. The StarPU scheduler is 'lws'.

Out of the box, I get the results in the attached 'plasma_starpu.txt' file, which are similar to what you obtained, with a maximum of ~740 GFlop/s (bs=660).
I obtained slightly better results (827 GFlop/s, bs=480) by compiling your Plasma library without the '-fopenmp' flag. However, this is still much below what we should obtain.

Thus, I ran a test with StarPU workers' statistics enabled, with the following environment variables, for N=20000 and BS=420:
export STARPU_PROFILING=1
export STARPU_WORKER_STATS=1
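
(The same per-worker breakdown can also be queried programmatically through StarPU's profiling API; a minimal sketch, assuming profiling was enabled, e.g. via STARPU_PROFILING=1, before the tasks ran:)
%--------
#include <stdio.h>
#include <starpu.h>
#include <starpu_profiling.h>

/* Print per-worker activity, similar to the STARPU_WORKER_STATS output. */
static void print_worker_stats(void)
{
    unsigned w, nworkers = starpu_worker_get_count();
    for (w = 0; w < nworkers; w++)
    {
        struct starpu_profiling_worker_info info;
        if (starpu_profiling_worker_get_info(w, &info) != 0)
            continue; /* profiling disabled */
        double exec_ms  = starpu_timing_timespec_to_us(&info.executing_time) / 1000.0;
        double sleep_ms = starpu_timing_timespec_to_us(&info.sleeping_time) / 1000.0;
        printf("worker %u: %d tasks, executing %.0f ms, sleeping %.0f ms\n",
               w, info.executed_tasks, exec_ms, sleep_ms);
    }
}
%--------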

The results for Plasma/StarPU and Chameleon/StarPU are in the worker_activity_*.txt files attached. You will see that for both libs:
- the workers execute roughly the same number of kernels: ~320 tasks;
- the worker time spent executing is roughly the same: ~1700 ms;
- the worker time spent sleeping for the Plasma+StarPU execution (~950 ms) is slightly more than 2x the time spent sleeping for the Chameleon+StarPU execution (~380 ms).

Thus, this strongly suggests that the Plasma+StarPU execution suffers from lack of parallelism. This lack of parallelism is likely due to the lack of priorities to guide the execution over the critical path.
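
If that is indeed the cause, passing priorities at submission time should help. A minimal, hedged sketch (the codelet and handle names are placeholders, not taken from your code):
%--------
#include <starpu.h>

/* Sketch: submit a critical-path task (e.g. the panel factorization)
 * with maximum priority so schedulers such as 'lws' favor it. */
extern struct starpu_codelet dpotrf_cl;   /* placeholder codelet */
extern starpu_data_handle_t tile_handle;  /* placeholder handle  */

void submit_panel_task(void)
{
    int ret = starpu_task_insert(&dpotrf_cl,
                                 STARPU_PRIORITY, STARPU_MAX_PRIO,
                                 STARPU_RW, tile_handle,
                                 0);
    STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_insert");
}
%--------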

Best regards,
--
Olivier




> On 19 Sep 2017, at 10:16, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>
> Dear Mawussi,
>
> I was curious to check the impact of "numactl -m 1" versus hbw_malloc() for StarPU. I used hbw_malloc() only for allocating the matrix, while "numactl -m 1" puts every data structure (even StarPU's task queues, data handles and synchronization objects) into the MCDRAM. Since the MCDRAM has a higher bandwidth but also a higher latency, I did not know whether the benefit of the higher bandwidth would be offset by higher latency costs on the synchronization objects.
>
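> For reference, the matrix-only MCDRAM allocation amounts to something like the following sketch (illustrative, not the exact Chameleon patch):
> %--------
> #include <stdlib.h>
> #include <hbwmalloc.h>   /* MemKind's high-bandwidth memory allocator */
>
> /* Allocate an n x n matrix of doubles in MCDRAM when available,
>  * falling back to regular DDR otherwise. */
> static double *alloc_matrix(size_t n)
> {
>     size_t bytes = n * n * sizeof(double);
>     if (hbw_check_available() == 0)   /* 0 means MCDRAM is usable */
>         return hbw_malloc(bytes);
>     return malloc(bytes);
> }
> %--------
>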
> It turns out that the global "numactl -m 1" approach gives better results than the matrix-only hbw_malloc() approach. The best numactl result I obtained (with block size 448) is almost 200 GFlop/s higher than the best result with the matrix-only hbw_malloc():
>
> %----------------
> # CHAMELEON 0.9.1, /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
> # Nb threads: 68
> # Nb GPUs:    0
> # Nb mpi:     1
> # PxQ:        1x1
> # NB:         448
> # IB:         32
> # eps:        1.110223e-16
> #
> #     M       N  K/NRHS   seconds   Gflop/s Deviation
>   2000    2000       1     0.047     56.44 +-   2.11 
>   4000    4000       1     0.101    211.59 +-   2.88 
>   6000    6000       1     0.158    455.82 +-   5.43 
>   8000    8000       1     0.222    770.25 +-   7.23 
>  10000   10000       1     0.317   1052.57 +-  21.74 
>  12000   12000       1     0.460   1251.44 +-  17.22 
>  14000   14000       1     0.676   1353.75 +-  20.59 
>  16000   16000       1     0.962   1420.32 +-  16.61 
>  18000   18000       1     1.329   1462.70 +-  21.28 
>  20000   20000       1     1.789   1490.79 +-   9.21 
> %----------------
>
> Here are the test settings:
>
> - libhwloc:
> . version 1.11.7
> . no specific settings
>
> - StarPU:
> . Version: Subversion repository, branch trunk/, revision r22030
> . Compiler: Intel 17
> . Configure flags (I give a snippet from my GNU Bash script):
> %--------
> declare -a cfg
> cfg+=("--enable-shared")
> cfg+=("--disable-cuda")
> cfg+=("--disable-opencl")
> cfg+=("--disable-socl")
> cfg+=("--without-fxt")
> cfg+=("--disable-debug")
> cfg+=("--enable-fast")
> cfg+=("--disable-verbose")
> cfg+=("--disable-gcc-extensions")
> cfg+=("--disable-mpi-check")
> cfg+=("--disable-starpu-top")
> cfg+=("--disable-starpufft")
> cfg+=("--disable-build-doc")
> cfg+=("--disable-openmp")
> cfg+=("--disable-fortran")
> cfg+=("--disable-build-tests")
> cfg+=("--disable-build-examples")
> cfg+=("--enable-mpi")
> cfg+=("--enable-blas-lib=none")
> cfg+=("--disable-mlr")
> cfg+=("--enable-maxcpus=72")
> $STARPU_SRC_DIR/configure --prefix=$STARPU_INSTALL_DIR "${cfg[@]}"
> %--------
>
> - Chameleon settings:
> . compiler / mkl: Intel 17
> . cmake flags:
> cmake \
>        -DCHAMELEON_ENABLE_EXAMPLE=OFF \
>        -DBLAS_VERBOSE=ON \
>        -DCHAMELEON_USE_CUDA=OFF \
>        -DCHAMELEON_USE_MPI=ON \
>        -DCHAMELEON_SIMULATION=OFF \
>        -DCHAMELEON_SCHED_STARPU=ON \
>        -DCMAKE_INSTALL_PREFIX=$HOME/Linalg/install-ch \
>        -DCMAKE_C_COMPILER=icc \
>        -DCMAKE_CXX_COMPILER=icpc \
>        -DCMAKE_Fortran_COMPILER=ifort \
>        -DCMAKE_BUILD_TYPE=Release \
>        ../chameleon.git
>
> - Launch settings:
> STARPU_NCPU=68 STARPU_SCHED=lws numactl -m 1 ./install-ch/lib/chameleon/timing/time_dpotrf_tile  -N 2000:20000:2000 -b 448 --niter 10
>
> Best regards,
> --
> Olivier
>
>> On 18 Sep 2017, at 23:09, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>
>> Dear Olivier,
>>
>> Thanks for taking the time to run the experiments.
>> It is quite comforting to see your results; they are quite
>> close to the numbers I got when using OpenMP.
>>
>> I re-ran the tests using the same nb as you,
>> but my performance did not improve.
>>
>> I have noticed two main differences:
>>
>>       •  I am using "numactl -m 1" while you are allocating the memory via hbw_malloc(). From my experiments, both ways are equivalent in terms of performance.
>>       •  I have just noticed that my KNL is currently configured in hybrid mode: 8 GB allocatable and 8 GB in cache mode. I will push to have the machine set in flat mode and then run the experiment again.
>> Did you use any specific installation flag when installing StarPU on KNL?
>>
>> Best regards,
>> --Mawussi
>>
>> From: Olivier Aumage [olivier.aumage@inria.fr]
>> Sent: Monday, September 18, 2017 4:43 PM
>> To: Mawussi Zounon
>> Cc: Samuel Thibault; Jakub Sistek; Negin Bagherpour; starpu-devel@lists.gforge.inria.fr
>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>
>> Dear Mawussi,
>>
>> Thanks for the tarball and the Google sheet. I will run it to try to understand what is going on.
>>
>> Today I managed to run a dpotrf test from Chameleon with native StarPU. I modified Chameleon to use hbw_malloc() for the matrix. The test was run on the machine Frioul from CINES (https://www.cines.fr/le-supercalculateur-frioul/), with the following specs:
>> - Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (mode quad, flat)
>> - Intel icc/ifort 17 + mkl 17
>> - StarPU scheduler: lws
>>
>> The best result I got is the following one, using a block size of 424, reaching about 1.3 TFlop/s for 20000x20000:
>> #---------------
>> #
>> # CHAMELEON 0.9.1, /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>> # Nb threads: 68
>> # Nb GPUs:    0
>> # Nb mpi:     1
>> # PxQ:        1x1
>> # NB:         424
>> # IB:         32
>> # eps:        1.110223e-16
>> #
>> #     M       N  K/NRHS   seconds   Gflop/s Deviation
>>   2000    2000       1     0.045     59.15  4.64
>>   4000    4000       1     0.093    229.03  3.86
>>   6000    6000       1     0.152    472.40  5.69
>>   8000    8000       1     0.230    742.36  9.10
>>  10000   10000       1     0.351    950.89  5.99
>>  12000   12000       1     0.537   1072.95 11.30
>>  14000   14000       1     0.783   1167.68 10.11
>>  16000   16000       1     1.108   1232.57  6.83
>>  18000   18000       1     1.543   1259.95  7.27
>>  20000   20000       1     2.037   1309.28  6.66
>> #---------------
>>
>> The results are highly sensitive to the block size. I also attach a plot showing the performance for various block sizes. It seems I used smaller blocks than in your tests. I do not know at this time whether this is the main explanation for the performance difference we see or not.
>>
>> I will study the data you sent me and come back to you asap.
>>
>> Thanks again.
>> Best regards,
>> --
>> Olivier
>>
>>
>>
>>> On 18 Sep 2017, at 17:15, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>
>>> Dear Olivier,
>>> Please find attached the tarball of the StarPU version of PLASMA.
>>> The compilation should be straightforward, but the make.inc can be customized.
>>> The tests are in the directory "test".
>>> To test dgemm, for example, you can run:
>>> STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size --nb=block_size --test=n
>>>
>>> "numactl -m 1"  to specify to allocate the memory in the MCDRAM
>>> "--dim" to specify the size of the problem
>>> "--nb" to specify the block size
>>> "--test=n" to disable testing, to save the benchmark time.
>>>
>>> In general, ./test "routine_name" --help will give you more details.
>>>
>>> I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
>>>
>>> I also shared a Google sheet with you to give an idea of the optimal NBs.
>>>
>>> Best regards,
>>>
>>> --Mawussi
>>>
>>> ________________________________________
>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>> Sent: Monday, September 18, 2017 9:55 AM
>>> To: Mawussi Zounon
>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>
>>> Hi Mawussi,
>>>
>>> I would like to try to reproduce your results with native StarPU and with LAPACK on KNL, to hopefully reduce a little bit the search space for possible explanations. Is it possible to have a tar of your current native StarPU port, with the 'configure' options you use, the environment variables (if any), and the testing program ?
>>>
>>> Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use the current version on Netlib or is it a modified version? Could you give me the Makefile settings and environment variables if any?
>>>
>>> For allocating the matrix in MCDRAM, do you use hbw_malloc() from the MemKind library, or do you use some other means? Do you get similar, better or worse results when the MCDRAM is in 'cache' mode ?
>>>
>>> Thanks in advance.
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>> On 15 Sep 2017, at 11:49, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>>
>>>> Dear Olivier,
>>>>
>>>> Thanks for pointing out the difference between KSTAR and native StarPU.
>>>> I thought KSTAR performed a source-to-source compilation,
>>>> completely replacing all OpenMP features with StarPU equivalents.
>>>> But from your explanation, I have the impression that some OpenMP features
>>>> remain in the executable produced by KSTAR, and this impacts the behaviour of StarPU.
>>>> Could you provide us with a relevant reference on KSTAR for a better understanding
>>>> of how it works?
>>>>
>>>> The behaviour of the main thread seems a reasonable avenue to investigate to reduce the performance penalty
>>>> of the native StarPU code for small matrices.
>>>>
>>>> On KNL, when using the MCDRAM, we used to observe some performance drops, even for MKL,
>>>> from certain matrix sizes. We have some potential explanations, but we need further experiments for confirmation.
>>>> However, even for reasonably small matrices, both the native StarPU and the KSTAR-generated code fail
>>>> to exploit the 68 cores efficiently, and their performance can even be worse than LAPACK.
>>>> I think we should pay close attention to the behaviour of StarPU on the Intel self-hosted KNL.
>>>>
>>>> Regarding the question on the choice of the block size, I reported only the auto-tuned results.
>>>> For each executable (OMP, STARPU, KSTAR) and each matrix size, we perform
>>>> a sweep over a large block-size space. In general, for a given matrix size,
>>>> OMP, STARPU, and KSTAR achieve their highest performance for almost the same block size.
>>>> But for some routines, larger block sizes (fewer tasks) seem to benefit the native StarPU.
>>>>
>>>> If it helps, please find attached some results on an Intel Broadwell, a 28-core NUMA node (14 cores per socket).
>>>> These results are very similar to those obtained on the 20-core Haswell. OMP, KSTAR and STARPU
>>>> have similar asymptotic performance, while the native StarPU is penalized for small matrices.
>>>> To some extent, this confirms that the results on KNL have some serious issue and are worth
>>>> investigating.
>>>>
>>>> Best regards,
>>>> --Mawussi
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________
>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>> Sent: Thursday, September 14, 2017 2:51 PM
>>>> To: Mawussi Zounon
>>>> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek; Negin Bagherpour
>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>
>>>> Hi Mawussi,
>>>>
>>>> The main difference between StarPU used alone and StarPU used through the OpenMP compatibility layer concerns the 'main' thread of the application:
>>>>
>>>> - When StarPU is used alone on an N-core machine, it launches N worker threads bound to the N cores. (Thus there is a total of N+1 threads: 1 main application thread + N StarPU workers.) The main application thread is only used for submitting tasks to StarPU; it does not execute tasks, and it is not bound to a core unless the application binds it explicitly.
>>>>
>>>> - When StarPU is used through the OpenMP compatibility layer on an N-core machine, it launches N-1 worker threads bound to N-1 cores. (Thus there is a total of N threads: 1 main application thread + N-1 StarPU workers.) The main application thread is used for submitting tasks _and_ for participating in task execution while blocked on some barrier (e.g. omp taskwait, implicit barriers at the end of parallel regions, ...). This behaviour is required for compliance with the OpenMP execution model.
>>>>
>>>> I am not sure whether this difference is the unique cause of the performance mismatch you observed, but it probably accounts for a significant part of it, generally in favor of StarPU+OpenMP. It may be the main factor for the difference on small matrices, where the main thread of the StarPU+OpenMP version can quickly join the computation, while the main thread of the native StarPU version must first be de-scheduled by the OS kernel to leave its core to a worker thread.
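>>>>
>>>> (Concretely, the native pattern is as in the following minimal sketch: the main thread only submits tasks and then blocks in starpu_task_wait_for_all(), leaving execution entirely to the workers.)
>>>> %--------
>>>> #include <starpu.h>
>>>>
>>>> static void noop(void *buffers[], void *cl_arg)
>>>> { (void)buffers; (void)cl_arg; }
>>>>
>>>> static struct starpu_codelet cl = { .cpu_funcs = { noop }, .nbuffers = 0 };
>>>>
>>>> int main(void)
>>>> {
>>>>     int i;
>>>>     if (starpu_init(NULL) != 0) return 1;
>>>>     for (i = 0; i < 1024; i++)
>>>>         starpu_task_insert(&cl, 0);  /* main thread: submission only */
>>>>     starpu_task_wait_for_all();      /* main thread blocks; workers execute */
>>>>     starpu_shutdown();
>>>>     return 0;
>>>> }
>>>> %--------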
>>>>
>>>> On the other hand, the management of tasks for StarPU+OpenMP is more expensive than the management of native StarPU tasks, because StarPU+OpenMP tasks may block while native StarPU tasks never block. This management difference is therefore in favor of the native StarPU version. The additional management cost is perhaps more pronounced on the KNL, where the cores are much less advanced than regular Intel Xeon cores.
>>>>
>>>> I do not know why the GFlop/s drop sharply for dgemm.KNL for matrix size >= 14000. I could only think of some NUMA issue, but that should not be the case since you say that the matrix is allocated in MCDRAM.
>>>>
>>>> I do not know either why the StarPU+OpenMP plot and the native StarPU plot cross for the dpotrf.KNL test case.
>>>>
>>>> How do you choose the block size for each test sample? Is it fixed for all matrix sizes, or is it computed from the matrix size? Do you observe very different behaviour for other block sizes (e.g. fewer tasks with large blocks, more tasks with small blocks, ...)?
>>>>
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>>> On 14 Sep 2017, at 11:00, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>
>>>>>
>>>>> Dear all,
>>>>>
>>>>> Recently we developed a new version of the PLASMA library fully based on the OpenMP task-based
>>>>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell and Broadwell in this experiment) and Intel KNL showed that the new OpenMP PLASMA has performance comparable to the old QUARK-based version. This motivated us to extend the experiment to StarPU.
>>>>> To this end, on the one hand we used KSTAR to generate a StarPU version of PLASMA. On the other hand, we developed another version of PLASMA (restricted to a few routines) based on StarPU.
>>>>> It is important to note that the algorithms are the same; we simply replaced the task-based runtime system. Below are our findings:
>>>>>
>>>>>    • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP), PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have comparable performance, except for very small matrices, where our hand-written StarPU version of PLASMA is outperformed by the generic KSTAR.
>>>>>    • On the Intel self-hosted KNL (68 cores), both our own STARPU version and KSTAR are significantly slower than OMP. But again, our KSTAR and our StarPU versions exhibited different performance behaviour.
>>>>> I am wondering whether you can provide us with some hints or guidance to improve the performance of StarPU on the Intel KNL architecture. There might be some configuration options I missed. In addition, I would be happy if you could help us understand why our StarPU version seems more penalized for small matrices while KSTAR seems to be doing relatively better.
>>>>>
>>>>> Below some performance charts of dgemm and Cholesky (dpotrf) to illustrate our observations:
>>>>> <dgemm_haswell_rutimes.png>
>>>>>
>>>>>
>>>>> <dpotrf_haswell_rutimes.png>
>>>>>
>>>>> <dgemm_knl_rutimes.png>
>>>>>
>>>>> <dpotrf_knl_rutimes.png>
>>>>>
>>>>> For the experiments on KNL, the matrices have been allocated in the MCDRAM.
>>>>>
>>>>> I am looking forward to hearing from you.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --Mawussi
>>>>
>>>> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
>>>
>>> <plasma_starpu.tar>
>
> _______________________________________________
> Starpu-devel mailing list
> Starpu-devel@lists.gforge.inria.fr
> https://lists.gforge.inria.fr/mailman/listinfo/starpu-devel



