starpu-devel - Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

  • From: Olivier Aumage <olivier.aumage@inria.fr>
  • To: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
  • Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
  • Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
  • Date: Thu, 21 Sep 2017 14:20:09 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi Mawussi,

I finally found the source of the performance mismatch between PLASMA/StarPU
and Chameleon/StarPU.

I realized that you have put the calls to starpu_init() and starpu_shutdown()
in all the 'compute_*' routines. These calls should be performed only once
per session, for instance as part of plasma_init/plasma_finalize, especially
on the KNL where a lot of workers have to be prepared for computation.
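
For reference, here is a minimal sketch of what I mean (illustrative only, not
the actual PLASMA sources; the guard flag is hypothetical):

%--------
/* Initialize and shut down the StarPU runtime once per session, from
 * plasma_init()/plasma_finalize(), instead of in every compute_* routine. */
#include <starpu.h>

static int plasma_runtime_up = 0;   /* hypothetical guard flag */

int plasma_init(void)
{
    if (plasma_runtime_up)
        return 0;
    int ret = starpu_init(NULL);    /* spawns and binds all workers once */
    if (ret != 0)
        return ret;
    plasma_runtime_up = 1;
    return 0;
}

int plasma_finalize(void)
{
    if (plasma_runtime_up) {
        starpu_shutdown();          /* joins the workers once, at session end */
        plasma_runtime_up = 0;
    }
    return 0;
}
%--------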


Here is the best result I get with your version of PLASMA/StarPU's DGEMM:
Status  Error  Time    Gflop/s    transA  transB  m      n      k      nb   alpha         beta          padA  padB  padC
--      --     7.0201  1166.9288  n       n       16000  16000  16000  860  1.23 + 2.35i  6.79 + 7.89i  0     0     0

Here is the best result I get with PLASMA/StarPU's DGEMM, when doing
starpu_init()/starpu_shutdown() once, as part of
plasma_init()/plasma_finalize():
Status  Error  Time    Gflop/s    transA  transB  m      n      k      nb   alpha         beta          padA  padB  padC
--      --     5.0897  1609.5128  n       n       16000  16000  16000  880  1.23 + 2.35i  6.79 + 7.89i  0     0     0

The second result is similar to what I get with Chameleon/StarPU's DGEMM.

Best regards,
--
Olivier


> On 21 Sep 2017, at 11:38, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
> wrote:
>
> Thank you all for the discussion and the investigations.
> It has been very helpful.
>
> We will turn to the Chameleon team to learn more from them.
> In the meantime, it would also be great for the KSTAR team to perform
> some benchmarks on KNL to help improve the library.
>
> Best regards,
> --Mawussi
> ________________________________________
> From: Olivier Aumage [olivier.aumage@inria.fr]
> Sent: Wednesday, September 20, 2017 11:14 AM
> To: Mawussi Zounon
> Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
> Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel
> self-hosted KNL
>
> Hi Mawussi,
>
> Here are the results I get with Chameleon/StarPU using time_dpotrf instead
> of time_dpotrf_tile:
> %-------------------
> # CHAMELEON 0.9.1,
> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf
> # Nb threads: 68
> # Nb GPUs: 0
> # NB: 480
> # IB: 32
> # eps: 1.110223e-16
> #
> # M N K/NRHS seconds Gflop/s Deviation
> 2000 2000 1 0.058 45.64 +- 0.47
> 4000 4000 1 0.119 179.89 +- 2.83
> 6000 6000 1 0.194 370.70 +- 5.30
> 8000 8000 1 0.273 624.49 +- 3.36
> 10000 10000 1 0.397 839.48 +- 11.78
> 12000 12000 1 0.570 1011.82 +- 17.20
> 14000 14000 1 0.805 1137.05 +- 8.43
> 16000 16000 1 1.147 1190.62 +- 26.26
> 18000 18000 1 1.571 1237.49 +- 13.79
> 20000 20000 1 2.089 1276.88 +- 2.41
> %-------------------
>
> The performance models do not apply here, because they are only used on
> heterogeneous platforms to decide whether to map tasks on the main CPU or on
> some GPU. Here, the KNL is seen as a homogeneous set of cores.
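>
> To make this concrete, here is roughly what a performance model attached to
> a codelet looks like (a sketch; the symbol and kernel names are illustrative,
> not Chameleon's actual code):
>
> %--------
> #include <starpu.h>
>
> void dgemm_cpu_func(void *buffers[], void *cl_arg)
> {
>     (void)buffers; (void)cl_arg;      /* call the BLAS tile kernel here */
> }
>
> static struct starpu_perfmodel dgemm_model = {
>     .type   = STARPU_HISTORY_BASED,   /* calibrated from past executions */
>     .symbol = "dgemm_tile",           /* key of the calibration history */
> };
>
> static struct starpu_codelet dgemm_cl = {
>     .cpu_funcs = { dgemm_cpu_func },
>     .nbuffers  = 3,
>     .modes     = { STARPU_R, STARPU_R, STARPU_RW },
>     .model     = &dgemm_model,
> };
> %--------
>
> Even with such a model declared, a CPU-only scheduler like lws has no
> CPU-versus-GPU choice to make, so the model is simply not consulted.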
>
> You should perhaps discuss with the Chameleon developers to see if they have
> an idea about what might differ between your Plasma implementation and the
> Chameleon implementation. I do not know the Chameleon implementation
> details.
>
> Otherwise, you can try to generate FXT traces from the native StarPU
> execution and output the Gantt diagram of the execution to see what the
> work/sleep pattern of workers looks like.
>
> Is there any significant difference in the task submission order or pattern
> between Plasma and Chameleon? Do you submit all tasks at once, or do you
> have several submit/task_wait phases? (See the sketch below.)
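>
> In other words (a sketch; submit_step() is a placeholder for whatever
> submits the tasks of step k in your code):
>
> %--------
> #include <starpu.h>
>
> extern void submit_step(int k);   /* placeholder: submits the tasks of step k */
>
> void pattern_a(int nt)            /* submit the whole DAG, then wait once */
> {
>     for (int k = 0; k < nt; k++)
>         submit_step(k);
>     starpu_task_wait_for_all();
> }
>
> void pattern_b(int nt)            /* phased submission */
> {
>     for (int k = 0; k < nt; k++) {
>         submit_step(k);
>         starpu_task_wait_for_all();   /* each wait is a global barrier and
>                                        * limits overlap between steps */
>     }
> }
> %--------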
>
> Best regards,
> --
> Olivier
>
>> On 20 Sep 2017, at 11:13, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
>> wrote:
>>
>>
>> Hi Olivier,
>>
>> Thanks for the additional details.
>> I also reproduced your numbers with Chameleon,
>> which confirms that the problem is not related to
>> the KNL configuration but rather to our PLASMA+StarPU code.
>>
>> I should say that so far we are not using any performance model,
>> while Chameleon seems to provide some guidance.
>> While missing priorities may be a potential explanation,
>> I am still puzzled, because I observed the same low-performance
>> issue for DGEMM, and DGEMM relies on a single kernel and does not need any
>> priorities.
>>
>> KSTAR has the same issue on KNL as well, which suggests that our native
>> StarPU code and KSTAR share a similar bottleneck. It is all the more
>> curious that on a 20-core Intel Haswell and on a 28-core Intel Broadwell,
>> KSTAR, our native StarPU and Chameleon have comparable performance. In
>> addition, our native StarPU and KSTAR slightly outperform Chameleon for
>> large matrices on the 28-core Intel Broadwell.
>>
>>
>> To sum up, our finding so far is that on regular Intel architectures, all
>> the libraries have similar performance. On Intel KNL, PLASMA+OpenMP and
>> Chameleon (StarPU) provide satisfactory results, while KSTAR and
>> our native PLASMA+StarPU suffer from a serious performance penalty. Your
>> further investigation revealed that the penalty might be due to the fact
>> that the native PLASMA+StarPU spends twice as much time sleeping as
>> Chameleon (also based on StarPU).
>> Knowing that DGEMM, which does not require any priorities, suffers from the
>> same issue, do you have in mind other potential factors that may cause such
>> an effect?
>>
>> PS: Your Chameleon experiments are based on "time_dpotrf_tile", which is
>> a tile Cholesky kernel that does not include the time for layout
>> conversion (LAPACK layout to tile layout, and conversion back). That is
>> why you can observe higher performance than the OpenMP-based version.
>> For the sake of comparison, "time_dpotrf" should be preferred. But the
>> performance difference is reasonably small.
>> Thanks for the support.
>>
>> Best regards,
>> --Mawussi
>>
>>
>>
>> From: Olivier Aumage [olivier.aumage@inria.fr]
>> Sent: Tuesday, September 19, 2017 3:31 PM
>> To: Mawussi Zounon
>> Cc: Negin Bagherpour; starpu-devel@lists.gforge.inria.fr; Jakub Sistek
>> Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel
>> self-hosted KNL
>>
>> Hi Mawussi,
>>
>> I have been able to test your port of Plasma over StarPU on the same KNL
>> machine I used with Chameleon+StarPU. I used the numactl method of MCDRAM
>> allocation. The KNL is in flat,quad mode. The configure and environment
>> settings for StarPU were the same as with Chameleon. The StarPU scheduler
>> is 'lws'.
>>
>> Out of the box, I get the results in the attached 'plasma_starpu.txt' file,
>> which are similar to what you obtained, with a maximum of ~740 GFlop/s
>> (bs=660).
>> I obtained slightly better results (827 GFlop/s, bs=480) by compiling your
>> Plasma library without the '-fopenmp' flag. However, this is still much
>> below what we should obtain.
>>
>> Thus, I ran a test with StarPU workers' statistics enabled, with the
>> following environment variables, for N=20000 and BS=420:
>> export STARPU_PROFILING=1
>> export STARPU_WORKER_STATS=1
>>
>> The results for Plasma/StarPU and Chameleon/StarPU are in the
>> worker_activity_*.txt files attached. You will see that for both libs:
>> - the workers execute roughly the same number of kernels: ~320 tasks;
>> - the worker time spent executing is roughly the same: ~ 1700 ms;
>> - the worker time spent sleeping for the Plasma+StarPU execution (~950 ms)
>> is slightly more than 2x the time spent sleeping for the Chameleon+StarPU
>> execution (~380ms).
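>>
>> In case it helps, the same per-worker breakdown can also be read
>> programmatically after the computation (a sketch; profiling must have been
>> enabled before the tasks ran, e.g. via STARPU_PROFILING=1):
>>
>> %--------
>> #include <starpu.h>
>> #include <starpu_profiling.h>
>> #include <stdio.h>
>>
>> void print_worker_times(void)
>> {
>>     unsigned nworkers = starpu_worker_get_count();
>>     for (unsigned w = 0; w < nworkers; w++) {
>>         struct starpu_profiling_worker_info info;
>>         if (starpu_profiling_worker_get_info(w, &info) != 0)
>>             continue;   /* profiling disabled or invalid worker */
>>         printf("worker %2u: exec %7.0f ms  sleep %7.0f ms  %d tasks\n", w,
>>                starpu_timing_timespec_to_us(&info.executing_time) / 1000.0,
>>                starpu_timing_timespec_to_us(&info.sleeping_time) / 1000.0,
>>                info.executed_tasks);
>>     }
>> }
>> %--------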
>>
>> Thus, this strongly suggests that the Plasma+StarPU execution suffers from
>> a lack of parallelism, likely due to the lack of priorities to guide the
>> execution along the critical path.
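>>
>> If you add priorities, here is a sketch of what it could look like with
>> starpu_task_insert() (the codelet and handle names are placeholders for
>> PLASMA's actual ones, not its real code):
>>
>> %--------
>> #include <starpu.h>
>>
>> extern struct starpu_codelet dpotrf_cl, dgemm_cl;   /* placeholders */
>>
>> void submit_step(starpu_data_handle_t Akk, starpu_data_handle_t Aik,
>>                  starpu_data_handle_t Ajk, starpu_data_handle_t Aij)
>> {
>>     /* Panel factorization is on the critical path: give it top priority. */
>>     starpu_task_insert(&dpotrf_cl,
>>                        STARPU_PRIORITY, STARPU_MAX_PRIO,
>>                        STARPU_RW, Akk,
>>                        0);
>>     /* Trailing update tasks can keep the default priority. */
>>     starpu_task_insert(&dgemm_cl,
>>                        STARPU_PRIORITY, STARPU_DEFAULT_PRIO,
>>                        STARPU_R, Aik, STARPU_R, Ajk,
>>                        STARPU_RW, Aij,
>>                        0);
>> }
>> %--------
>>
>> Whether the hint is exploited depends on the scheduling policy, but it
>> costs nothing to expose it.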
>>
>> Best regards,
>> --
>> Olivier
>>
>>
>>
>>
>>> On 19 Sep 2017, at 10:16, Olivier Aumage <olivier.aumage@inria.fr>
>>> wrote:
>>>
>>> Dear Mawussi,
>>>
>>> I was curious to check the impact of "numactl -m 1" versus hbw_malloc() for
>>> StarPU. I had used hbw_malloc() only for allocating the matrix, while
>>> "numactl -m 1" puts every data structure (even StarPU's task queues, data
>>> handles and synchronization objects) into the MCDRAM. Since the MCDRAM has
>>> a higher bandwidth but also a higher latency, I did not know whether the
>>> benefit of the higher bandwidth would be offset by higher latency costs on
>>> the synchronization objects.
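>>>
>>> For reference, the matrix-only allocation was done along these lines
>>> (a sketch; alloc_matrix() is illustrative, not the actual Chameleon code):
>>>
>>> %--------
>>> #include <hbwmalloc.h>   /* MemKind; link with -lmemkind */
>>> #include <stdlib.h>
>>>
>>> double *alloc_matrix(size_t n)
>>> {
>>>     if (hbw_check_available() == 0)   /* MCDRAM exposed as HBW memory */
>>>         return hbw_malloc(n * n * sizeof(double)); /* free: hbw_free() */
>>>     return malloc(n * n * sizeof(double));         /* fallback to DDR */
>>> }
>>> %--------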
>>>
>>> It turns out that the global "numactl -m 1" approach gives better results
>>> than the matrix-only hbw_malloc() approach. The best numactl result I
>>> obtained (with block size 448) is almost 200 GFlop/s higher than the best
>>> result with the matrix-only hbw_malloc():
>>>
>>> %----------------
>>> # CHAMELEON 0.9.1,
>>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>>> # Nb threads: 68
>>> # Nb GPUs: 0
>>> # Nb mpi: 1
>>> # PxQ: 1x1
>>> # NB: 448
>>> # IB: 32
>>> # eps: 1.110223e-16
>>> #
>>> # M N K/NRHS seconds Gflop/s Deviation
>>> 2000 2000 1 0.047 56.44 +- 2.11
>>> 4000 4000 1 0.101 211.59 +- 2.88
>>> 6000 6000 1 0.158 455.82 +- 5.43
>>> 8000 8000 1 0.222 770.25 +- 7.23
>>> 10000 10000 1 0.317 1052.57 +- 21.74
>>> 12000 12000 1 0.460 1251.44 +- 17.22
>>> 14000 14000 1 0.676 1353.75 +- 20.59
>>> 16000 16000 1 0.962 1420.32 +- 16.61
>>> 18000 18000 1 1.329 1462.70 +- 21.28
>>> 20000 20000 1 1.789 1490.79 +- 9.21
>>> %----------------
>>>
>>> Here are the test settings:
>>>
>>> - libhwloc:
>>> . version 1.11.7
>>> . no specific settings
>>>
>>> - StarPU:
>>> . Version: Subversion repository, branch trunk/, revision r22030
>>> . Compiler: Intel 17
>>> . Configure flags (I give a snippet from my GNU Bash script):
>>> %--------
>>> declare -a cfg
>>> cfg+=("--enable-shared")
>>> cfg+=("--disable-cuda")
>>> cfg+=("--disable-opencl")
>>> cfg+=("--disable-socl")
>>> cfg+=("--without-fxt")
>>> cfg+=("--disable-debug")
>>> cfg+=("--enable-fast")
>>> cfg+=("--disable-verbose")
>>> cfg+=("--disable-gcc-extensions")
>>> cfg+=("--disable-mpi-check")
>>> cfg+=("--disable-starpu-top")
>>> cfg+=("--disable-starpufft")
>>> cfg+=("--disable-build-doc")
>>> cfg+=("--disable-openmp")
>>> cfg+=("--disable-fortran")
>>> cfg+=("--disable-build-tests")
>>> cfg+=("--disable-build-examples")
>>> cfg+=("--enable-mpi")
>>> cfg+=("--enable-blas-lib=none")
>>> cfg+=("--disable-mlr")
>>> cfg+=("--enable-maxcpus=72")
>>> $STARPU_SRC_DIR/configure --prefix=$STARPU_INSTALL_DIR "${cfg[@]}"
>>> %--------
>>>
>>> - Chameleon settings:
>>> . compiler / mkl: Intel 17
>>> . cmake flags:
>>> cmake \
>>> -DCHAMELEON_ENABLE_EXAMPLE=OFF \
>>> -DBLAS_VERBOSE=ON \
>>> -DCHAMELEON_USE_CUDA=OFF \
>>> -DCHAMELEON_USE_MPI=ON \
>>> -DCHAMELEON_SIMULATION=OFF \
>>> -DCHAMELEON_SCHED_STARPU=ON \
>>> -DCMAKE_INSTALL_PREFIX=$HOME/Linalg/install-ch \
>>> -DCMAKE_C_COMPILER=icc \
>>> -DCMAKE_CXX_COMPILER=icpc \
>>> -DCMAKE_Fortran_COMPILER=ifort \
>>> -DCMAKE_BUILD_TYPE=Release \
>>> ../chameleon.git
>>>
>>> - Launch settings:
>>> STARPU_NCPU=68 STARPU_SCHED=lws numactl -m 1
>>> ./install-ch/lib/chameleon/timing/time_dpotrf_tile -N 2000:20000:2000 -b
>>> 448 --niter 10
>>>
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>> On 18 Sep 2017, at 23:09, Mawussi Zounon
>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>
>>>> Dear Olivier,
>>>>
>>>> Thanks for taking the time to run the experiments.
>>>> It is quite comforting to see your results; they are quite
>>>> close to the numbers I got when using OpenMP.
>>>>
>>>> I re-ran the tests using the same nb as you,
>>>> but my performance did not improve.
>>>>
>>>> I have noticed two main differences:
>>>>
>>>> • I am using "numactl -m 1" while you are allocating the memory
>>>> via hbw_malloc(). From my experiments, both ways are equivalent in
>>>> terms of performance.
>>>> • I have just noticed that my KNL is currently configured in hybrid
>>>> mode: 8 GB allocatable and 8 GB in cache mode. I will push to have the
>>>> machine set in flat mode and then run the experiment again.
>>>> Did you use any specific installation flags when installing StarPU on KNL?
>>>>
>>>> Best regards,
>>>> --Mawussi
>>>>
>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>> Sent: Monday, September 18, 2017 4:43 PM
>>>> To: Mawussi Zounon
>>>> Cc: Samuel Thibault; Jakub Sistek; Negin Bagherpour;
>>>> starpu-devel@lists.gforge.inria.fr
>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>
>>>> Dear Mawussi,
>>>>
>>>> Thanks for the tarball and the Google sheet. I will run it to try to
>>>> understand what is going on.
>>>>
>>>> Today I managed to run a dpotrf test from Chameleon with native StarPU.
>>>> I modified Chameleon to use hbw_malloc() for the matrix. The test was
>>>> run on the machine Frioul from CINES
>>>> (https://www.cines.fr/le-supercalculateur-frioul/), with the following
>>>> specs:
>>>> - Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (mode quad, flat)
>>>> - Intel icc/ifort 17 + mkl 17
>>>> - StarPU scheduler: lws
>>>>
>>>> The best result I got is the following one, using a block size of 424,
>>>> reaching about 1.3 TFlop/s for 20000x20000:
>>>> #---------------
>>>> #
>>>> # CHAMELEON 0.9.1,
>>>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>>>> # Nb threads: 68
>>>> # Nb GPUs: 0
>>>> # Nb mpi: 1
>>>> # PxQ: 1x1
>>>> # NB: 424
>>>> # IB: 32
>>>> # eps: 1.110223e-16
>>>> #
>>>> # M N K/NRHS seconds Gflop/s Deviation
>>>> 2000 2000 1 0.045 59.15 4.64
>>>> 4000 4000 1 0.093 229.03 3.86
>>>> 6000 6000 1 0.152 472.40 5.69
>>>> 8000 8000 1 0.230 742.36 9.10
>>>> 10000 10000 1 0.351 950.89 5.99
>>>> 12000 12000 1 0.537 1072.95 11.30
>>>> 14000 14000 1 0.783 1167.68 10.11
>>>> 16000 16000 1 1.108 1232.57 6.83
>>>> 18000 18000 1 1.543 1259.95 7.27
>>>> 20000 20000 1 2.037 1309.28 6.66
>>>> #---------------
>>>>
>>>> The results are highly sensitive to the block size. I also attach a plot
>>>> showing the performance for various block sizes. It seems I used smaller
>>>> blocks than in your tests. I do not know at this time whether this is
>>>> the main explanation for the performance difference we see.
>>>>
>>>> I will study the data you sent me and come back to you asap.
>>>>
>>>> Thanks again.
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>>
>>>>
>>>>> On 18 Sep 2017, at 17:15, Mawussi Zounon
>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>
>>>>> Dear Olivier,
>>>>> Please find attached the tarball of the StarPU version of PLASMA.
>>>>> The compilation should be straightforward, but the make.inc can be
>>>>> customized.
>>>>> The tests are in the directory "test".
>>>>> To test dgemm for example, you can run:
>>>>> STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size
>>>>> --nb=block_size --test=n
>>>>>
>>>>> "numactl -m 1" to specify to allocate the memory in the MCDRAM
>>>>> "--dim" to specify the size of the problem
>>>>> "--nb" to specify the block size
>>>>> "--test=n" to disable testing, to save the benchmark time.
>>>>>
>>>>> In general ./test "routine_name" --help will give you more details.
>>>>>
>>>>> I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
>>>>>
>>>>> I also shared a Google sheet with you to give you an idea of the optimal
>>>>> NBs.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --Mawussi
>>>>>
>>>>> ________________________________________
>>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>>> Sent: Monday, September 18, 2017 9:55 AM
>>>>> To: Mawussi Zounon
>>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>>
>>>>> Hi Mawussi,
>>>>>
>>>>> I would like to try to reproduce your results with native StarPU and
>>>>> with LAPACK on KNL, to hopefully narrow down the search space
>>>>> for possible explanations. Is it possible to have a tar of your current
>>>>> native StarPU port, with the 'configure' options you use, the
>>>>> environment variables (if any), and the testing program?
>>>>>
>>>>> Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use
>>>>> the current version from Netlib or is it a modified version? Could you
>>>>> give me the Makefile settings and environment variables, if any?
>>>>>
>>>>> For allocating the matrix in MCDRAM, do you use hbw_malloc() from the
>>>>> MemKind library, or do you use some other means? Do you get similar,
>>>>> better or worse results when the MCDRAM is in 'cache' mode?
>>>>>
>>>>> Thanks in advance.
>>>>> Best regards,
>>>>> --
>>>>> Olivier
>>>>>
>>>>>> On 15 Sep 2017, at 11:49, Mawussi Zounon
>>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>>
>>>>>> Dear Olivier,
>>>>>>
>>>>>> Thanks for pointing out the difference between KSTAR and native StarPU.
>>>>>> I thought KSTAR performed a source-to-source compilation,
>>>>>> completely replacing all OpenMP features with StarPU equivalents.
>>>>>> But from your explanation, I have the impression that some OpenMP
>>>>>> features remain in the executable produced by KSTAR, and that this
>>>>>> impacts the behaviour of StarPU.
>>>>>> Can you provide us with a relevant reference on KSTAR for a better
>>>>>> understanding of how it works?
>>>>>>
>>>>>> The behaviour of the main thread seems a reasonable lead to
>>>>>> investigate for reducing the performance penalty
>>>>>> of the native StarPU code for small matrices.
>>>>>>
>>>>>> On KNL, when using the MCDRAM, we used to observe some performance
>>>>>> drop even for MKL
>>>>>> beyond certain matrix sizes. We have some potential explanations, but
>>>>>> we need further experiments for confirmation.
>>>>>> However, even for reasonably small matrices, both the native
>>>>>> StarPU and the KSTAR-generated code fail
>>>>>> to exploit the 68 cores efficiently, and their performance can even be
>>>>>> worse than LAPACK's.
>>>>>> I think we should pay close attention to the behaviour of StarPU on
>>>>>> the Intel self-hosted KNL.
>>>>>>
>>>>>> Regarding the question on the choice of the block size, I reported
>>>>>> only the auto-tuned results.
>>>>>> For each executable (OMP, STARPU, KSTAR) and each matrix size, we
>>>>>> perform
>>>>>> a sweep over a large block-size space. In general, for a given
>>>>>> matrix size,
>>>>>> OMP, STARPU, and KSTAR achieve their highest performance for almost the
>>>>>> same block size.
>>>>>> But for some routines, larger block sizes (fewer tasks) seem to
>>>>>> benefit the native StarPU.
>>>>>>
>>>>>> If it helps, please find attached some results on an Intel Broadwell, a
>>>>>> 28-core NUMA node (14 cores per socket).
>>>>>> These results are very similar to the ones obtained on the 20-core
>>>>>> Haswell. OMP, KSTAR and STARPU
>>>>>> have similar asymptotic performance, while the native StarPU is
>>>>>> penalized for small matrices.
>>>>>> To some extent, this confirms that the results on KNL have some serious
>>>>>> issue that is worth
>>>>>> investigating.
>>>>>>
>>>>>> Best regards,
>>>>>> --Mawussi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>>>> Sent: Thursday, September 14, 2017 2:51 PM
>>>>>> To: Mawussi Zounon
>>>>>> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek;
>>>>>> Negin Bagherpour
>>>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>>>
>>>>>> Hi Mawussi,
>>>>>>
>>>>>> The main difference between StarPU used alone and StarPU used through
>>>>>> the OpenMP compatibility layer concerns the 'main' thread of the
>>>>>> application:
>>>>>>
>>>>>> - When StarPU is used alone on an N-core machine, it launches N worker
>>>>>> threads bound to the N cores. (Thus there is a total of N+1 threads: 1
>>>>>> main application thread + N StarPU workers.) The main application
>>>>>> thread is only used for submitting tasks to StarPU; it does not
>>>>>> execute tasks, and it is not bound to a core unless the application
>>>>>> binds it explicitly (see the sketch after this list).
>>>>>>
>>>>>> - When StarPU is used through the OpenMP compatibility layer on an
>>>>>> N-core machine, it launches N-1 worker threads bound to N-1 cores.
>>>>>> (Thus there is a total of N threads: 1 main application thread + N-1
>>>>>> StarPU workers.) The main application thread is used for submitting
>>>>>> tasks _and_ for participating in task execution while blocked on some
>>>>>> barrier (e.g.: omp taskwait, implicit barriers at the end of parallel
>>>>>> regions, ...). This behaviour is required for compliance with the
>>>>>> OpenMP execution model.
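>>>>>>
>>>>>> As mentioned in the first case, binding the main thread is left to the
>>>>>> application; one generic way to do it on Linux is sketched below (this
>>>>>> is plain sched_setaffinity(), not a StarPU API, and whether it helps
>>>>>> depends on whether a core can be spared for the main thread):
>>>>>>
>>>>>> %--------
>>>>>> #define _GNU_SOURCE
>>>>>> #include <sched.h>
>>>>>>
>>>>>> /* Pin the calling (main) thread on logical CPU 'cpu'. */
>>>>>> int bind_main_thread(int cpu)
>>>>>> {
>>>>>>     cpu_set_t set;
>>>>>>     CPU_ZERO(&set);
>>>>>>     CPU_SET(cpu, &set);
>>>>>>     return sched_setaffinity(0, sizeof(set), &set); /* 0 = this thread */
>>>>>> }
>>>>>> %--------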
>>>>>>
>>>>>> I am not sure whether this difference is the unique cause of the
>>>>>> performance mismatch you observed, but it probably accounts for some
>>>>>> significant part at least, generally in favor of StarPU+OpenMP.
>>>>>> This may be the main factor for the difference on small matrices,
>>>>>> where the main thread of the StarPU+OpenMP version can quickly lend a
>>>>>> hand to the computation, while the main thread of the native StarPU
>>>>>> version must first be de-scheduled by the OS kernel to leave its core
>>>>>> to a worker thread.
>>>>>>
>>>>>> On the other hand, the management of tasks for StarPU+OpenMP is more
>>>>>> expensive than the management of native StarPU tasks, due to the fact
>>>>>> that StarPU+OpenMP tasks may block while native StarPU tasks never
>>>>>> block. This management difference is therefore in favor of the native
>>>>>> StarPU version. The additional management cost is perhaps more
>>>>>> pronounced on the KNL, where the cores are much less advanced than
>>>>>> regular Intel Xeon cores.
>>>>>>
>>>>>> I do not know why the GFlop/s drop sharply for dgemm.KNL at matrix
>>>>>> sizes >= 14000. I could only think of some NUMA issue, but this should
>>>>>> not be the case since you say that the matrix is allocated in MCDRAM.
>>>>>>
>>>>>> I do not know either why the StarPU+OpenMP plot and the native StarPU
>>>>>> plot cross for the dpotrf.KNL test case.
>>>>>>
>>>>>> How do you choose the block size for each test sample? Is it fixed
>>>>>> for all matrix sizes or is it computed from the matrix size? Do you
>>>>>> observe very different behaviour for other block sizes (e.g. fewer
>>>>>> tasks on large blocks, more tasks on small blocks, ...)?
>>>>>>
>>>>>> Best regards,
>>>>>> --
>>>>>> Olivier
>>>>>>
>>>>>>> On 14 Sep 2017, at 11:00, Mawussi Zounon
>>>>>>> <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> Recently we developed a new version of the PLASMA library fully based
>>>>>>> on the OpenMP task-based
>>>>>>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell and
>>>>>>> Broadwell in this experiment) and Intel KNL showed that the new
>>>>>>> OpenMP PLASMA has performance comparable to the old version based
>>>>>>> on QUARK. This motivated us to extend the experiment to StarPU.
>>>>>>> To this end, on the one hand we used KSTAR to generate a StarPU
>>>>>>> version of PLASMA. On the other hand, we developed another version of
>>>>>>> PLASMA (restricted to a few routines) based on StarPU.
>>>>>>> It is important to note that the algorithms are the same; we simply
>>>>>>> replaced the task-based runtime system. Below are our findings:
>>>>>>>
>>>>>>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
>>>>>>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
>>>>>>> comparable performance, except for very small matrices, where our
>>>>>>> hand-written StarPU version of PLASMA is outperformed by the generic
>>>>>>> KSTAR.
>>>>>>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
>>>>>>> version and KSTAR are significantly slower than OMP. But again,
>>>>>>> KSTAR and our StarPU version exhibited different performance
>>>>>>> behaviour.
>>>>>>> I am wondering whether you can provide us with some hints or guidance
>>>>>>> to improve the performance of StarPU on the Intel KNL architecture.
>>>>>>> There might be some configuration options I missed. In addition, I
>>>>>>> would be happy if you could help us understand why our StarPU version
>>>>>>> seems more penalized for small matrices while KSTAR seems to be
>>>>>>> doing relatively better.
>>>>>>>
>>>>>>> Below are some performance charts of dgemm and Cholesky (dpotrf) to
>>>>>>> illustrate our observations:
>>>>>>> <dgemm_haswell_rutimes.png>
>>>>>>>
>>>>>>>
>>>>>>> <dpotrf_haswell_rutimes.png>
>>>>>>>
>>>>>>> <dgemm_knl_rutimes.png>
>>>>>>>
>>>>>>> <dpotrf_knl_rutimes.png>
>>>>>>>
>>>>>>> For the experiments on KNL, the matrices have been allocated in the
>>>>>>> MCDRAM.
>>>>>>>
>>>>>>> I am looking forward to hearing from you.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> --Mawussi
>>>>>>
>>>>>> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
>>>>>
>>>>> <plasma_starpu.tar>
>>>
>>> _______________________________________________
>>> Starpu-devel mailing list
>>> Starpu-devel@lists.gforge.inria.fr
>>> https://lists.gforge.inria.fr/mailman/listinfo/starpu-devel
>




