starpu-devel - Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

From: Olivier Aumage <olivier.aumage@inria.fr>
To: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
Date: Tue, 19 Sep 2017 10:16:34 +0200
List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Dear Mawussi,

I was curious to check the impact of "numactl -m 1" compared with
hbw_malloc() for StarPU. I used hbw_malloc() only for allocating the matrix,
while "numactl -m 1" puts every data structure (even StarPU's task queues,
data handles and synchronization objects) into the MCDRAM. Since the MCDRAM
has a higher bandwidth but also a higher latency, I did not know whether the
benefit of the higher bandwidth would be offset by higher latency costs on
synchronization objects.

It turns out that the global "numactl -m 1" approach gives better results
than the matrix-only hbw_malloc() approach. The best numactl result I
obtained (with block size 448) is about 180 GFlop/s higher than the best
result with the matrix-only hbw_malloc() (1490.79 vs. 1309.28 GFlop/s for
20000x20000):

%----------------
# CHAMELEON 0.9.1, /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
# Nb threads: 68
# Nb GPUs: 0
# Nb mpi: 1
# PxQ: 1x1
# NB: 448
# IB: 32
# eps: 1.110223e-16
#
#     M      N  K/NRHS  seconds  Gflop/s  Deviation
   2000   2000       1    0.047    56.44  +-  2.11
   4000   4000       1    0.101   211.59  +-  2.88
   6000   6000       1    0.158   455.82  +-  5.43
   8000   8000       1    0.222   770.25  +-  7.23
  10000  10000       1    0.317  1052.57  +- 21.74
  12000  12000       1    0.460  1251.44  +- 17.22
  14000  14000       1    0.676  1353.75  +- 20.59
  16000  16000       1    0.962  1420.32  +- 16.61
  18000  18000       1    1.329  1462.70  +- 21.28
  20000  20000       1    1.789  1490.79  +-  9.21
%----------------
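
As a rough cross-check of the table above, the Gflop/s column can be
recomputed from N and the elapsed time. This is only a sketch: it assumes
the timing program uses the usual Cholesky operation count
N^3/3 + N^2/2 + N/6, which is not stated in this message, and the function
names are illustrative. The two reference rates are the 20000x20000 rows of
this run and of the earlier matrix-only hbw_malloc() run (NB = 424) quoted
further down.

```python
def dpotrf_flops(n):
    """Usual Cholesky flop count (assumed, not confirmed for this timer)."""
    return n**3 / 3 + n**2 / 2 + n / 6


def gflops(n, seconds):
    """Rate in Gflop/s for an n x n factorization taking `seconds`."""
    return dpotrf_flops(n) / seconds / 1e9


# 20000x20000 rows of the two result tables in this thread.
numactl_rate = 1490.79  # "numactl -m 1", NB = 448
hbw_rate = 1309.28      # matrix-only hbw_malloc(), NB = 424

recomputed = gflops(20000, 1.789)
gain = numactl_rate - hbw_rate

print(f"recomputed from N and seconds: {recomputed:.1f} Gflop/s")
print(f"gain of numactl over hbw_malloc: {gain:.2f} Gflop/s "
      f"({100 * gain / hbw_rate:.1f} %)")
```

The recomputed rate is expected to differ slightly from the reported one,
since the seconds column is rounded to three decimals.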

Here are the test settings:

- libhwloc:
. version 1.11.7
. no specific settings

- StarPU:
. Version: Subversion repository, branch trunk/, revision r22030
. Compiler: Intel 17
. Configure flags (I give a snippet from my GNU Bash script):
%--------
declare -a cfg
cfg+=("--enable-shared")
cfg+=("--disable-cuda")
cfg+=("--disable-opencl")
cfg+=("--disable-socl")
cfg+=("--without-fxt")
cfg+=("--disable-debug")
cfg+=("--enable-fast")
cfg+=("--disable-verbose")
cfg+=("--disable-gcc-extensions")
cfg+=("--disable-mpi-check")
cfg+=("--disable-starpu-top")
cfg+=("--disable-starpufft")
cfg+=("--disable-build-doc")
cfg+=("--disable-openmp")
cfg+=("--disable-fortran")
cfg+=("--disable-build-tests")
cfg+=("--disable-build-examples")
cfg+=("--enable-mpi")
cfg+=("--enable-blas-lib=none")
cfg+=("--disable-mlr")
cfg+=("--enable-maxcpus=72")
$STARPU_SRC_DIR/configure --prefix=$STARPU_INSTALL_DIR "${cfg[@]}"
%--------

- Chameleon settings:
. compiler / mkl: Intel 17
. cmake flags:
cmake \
-DCHAMELEON_ENABLE_EXAMPLE=OFF \
-DBLAS_VERBOSE=ON \
-DCHAMELEON_USE_CUDA=OFF \
-DCHAMELEON_USE_MPI=ON \
-DCHAMELEON_SIMULATION=OFF \
-DCHAMELEON_SCHED_STARPU=ON \
-DCMAKE_INSTALL_PREFIX=$HOME/Linalg/install-ch \
-DCMAKE_C_COMPILER=icc \
-DCMAKE_CXX_COMPILER=icpc \
-DCMAKE_Fortran_COMPILER=ifort \
-DCMAKE_BUILD_TYPE=Release \
../chameleon.git

- Launch settings:
STARPU_NCPU=68 STARPU_SCHED=lws numactl -m 1 \
./install-ch/lib/chameleon/timing/time_dpotrf_tile -N 2000:20000:2000 -b 448 \
--niter 10
Best regards,
--
Olivier

> On 18 Sep 2017, at 23:09, Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
> wrote:
>
> Dear Olivier,
>
> Thanks for taking the time to run the experiments.
> It is quite comforting to see your results. They are quite
> close to the numbers I got when using OpenMP.
>
> I re-ran the tests using the same nb as you,
> but my performance didn't improve.
>
> I have noticed two main differences:
>
> • I am using "numactl -m 1" while you are allocating the memory via
> hbw_malloc(). From my experiments, both ways are equivalent in terms of
> performance.
> • I have just noticed that my KNL is currently configured in hybrid
> mode: 8GB allocatable and 8GB in cache mode. I will ask for the
> machine to be set to flat mode and then run the experiment again.
>
> Did you use any specific installation flag when installing StarPU on KNL?
>
> Best regards,
> --Mawussi
>
> From: Olivier Aumage [olivier.aumage@inria.fr]
> Sent: Monday, September 18, 2017 4:43 PM
> To: Mawussi Zounon
> Cc: Samuel Thibault; Jakub Sistek; Negin Bagherpour;
> starpu-devel@lists.gforge.inria.fr
> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>
> Dear Mawussi,
>
> Thanks for the tarball and the Google sheet. I will run it to try to
> understand what is going on.
>
> Today I managed to run a dpotrf test from Chameleon with native StarPU. I
> modified Chameleon to use hbw_malloc() for the matrix. The test was run on
> the machine Frioul from CINES
> (https://www.cines.fr/le-supercalculateur-frioul/), with the following
> specs:
> - Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (quad, flat mode)
> - Intel icc/ifort 17 + mkl 17
> - StarPU scheduler: lws
>
> The best result I got is the following one, using a blocksize of 424,
> reaching about 1.3TFlop/s for 20000x20000:
> #---------------
> #
> # CHAMELEON 0.9.1, /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
> # Nb threads: 68
> # Nb GPUs: 0
> # Nb mpi: 1
> # PxQ: 1x1
> # NB: 424
> # IB: 32
> # eps: 1.110223e-16
> #
> #     M      N  K/NRHS  seconds  Gflop/s  Deviation
>    2000   2000       1    0.045    59.15       4.64
>    4000   4000       1    0.093   229.03       3.86
>    6000   6000       1    0.152   472.40       5.69
>    8000   8000       1    0.230   742.36       9.10
>   10000  10000       1    0.351   950.89       5.99
>   12000  12000       1    0.537  1072.95      11.30
>   14000  14000       1    0.783  1167.68      10.11
>   16000  16000       1    1.108  1232.57       6.83
>   18000  18000       1    1.543  1259.95       7.27
>   20000  20000       1    2.037  1309.28       6.66
> #---------------
>
> The results are highly sensitive to the block size. I also attach a plot
> showing the performance for various block sizes. It seems I used smaller
> blocks than in your tests. I do not know at this time whether this is the
> main explanation for the performance difference we see.
>
> I will study the data you sent me and come back to you asap.
>
> Thanks again.
> Best regards,
> --
> Olivier
>
>
>
> > On 18 Sep 2017, at 17:15, Mawussi Zounon
> > <mawussi.zounon@manchester.ac.uk> wrote:
> >
> > Dear Olivier,
> > Please find attached the tarball of the StarPU version of PLASMA.
> > The compilation should be straightforward, but the make.inc can be
> > customized.
> > The tests are in the directory "test".
> > To test dgemm, for example, you can run:
> > STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size --nb=block_size --test=n
> >
> > "numactl -m 1" allocates the memory in the MCDRAM
> > "--dim" specifies the size of the problem
> > "--nb" specifies the block size
> > "--test=n" disables testing, to save benchmark time.
> >
> > In general ./test "routine_name" --help will give you more details.
> >
> > I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
> >
> > I also shared a Google sheet with you to have an idea on the optimal NBs.
> >
> > Best regards,
> >
> > --Mawussi
> >
> > ________________________________________
> > From: Olivier Aumage [olivier.aumage@inria.fr]
> > Sent: Monday, September 18, 2017 9:55 AM
> > To: Mawussi Zounon
> > Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
> >
> > Hi Mawussi,
> >
> > I would like to try to reproduce your results with native StarPU and with
> > LAPACK on KNL, to hopefully narrow the search space for possible
> > explanations a little. Is it possible to have a tar of your current
> > native StarPU port, with the 'configure' options you use, the environment
> > variables (if any), and the testing program?
> >
> > Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use the
> > current version on Netlib or is it a modified version? Could you give me
> > the Makefile settings and environment variables, if any?
> >
> > For allocating the matrix in MCDRAM, do you use hbw_malloc() from the
> > MemKind library, or do you use some other means? Do you get similar,
> > better or worse results when the MCDRAM is in 'cache' mode?
> >
> > Thanks in advance.
> > Best regards,
> > --
> > Olivier
> >
> >> On 15 Sep 2017, at 11:49, Mawussi Zounon
> >> <mawussi.zounon@manchester.ac.uk> wrote:
> >>
> >> Dear Olivier,
> >>
> >> Thanks for pointing out the difference between KSTAR and native StarPU.
> >> I thought KSTAR performed a source-to-source compilation,
> >> completely replacing all OpenMP features with StarPU equivalents.
> >> But from your explanation, I have the impression that some OpenMP
> >> features remain in the executable produced by KSTAR, and that this
> >> affects the behaviour of StarPU.
> >> Could you please provide us with a relevant reference on KSTAR for a
> >> better understanding of how it works?
> >>
> >> The behaviour of the main thread seems a reasonable option to
> >> investigate in order to reduce the performance penalty
> >> of the native StarPU code for small matrices.
> >>
> >> On KNL, when using the MCDRAM, we used to observe a performance drop
> >> even for MKL starting from certain matrix sizes. We have some potential
> >> explanations but we need further experiments for confirmation.
> >> However, even for reasonably small matrices, both the native StarPU
> >> and the KSTAR-generated code fail to exploit the 68 cores efficiently,
> >> and their performance can even be worse than LAPACK.
> >> I think we should pay close attention to the behaviour of StarPU on
> >> the Intel self-hosted KNL.
> >>
> >> Regarding the question on the choice of the block size, I reported only
> >> the auto-tuned results.
> >> For each executable (OMP, STARPU, KSTAR) and each matrix size, we
> >> perform a sweep over a large "block size" space. In general, for a
> >> given matrix size, OMP, STARPU, and KSTAR achieve the highest
> >> performance for almost the same "block size".
> >> But for some routines, larger "block sizes" (fewer tasks) seem to benefit
> >> the native StarPU.
> >>
> >> In case it helps, please find attached some results on an Intel
> >> Broadwell, a 28-core NUMA node (14 cores per socket).
> >> These results are very similar to the ones obtained on the 20-core
> >> Haswell. OMP, KSTAR and STARPU
> >> have similar asymptotic performance, while the native StarPU is
> >> penalized for small matrices.
> >> To some extent, this confirms that the results on KNL have some serious
> >> issues that are worth investigating.
> >>
> >> Best regards,
> >> --Mawussi
> >>
> >>
> >>
> >>
> >> ________________________________________
> >> From: Olivier Aumage [olivier.aumage@inria.fr]
> >> Sent: Thursday, September 14, 2017 2:51 PM
> >> To: Mawussi Zounon
> >> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek;
> >> Negin Bagherpour
> >> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
> >>
> >> Hi Mawussi,
> >>
> >> The main difference between StarPU used alone and StarPU used through
> >> the OpenMP compatibility layer concerns the 'main' thread of the
> >> application:
> >>
> >> - When StarPU is used alone on an N-core machine, it launches N worker
> >> threads bound to the N cores. (Thus there is a total of N+1 threads: 1
> >> main application thread + N StarPU workers.) The main application thread
> >> is only used for submitting tasks to StarPU; it does not execute tasks
> >> and it is not bound to a core unless the application binds it explicitly.
> >>
> >> - When StarPU is used through the OpenMP compatibility layer on an
> >> N-core machine, it launches N-1 worker threads bound to N-1 cores. (Thus
> >> there is a total of N threads: 1 main application thread + N-1 StarPU
> >> workers.) The main application thread is used for submitting tasks _and_
> >> for participating in task execution while blocked on some barrier (e.g.
> >> omp taskwait, implicit barriers at the end of parallel regions, ...).
> >> This behaviour is required for compliance with the OpenMP execution model.
> >>
> >> I am not sure whether this difference is the sole cause of the
> >> performance mismatch you observed, but it probably accounts for a
> >> significant part at least, generally in favor of StarPU+OpenMP. This
> >> may be the main factor for the difference on small matrices, where the
> >> main thread of the StarPU+OpenMP version can quickly give a hand to the
> >> computation, while the main thread of the native StarPU version must
> >> first be de-scheduled by the OS kernel to leave its core to a worker
> >> thread.
> >>
> >> On the other hand, the management of tasks for StarPU+OpenMP is more
> >> expensive than the management of native StarPU tasks, because
> >> StarPU+OpenMP tasks may block while native StarPU tasks never
> >> block. This management difference is therefore in favor of the native
> >> StarPU version. This additional management cost is perhaps higher
> >> on the KNL, where the cores are much less advanced than the
> >> regular Intel Xeon cores.
> >>
> >> I do not know why the GFLOPS drop sharply for dgemm.KNL for matrix size
> >> >= 14000. I could only think of some NUMA issue, but this should not be
> >> the case since you say that the matrix is allocated in MCDRAM.
> >>
> >> I do not know either why the StarPU+OpenMP plot and the native StarPU
> >> plot cross for the dpotrf.KNL test case.
> >>
> >> How do you choose the block size for each test sample? Is it fixed for
> >> all matrix sizes or is it computed from the matrix size? Do you observe
> >> very different behaviour for other block sizes (e.g. fewer tasks on
> >> large blocks, more tasks on small blocks, ...)?
> >>
> >> Best regards,
> >> --
> >> Olivier
> >>
> >>> On 14 Sep 2017, at 11:00, Mawussi Zounon
> >>> <mawussi.zounon@manchester.ac.uk> wrote:
> >>>
> >>>
> >>> Dear all,
> >>>
> >>> Recently we developed a new version of the PLASMA library fully based
> >>> on the OpenMP task-based
> >>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell and
> >>> Broadwell in the experiment) and Intel KNL showed that the new OpenMP
> >>> PLASMA has performance comparable to the old version based on QUARK.
> >>> This motivated us to extend the experiment to StarPU.
> >>> To this end, on the one hand we used KSTAR to generate a StarPU version
> >>> of PLASMA. On the other hand we developed another version of PLASMA
> >>> (restricted to a few routines) based on StarPU.
> >>> It is important to note that the algorithms are the same; we simply
> >>> replaced the task-based runtime system. Below are our findings:
> >>>
> >>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
> >>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
> >>> comparable performance, except for very small matrices, where our
> >>> hand-written StarPU version of PLASMA is outperformed by the generic
> >>> KSTAR.
> >>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
> >>> version and KSTAR are significantly slower than OMP. But again,
> >>> KSTAR and our StarPU version exhibited different performance
> >>> behaviour.
> >>>
> >>> I am wondering whether you can provide us with some hints or guidance
> >>> to improve the performance of StarPU on the Intel KNL architecture.
> >>> There might be some configuration options I missed. In addition, I would
> >>> be happy if you could help us understand why our StarPU version seems
> >>> more penalized for small matrices while KSTAR seems to be doing
> >>> relatively better.
> >>>
> >>> Below some performance charts of dgemm and Cholesky (dpotrf) to
> >>> illustrate our observations:
> >>> <dgemm_haswell_rutimes.png>
> >>>
> >>>
> >>> <dpotrf_haswell_rutimes.png>
> >>>
> >>> <dgemm_knl_rutimes.png>
> >>>
> >>> <dpotrf_knl_rutimes.png>
> >>>
> >>> For the experiments on KNL, the matrices have been allocated in the
> >>> MCDRAM.
> >>>
> >>> I am looking forward to hearing from you.
> >>>
> >>> Best regards,
> >>>
> >>> --Mawussi
> >>
> >> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
> >
> > <plasma_starpu.tar>
