
starpu-devel - Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL

  • From: Olivier Aumage <olivier.aumage@inria.fr>
  • To: Mawussi Zounon <mawussi.zounon@manchester.ac.uk>
  • Cc: Negin Bagherpour <negin.bagherpour@manchester.ac.uk>, "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>, Jakub Sistek <jakub.sistek@manchester.ac.uk>
  • Subject: Re: [Starpu-devel] Performance issue of StarPU on Intel self-hosted KNL
  • Date: Tue, 19 Sep 2017 16:31:36 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi Mawussi,

I have been able to test your port of Plasma over StarPU on the same KNL
machine I used with Chameleon+StarPU. I used the numactl method of MCDRAM
allocation. The KNL is in flat, quad mode. The configure and environment
settings for StarPU were the same as with Chameleon. The StarPU scheduler is
'lws'.

Out of the box, I get the results in the attached 'plasma_starpu.txt' file,
which are similar to what you obtained, with a maximum of ~740 GFlop/s
(bs=660).
I obtained slightly better results (~828 GFlop/s, bs=480) by compiling your
Plasma library without the '-fopenmp' flag. However, this is still well below
what we should obtain.

Thus, I ran a test with StarPU workers' statistics enabled, with the
following environment variables, for N=20000 and BS=420:
export STARPU_PROFILING=1
export STARPU_WORKER_STATS=1

The results for Plasma/StarPU and Chameleon/StarPU are in the
worker_activity_*.txt files attached. You will see that for both libs:
- the workers execute roughly the same number of kernels: ~320 tasks;
- the worker time spent executing is roughly the same: ~1700 ms;
- the worker time spent sleeping for the Plasma+StarPU execution (~950 ms) is
roughly 2.5x the time spent sleeping for the Chameleon+StarPU execution
(~380 ms).
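
These per-worker figures can be averaged directly from the attached statistics. The following is a minimal sketch that assumes the text layout shown in the attachments (including the mail-wrapped 'overhead' value); the function names and the two-worker sample are illustrative only.

```python
import re
from statistics import mean

# One record per worker; \s+ also spans the line break before the overhead value.
WORKER_RE = re.compile(
    r"CPU\s+(?P<cpu>\d+)\s+"
    r"(?P<tasks>\d+)\s+task\(s\)\s+"
    r"total:\s+(?P<total>[\d.]+)\s+ms\s+"
    r"executing:\s+(?P<executing>[\d.]+)\s+ms\s+"
    r"sleeping:\s+(?P<sleeping>[\d.]+)\s+ms\s+"
    r"overhead\s+(?P<overhead>[\d.]+)\s+ms"
)


def parse_worker_stats(text):
    """Return one dict per worker found in a worker-stats dump."""
    workers = []
    for match in WORKER_RE.finditer(text):
        rec = {key: float(value) for key, value in match.groupdict().items()}
        rec["cpu"] = int(rec["cpu"])
        rec["tasks"] = int(rec["tasks"])
        workers.append(rec)
    return workers


def summarize(workers):
    """Mean task count and mean times (ms) over all workers."""
    keys = ("tasks", "executing", "sleeping", "overhead")
    return {key: mean(w[key] for w in workers) for key in keys}


SAMPLE = """\
CPU 0
301 task(s)
total: 2134.52 ms executing: 1723.89 ms sleeping: 375.34 ms overhead
35.28 ms
CPU 1
313 task(s)
total: 2135.19 ms executing: 1725.19 ms sleeping: 369.44 ms overhead
40.56 ms
"""

if __name__ == "__main__":
    print(summarize(parse_worker_stats(SAMPLE)))
```

Running `summarize` over each of the two worker_activity files and comparing the mean 'sleeping' values is one way to reproduce the comparison.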

This strongly suggests that the Plasma+StarPU execution suffers from a
lack of parallelism, most likely because the tasks carry no priorities to
guide the execution along the critical path.
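
To illustrate why priorities can matter here, the toy model below list-schedules the task graph of a tiled Cholesky factorization on a fixed number of workers, once in submission order and once with tasks ordered by their distance to the end of the graph (a common critical-path priority). It is only a sketch: the kernel costs are assumed relative flop counts, the single central ready queue is not StarPU's 'lws' policy, and none of the names come from StarPU, Plasma or Chameleon.

```python
import heapq

# Assumed relative kernel costs, in units of nb^3/3 flops.
COST = {"POTRF": 1, "TRSM": 3, "SYRK": 3, "GEMM": 6}


def cholesky_dag(nt):
    """Dependencies of a right-looking tiled Cholesky on an nt x nt tile grid."""
    deps = {}
    for k in range(nt):
        deps[("POTRF", k)] = [("SYRK", k, k - 1)] if k else []
        for i in range(k + 1, nt):
            deps[("TRSM", i, k)] = [("POTRF", k)] + ([("GEMM", i, k, k - 1)] if k else [])
        for i in range(k + 1, nt):
            deps[("SYRK", i, k)] = [("TRSM", i, k)] + ([("SYRK", i, k - 1)] if k else [])
            for j in range(k + 1, i):
                deps[("GEMM", i, j, k)] = ([("TRSM", i, k), ("TRSM", j, k)]
                                           + ([("GEMM", i, j, k - 1)] if k else []))
    return deps  # insertion order is a valid submission order


def successors(deps):
    succ = {t: [] for t in deps}
    for task, needed in deps.items():
        for d in needed:
            succ[d].append(task)
    return succ


def bottom_levels(deps):
    """Longest path (in cost units) from each task to the end of the graph."""
    succ, level = successors(deps), {}
    for task in reversed(list(deps)):
        level[task] = COST[task[0]] + max((level[s] for s in succ[task]), default=0)
    return level


def makespan(deps, workers, priority):
    """Greedy list scheduling; the ready task with the smallest priority value runs first."""
    succ = successors(deps)
    order = {t: n for n, t in enumerate(deps)}      # tie-break: submission order
    missing = {t: len(needed) for t, needed in deps.items()}
    ready = [(priority(t), order[t], t) for t in deps if not missing[t]]
    heapq.heapify(ready)
    running, now, free = [], 0, workers
    while ready or running:
        while ready and free:                        # start as many tasks as possible
            _, _, task = heapq.heappop(ready)
            heapq.heappush(running, (now + COST[task[0]], order[task], task))
            free -= 1
        now, _, done = heapq.heappop(running)        # advance to the next completion
        free += 1
        for s in succ[done]:
            missing[s] -= 1
            if not missing[s]:
                heapq.heappush(ready, (priority(s), order[s], s))
    return now


if __name__ == "__main__":
    deps = cholesky_dag(48)   # 20000 / 420 rounds up to 48 tiles per dimension
    level = bottom_levels(deps)
    print("tasks:", len(deps), "critical path:", level[("POTRF", 0)])
    print("submission order:      ", makespan(deps, 68, lambda t: 0))
    print("critical-path priority:", makespan(deps, 68, lambda t: -level[t]))
```

Comparing the two makespans with the critical-path length gives a feel for how much idle time task ordering alone can remove in this model; the actual effect under StarPU depends on the scheduler and on the priorities set at submission.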

Best regards,
--
Olivier



Plasma+StarPU, default build (with -fopenmp):
Status Error Time Gflop/s uplo n nb padA zerocol
-- -- 10.6894 249.4877 l 20000 320 0 -1
-- -- 6.4205 415.3654 l 20000 340 0 -1
-- -- 5.3465 498.8102 l 20000 360 0 -1
-- -- 4.5037 592.1542 l 20000 380 0 -1
-- -- 3.9746 670.9789 l 20000 400 0 -1
-- -- 3.8990 683.9791 l 20000 420 0 -1
-- -- 3.9440 676.1759 l 20000 440 0 -1
-- -- 3.6349 733.6812 l 20000 460 0 -1
-- -- 3.7923 703.2355 l 20000 480 0 -1
-- -- 3.6944 721.8580 l 20000 500 0 -1
-- -- 3.6230 736.0843 l 20000 520 0 -1
-- -- 3.8940 684.8616 l 20000 540 0 -1
-- -- 3.6965 721.4551 l 20000 560 0 -1
-- -- 3.7800 705.5121 l 20000 580 0 -1
-- -- 3.7619 708.9065 l 20000 600 0 -1
-- -- 3.7002 720.7282 l 20000 620 0 -1
-- -- 3.9474 675.6010 l 20000 640 0 -1
-- -- 3.6043 739.9088 l 20000 660 0 -1
-- -- 4.0681 655.5481 l 20000 680 0 -1
-- -- 3.7802 705.4859 l 20000 700 0 -1
-- -- 3.8499 692.7020 l 20000 720 0 -1
-- -- 3.9957 667.4405 l 20000 740 0 -1
-- -- 3.8617 690.5855 l 20000 760 0 -1
-- -- 3.9026 683.3631 l 20000 780 0 -1
-- -- 4.0310 661.5817 l 20000 800 0 -1
-- -- 4.1012 650.2688 l 20000 820 0 -1
-- -- 3.7880 704.0372 l 20000 840 0 -1
-- -- 3.7988 702.0217 l 20000 860 0 -1
-- -- 3.8396 694.5633 l 20000 880 0 -1
-- -- 3.9301 678.5813 l 20000 900 0 -1
-- -- 3.8721 688.7334 l 20000 920 0 -1
-- -- 4.0102 665.0166 l 20000 940 0 -1
-- -- 3.9074 682.5215 l 20000 960 0 -1
Plasma+StarPU, built without -fopenmp:
Status Error Time Gflop/s uplo n nb padA zerocol
-- -- 3.7329 714.4142 l 20000 320 0 -1
-- -- 3.4350 776.3824 l 20000 340 0 -1
-- -- 3.3383 798.8649 l 20000 360 0 -1
-- -- 3.3598 793.7684 l 20000 380 0 -1
-- -- 3.2638 817.1128 l 20000 400 0 -1
-- -- 3.3539 795.1470 l 20000 420 0 -1
-- -- 3.3177 803.8387 l 20000 440 0 -1
-- -- 3.2974 808.7895 l 20000 460 0 -1
-- -- 3.2211 827.9289 l 20000 480 0 -1
-- -- 3.2365 824.0068 l 20000 500 0 -1
-- -- 3.3004 808.0332 l 20000 520 0 -1
-- -- 3.2506 820.4173 l 20000 540 0 -1
-- -- 3.2321 825.1270 l 20000 560 0 -1
-- -- 3.3409 798.2475 l 20000 580 0 -1
-- -- 3.2760 814.0511 l 20000 600 0 -1
-- -- 3.3648 792.5703 l 20000 620 0 -1
-- -- 3.5335 754.7457 l 20000 640 0 -1
-- -- 3.4081 782.5039 l 20000 660 0 -1
-- -- 3.5221 757.1834 l 20000 680 0 -1
-- -- 3.4949 763.0668 l 20000 700 0 -1
-- -- 3.4667 769.2922 l 20000 720 0 -1
-- -- 3.5446 752.3654 l 20000 740 0 -1
-- -- 3.4351 776.3510 l 20000 760 0 -1
-- -- 3.6408 732.4931 l 20000 780 0 -1
-- -- 3.5070 760.4410 l 20000 800 0 -1
-- -- 3.8288 696.5226 l 20000 820 0 -1
-- -- 3.6032 740.1362 l 20000 840 0 -1
-- -- 3.5647 748.1407 l 20000 860 0 -1
-- -- 3.6158 737.5586 l 20000 880 0 -1
-- -- 3.6838 723.9358 l 20000 900 0 -1
-- -- 3.6621 728.2395 l 20000 920 0 -1
-- -- 3.8525 692.2469 l 20000 940 0 -1
-- -- 3.6927 722.2020 l 20000 960 0 -1
Chameleon+StarPU worker activity (N=20000, NB=420):
Profiling throught FxT has not been enabled in StarPU runtime (configure StarPU with --with-fxt)
#
# CHAMELEON 0.9.1, /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
# Nb threads: 68
# Nb GPUs: 0
# Nb mpi: 1
# PxQ: 1x1
# NB: 420
# IB: 32
# eps: 1.110223e-16
#
# M N K/NRHS seconds Gflop/s Deviation
20000 20000 1 1.840 1449.35 +- 0.00

#---------------------
Worker stats:
CPU 0
301 task(s)
total: 2134.52 ms executing: 1723.89 ms sleeping: 375.34 ms overhead
35.28 ms
CPU 1
313 task(s)
total: 2135.19 ms executing: 1725.19 ms sleeping: 369.44 ms overhead
40.56 ms
CPU 2
329 task(s)
total: 2135.45 ms executing: 1709.23 ms sleeping: 384.55 ms overhead
41.67 ms
CPU 3
318 task(s)
total: 2135.63 ms executing: 1716.06 ms sleeping: 380.60 ms overhead
38.96 ms
CPU 4
331 task(s)
total: 2135.87 ms executing: 1713.67 ms sleeping: 379.02 ms overhead
43.18 ms
CPU 5
358 task(s)
total: 2136.08 ms executing: 1728.61 ms sleeping: 357.90 ms overhead
49.57 ms
CPU 6
317 task(s)
total: 2136.33 ms executing: 1694.58 ms sleeping: 392.22 ms overhead
49.54 ms
CPU 7
339 task(s)
total: 2136.56 ms executing: 1715.02 ms sleeping: 372.47 ms overhead
49.06 ms
CPU 8
321 task(s)
total: 2136.76 ms executing: 1726.37 ms sleeping: 370.03 ms overhead
40.36 ms
CPU 9
335 task(s)
total: 2136.95 ms executing: 1713.92 ms sleeping: 376.12 ms overhead
46.90 ms
CPU 10
338 task(s)
total: 2137.11 ms executing: 1717.04 ms sleeping: 370.09 ms overhead
49.98 ms
CPU 11
318 task(s)
total: 2137.33 ms executing: 1717.04 ms sleeping: 375.44 ms overhead
44.84 ms
CPU 12
320 task(s)
total: 2137.53 ms executing: 1689.57 ms sleeping: 405.22 ms overhead
42.74 ms
CPU 13
323 task(s)
total: 2137.73 ms executing: 1699.59 ms sleeping: 397.83 ms overhead
40.31 ms
CPU 14
330 task(s)
total: 2137.91 ms executing: 1711.72 ms sleeping: 385.29 ms overhead
40.90 ms
CPU 15
325 task(s)
total: 2138.08 ms executing: 1713.09 ms sleeping: 383.83 ms overhead
41.16 ms
CPU 16
330 task(s)
total: 2138.30 ms executing: 1701.77 ms sleeping: 397.50 ms overhead
39.03 ms
CPU 17
318 task(s)
total: 2138.51 ms executing: 1699.04 ms sleeping: 399.15 ms overhead
40.32 ms
CPU 18
323 task(s)
total: 2138.67 ms executing: 1715.05 ms sleeping: 383.60 ms overhead
40.01 ms
CPU 19
330 task(s)
total: 2138.86 ms executing: 1700.82 ms sleeping: 397.17 ms overhead
40.88 ms
CPU 20
328 task(s)
total: 2139.06 ms executing: 1699.11 ms sleeping: 395.21 ms overhead
44.74 ms
CPU 21
314 task(s)
total: 2139.27 ms executing: 1704.65 ms sleeping: 394.31 ms overhead
40.30 ms
CPU 22
327 task(s)
total: 2139.44 ms executing: 1722.93 ms sleeping: 372.05 ms overhead
44.47 ms
CPU 23
318 task(s)
total: 2139.63 ms executing: 1703.13 ms sleeping: 396.07 ms overhead
40.42 ms
CPU 24
318 task(s)
total: 2139.84 ms executing: 1712.86 ms sleeping: 387.20 ms overhead
39.78 ms
CPU 25
316 task(s)
total: 2140.00 ms executing: 1741.48 ms sleeping: 360.08 ms overhead
38.44 ms
CPU 26
330 task(s)
total: 2140.21 ms executing: 1727.55 ms sleeping: 368.57 ms overhead
44.10 ms
CPU 27
324 task(s)
total: 2140.39 ms executing: 1709.34 ms sleeping: 389.06 ms overhead
42.00 ms
CPU 28
338 task(s)
total: 2140.56 ms executing: 1719.64 ms sleeping: 376.69 ms overhead
44.24 ms
CPU 29
319 task(s)
total: 2140.75 ms executing: 1710.99 ms sleeping: 392.74 ms overhead
37.02 ms
CPU 30
324 task(s)
total: 2140.93 ms executing: 1705.84 ms sleeping: 390.84 ms overhead
44.26 ms
CPU 31
324 task(s)
total: 2141.13 ms executing: 1708.21 ms sleeping: 391.87 ms overhead
41.05 ms
CPU 32
325 task(s)
total: 2141.29 ms executing: 1717.27 ms sleeping: 383.52 ms overhead
40.50 ms
CPU 33
330 task(s)
total: 2141.47 ms executing: 1709.59 ms sleeping: 385.21 ms overhead
46.67 ms
CPU 34
316 task(s)
total: 2141.71 ms executing: 1736.90 ms sleeping: 363.31 ms overhead
41.49 ms
CPU 35
323 task(s)
total: 2141.91 ms executing: 1718.30 ms sleeping: 383.38 ms overhead
40.23 ms
CPU 36
312 task(s)
total: 2142.15 ms executing: 1707.05 ms sleeping: 398.93 ms overhead
36.17 ms
CPU 37
326 task(s)
total: 2142.33 ms executing: 1715.63 ms sleeping: 385.07 ms overhead
41.64 ms
CPU 38
316 task(s)
total: 2142.52 ms executing: 1712.01 ms sleeping: 389.72 ms overhead
40.79 ms
CPU 39
325 task(s)
total: 2142.70 ms executing: 1714.45 ms sleeping: 386.96 ms overhead
41.29 ms
CPU 40
323 task(s)
total: 2142.90 ms executing: 1708.08 ms sleeping: 394.50 ms overhead
40.32 ms
CPU 41
331 task(s)
total: 2143.12 ms executing: 1712.98 ms sleeping: 388.10 ms overhead
42.04 ms
CPU 42
325 task(s)
total: 2143.32 ms executing: 1716.30 ms sleeping: 385.83 ms overhead
41.19 ms
CPU 43
320 task(s)
total: 2143.51 ms executing: 1714.46 ms sleeping: 384.76 ms overhead
44.30 ms
CPU 44
336 task(s)
total: 2143.73 ms executing: 1705.32 ms sleeping: 393.78 ms overhead
44.63 ms
CPU 45
324 task(s)
total: 2143.88 ms executing: 1727.48 ms sleeping: 375.76 ms overhead
40.65 ms
CPU 46
322 task(s)
total: 2144.07 ms executing: 1738.34 ms sleeping: 364.39 ms overhead
41.33 ms
CPU 47
322 task(s)
total: 2144.25 ms executing: 1720.94 ms sleeping: 380.73 ms overhead
42.59 ms
CPU 48
320 task(s)
total: 2144.44 ms executing: 1692.07 ms sleeping: 407.64 ms overhead
44.73 ms
CPU 49
323 task(s)
total: 2144.64 ms executing: 1702.98 ms sleeping: 398.29 ms overhead
43.36 ms
CPU 50
334 task(s)
total: 2144.78 ms executing: 1721.57 ms sleeping: 381.88 ms overhead
41.34 ms
CPU 51
322 task(s)
total: 2145.03 ms executing: 1722.95 ms sleeping: 381.88 ms overhead
40.20 ms
CPU 52
315 task(s)
total: 2145.18 ms executing: 1698.10 ms sleeping: 408.28 ms overhead
38.80 ms
CPU 53
330 task(s)
total: 2145.38 ms executing: 1720.29 ms sleeping: 384.42 ms overhead
40.66 ms
CPU 54
311 task(s)
total: 2145.58 ms executing: 1706.82 ms sleeping: 399.80 ms overhead
38.97 ms
CPU 55
337 task(s)
total: 2145.77 ms executing: 1707.97 ms sleeping: 396.47 ms overhead
41.32 ms
CPU 56
309 task(s)
total: 2145.95 ms executing: 1705.46 ms sleeping: 404.70 ms overhead
35.80 ms
CPU 57
308 task(s)
total: 2146.18 ms executing: 1707.84 ms sleeping: 400.50 ms overhead
37.83 ms
CPU 58
316 task(s)
total: 2146.34 ms executing: 1715.02 ms sleeping: 392.94 ms overhead
38.38 ms
CPU 59
328 task(s)
total: 2146.51 ms executing: 1709.55 ms sleeping: 396.50 ms overhead
40.45 ms
CPU 60
312 task(s)
total: 2146.70 ms executing: 1714.29 ms sleeping: 394.00 ms overhead
38.41 ms
CPU 61
330 task(s)
total: 2146.89 ms executing: 1726.83 ms sleeping: 377.67 ms overhead
42.40 ms
CPU 62
321 task(s)
total: 2147.04 ms executing: 1713.30 ms sleeping: 394.19 ms overhead
39.54 ms
CPU 63
316 task(s)
total: 2147.27 ms executing: 1712.96 ms sleeping: 395.21 ms overhead
39.10 ms
CPU 64
329 task(s)
total: 2147.43 ms executing: 1702.46 ms sleeping: 406.43 ms overhead
38.53 ms
CPU 65
318 task(s)
total: 2147.64 ms executing: 1705.10 ms sleeping: 405.30 ms overhead
37.24 ms
CPU 66
312 task(s)
total: 2147.82 ms executing: 1712.24 ms sleeping: 402.58 ms overhead
33.00 ms
CPU 67
240 task(s)
total: 2147.99 ms executing: 1716.39 ms sleeping: 401.99 ms overhead
29.61 ms
#---------------------
Plasma+StarPU worker activity (N=20000, nb=420):
PLASMA ERROR at 36 of plasma_tuning_init() in control/tuning.c: PLASMA_TUNING_FILENAME not set

#---------------------
Worker stats:
CPU 0
224 task(s)
total: 2635.16 ms executing: 1650.45 ms sleeping: 919.94 ms overhead
64.76 ms
CPU 1
326 task(s)
total: 2667.98 ms executing: 1694.29 ms sleeping: 928.46 ms overhead
45.23 ms
CPU 2
320 task(s)
total: 2668.04 ms executing: 1677.69 ms sleeping: 948.26 ms overhead
42.09 ms
CPU 3
324 task(s)
total: 2668.08 ms executing: 1666.19 ms sleeping: 952.75 ms overhead
49.14 ms
CPU 4
326 task(s)
total: 2668.11 ms executing: 1703.74 ms sleeping: 923.34 ms overhead
41.04 ms
CPU 5
331 task(s)
total: 2668.14 ms executing: 1702.06 ms sleeping: 916.12 ms overhead
49.96 ms
CPU 6
331 task(s)
total: 2668.17 ms executing: 1686.35 ms sleeping: 927.81 ms overhead
54.02 ms
CPU 7
325 task(s)
total: 2668.20 ms executing: 1705.44 ms sleeping: 919.42 ms overhead
43.34 ms
CPU 8
336 task(s)
total: 2668.23 ms executing: 1676.53 ms sleeping: 954.19 ms overhead
37.50 ms
CPU 9
321 task(s)
total: 2668.26 ms executing: 1668.09 ms sleeping: 959.25 ms overhead
40.92 ms
CPU 10
314 task(s)
total: 2668.28 ms executing: 1678.23 ms sleeping: 947.93 ms overhead
42.13 ms
CPU 11
311 task(s)
total: 2668.31 ms executing: 1706.39 ms sleeping: 927.22 ms overhead
34.70 ms
CPU 12
322 task(s)
total: 2668.34 ms executing: 1691.21 ms sleeping: 936.12 ms overhead
41.01 ms
CPU 13
327 task(s)
total: 2668.37 ms executing: 1680.66 ms sleeping: 950.22 ms overhead
37.48 ms
CPU 14
350 task(s)
total: 2668.40 ms executing: 1700.47 ms sleeping: 922.47 ms overhead
45.46 ms
CPU 15
327 task(s)
total: 2668.42 ms executing: 1690.44 ms sleeping: 932.51 ms overhead
45.47 ms
CPU 16
335 task(s)
total: 2668.46 ms executing: 1679.87 ms sleeping: 914.97 ms overhead
73.61 ms
CPU 17
322 task(s)
total: 2668.49 ms executing: 1679.60 ms sleeping: 919.38 ms overhead
69.51 ms
CPU 18
320 task(s)
total: 2668.51 ms executing: 1664.93 ms sleeping: 940.98 ms overhead
62.61 ms
CPU 19
324 task(s)
total: 2668.54 ms executing: 1722.19 ms sleeping: 881.57 ms overhead
64.78 ms
CPU 20
310 task(s)
total: 2668.57 ms executing: 1668.24 ms sleeping: 934.98 ms overhead
65.35 ms
CPU 21
331 task(s)
total: 2668.60 ms executing: 1708.79 ms sleeping: 901.39 ms overhead
58.42 ms
CPU 22
322 task(s)
total: 2668.63 ms executing: 1672.79 ms sleeping: 934.30 ms overhead
61.54 ms
CPU 23
319 task(s)
total: 2668.66 ms executing: 1684.84 ms sleeping: 922.66 ms overhead
61.15 ms
CPU 24
321 task(s)
total: 2668.68 ms executing: 1662.19 ms sleeping: 936.73 ms overhead
69.77 ms
CPU 25
330 task(s)
total: 2668.71 ms executing: 1679.18 ms sleeping: 931.73 ms overhead
57.80 ms
CPU 26
321 task(s)
total: 2668.74 ms executing: 1693.45 ms sleeping: 904.13 ms overhead
71.16 ms
CPU 27
336 task(s)
total: 2668.75 ms executing: 1693.29 ms sleeping: 913.50 ms overhead
61.95 ms
CPU 28
324 task(s)
total: 2668.78 ms executing: 1672.28 ms sleeping: 934.03 ms overhead
62.47 ms
CPU 29
310 task(s)
total: 2668.81 ms executing: 1676.65 ms sleeping: 932.21 ms overhead
59.94 ms
CPU 30
331 task(s)
total: 2668.86 ms executing: 1695.94 ms sleeping: 913.10 ms overhead
59.82 ms
CPU 31
326 task(s)
total: 2668.89 ms executing: 1720.97 ms sleeping: 889.12 ms overhead
58.80 ms
CPU 32
329 task(s)
total: 2668.92 ms executing: 1702.06 ms sleeping: 892.67 ms overhead
74.19 ms
CPU 33
325 task(s)
total: 2668.95 ms executing: 1678.25 ms sleeping: 930.54 ms overhead
60.15 ms
CPU 34
317 task(s)
total: 2668.97 ms executing: 1685.84 ms sleeping: 914.71 ms overhead
68.42 ms
CPU 35
324 task(s)
total: 2669.04 ms executing: 1690.27 ms sleeping: 923.52 ms overhead
55.25 ms
CPU 36
316 task(s)
total: 2669.07 ms executing: 1686.69 ms sleeping: 930.96 ms overhead
51.42 ms
CPU 37
328 task(s)
total: 2669.10 ms executing: 1690.62 ms sleeping: 925.62 ms overhead
52.86 ms
CPU 38
330 task(s)
total: 2669.12 ms executing: 1695.72 ms sleeping: 923.74 ms overhead
49.66 ms
CPU 39
321 task(s)
total: 2669.15 ms executing: 1691.49 ms sleeping: 929.12 ms overhead
48.55 ms
CPU 40
318 task(s)
total: 2669.18 ms executing: 1705.33 ms sleeping: 922.35 ms overhead
41.50 ms
CPU 41
319 task(s)
total: 2669.21 ms executing: 1674.24 ms sleeping: 948.42 ms overhead
46.55 ms
CPU 42
325 task(s)
total: 2669.24 ms executing: 1682.64 ms sleeping: 933.10 ms overhead
53.50 ms
CPU 43
318 task(s)
total: 2669.26 ms executing: 1689.19 ms sleeping: 935.78 ms overhead
44.30 ms
CPU 44
321 task(s)
total: 2669.29 ms executing: 1698.83 ms sleeping: 928.27 ms overhead
42.20 ms
CPU 45
319 task(s)
total: 2669.32 ms executing: 1683.90 ms sleeping: 950.11 ms overhead
35.30 ms
CPU 46
326 task(s)
total: 2669.40 ms executing: 1706.95 ms sleeping: 914.86 ms overhead
47.58 ms
CPU 47
334 task(s)
total: 2669.43 ms executing: 1698.17 ms sleeping: 914.77 ms overhead
56.49 ms
CPU 48
338 task(s)
total: 2669.46 ms executing: 1688.77 ms sleeping: 933.11 ms overhead
47.58 ms
CPU 49
322 task(s)
total: 2669.49 ms executing: 1706.35 ms sleeping: 926.21 ms overhead
36.93 ms
CPU 50
322 task(s)
total: 2669.52 ms executing: 1699.82 ms sleeping: 919.03 ms overhead
50.67 ms
CPU 51
333 task(s)
total: 2669.54 ms executing: 1700.44 ms sleeping: 922.64 ms overhead
46.46 ms
CPU 52
314 task(s)
total: 2669.57 ms executing: 1680.91 ms sleeping: 942.75 ms overhead
45.90 ms
CPU 53
333 task(s)
total: 2669.60 ms executing: 1678.68 ms sleeping: 945.10 ms overhead
45.82 ms
CPU 54
323 task(s)
total: 2669.63 ms executing: 1674.98 ms sleeping: 950.72 ms overhead
43.93 ms
CPU 55
324 task(s)
total: 2669.65 ms executing: 1678.92 ms sleeping: 949.02 ms overhead
41.72 ms
CPU 56
330 task(s)
total: 2669.68 ms executing: 1696.42 ms sleeping: 932.21 ms overhead
41.06 ms
CPU 57
335 task(s)
total: 2669.71 ms executing: 1685.21 ms sleeping: 936.13 ms overhead
48.37 ms
CPU 58
318 task(s)
total: 2669.74 ms executing: 1683.69 ms sleeping: 939.28 ms overhead
46.77 ms
CPU 59
318 task(s)
total: 2669.77 ms executing: 1666.71 ms sleeping: 944.18 ms overhead
58.88 ms
CPU 60
322 task(s)
total: 2669.79 ms executing: 1658.63 ms sleeping: 964.01 ms overhead
47.15 ms
CPU 61
325 task(s)
total: 2669.84 ms executing: 1693.81 ms sleeping: 935.17 ms overhead
40.86 ms
CPU 62
326 task(s)
total: 2669.87 ms executing: 1682.90 ms sleeping: 937.65 ms overhead
49.32 ms
CPU 63
319 task(s)
total: 2669.90 ms executing: 1668.69 ms sleeping: 950.43 ms overhead
50.78 ms
CPU 64
324 task(s)
total: 2669.93 ms executing: 1686.91 ms sleeping: 930.64 ms overhead
52.38 ms
CPU 65
315 task(s)
total: 2669.95 ms executing: 1667.98 ms sleeping: 963.51 ms overhead
38.46 ms
CPU 66
319 task(s)
total: 2669.98 ms executing: 1678.48 ms sleeping: 951.12 ms overhead
40.38 ms
CPU 67
325 task(s)
total: 2670.01 ms executing: 1676.50 ms sleeping: 954.04 ms overhead
39.46 ms
#---------------------

Status Error Time Gflop/s uplo n nb padA zerocol

-- -- 4.0319 661.4349 l 20000 420 0 -1


> On 19 Sep 2017, at 10:16, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>
> Dear Mawussi,
>
> I was curious to compare the impact of "numactl -m 1" with that of
> hbw_malloc() for StarPU. I used hbw_malloc() only for allocating the matrix,
> while "numactl -m 1" puts every data structure (even StarPU's task queues,
> data handles and synchronization objects) into the MCDRAM. Since the MCDRAM
> has a higher bandwidth but also a higher latency, I did not know whether the
> benefit of the higher bandwidth would be offset by higher latency costs on
> synchronization objects.
>
> It turns out that the global "numactl -m 1" approach gives better results
> than the matrix-only hbw_malloc() approach. The best numactl result I
> obtained (with block size 448) is about 180 GFlop/s higher than the best
> result with the matrix-only hbw_malloc():
>
> %----------------
> # CHAMELEON 0.9.1,
> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
> # Nb threads: 68
> # Nb GPUs: 0
> # Nb mpi: 1
> # PxQ: 1x1
> # NB: 448
> # IB: 32
> # eps: 1.110223e-16
> #
> # M N K/NRHS seconds Gflop/s Deviation
> 2000 2000 1 0.047 56.44 +- 2.11
> 4000 4000 1 0.101 211.59 +- 2.88
> 6000 6000 1 0.158 455.82 +- 5.43
> 8000 8000 1 0.222 770.25 +- 7.23
> 10000 10000 1 0.317 1052.57 +- 21.74
> 12000 12000 1 0.460 1251.44 +- 17.22
> 14000 14000 1 0.676 1353.75 +- 20.59
> 16000 16000 1 0.962 1420.32 +- 16.61
> 18000 18000 1 1.329 1462.70 +- 21.28
> 20000 20000 1 1.789 1490.79 +- 9.21
> %----------------
>
> Here are the test settings:
>
> - libhwloc:
> . version 1.11.7
> . no specific settings
>
> - StarPU:
> . Version: Subversion repository, branch trunk/, revision r22030
> . Compiler: Intel 17
> . Configure flags (I give a snippet from my GNU Bash script):
> %--------
> declare -a cfg
> cfg+=("--enable-shared")
> cfg+=("--disable-cuda")
> cfg+=("--disable-opencl")
> cfg+=("--disable-socl")
> cfg+=("--without-fxt")
> cfg+=("--disable-debug")
> cfg+=("--enable-fast")
> cfg+=("--disable-verbose")
> cfg+=("--disable-gcc-extensions")
> cfg+=("--disable-mpi-check")
> cfg+=("--disable-starpu-top")
> cfg+=("--disable-starpufft")
> cfg+=("--disable-build-doc")
> cfg+=("--disable-openmp")
> cfg+=("--disable-fortran")
> cfg+=("--disable-build-tests")
> cfg+=("--disable-build-examples")
> cfg+=("--enable-mpi")
> cfg+=("--enable-blas-lib=none")
> cfg+=("--disable-mlr")
> cfg+=("--enable-maxcpus=72")
> $STARPU_SRC_DIR/configure --prefix=$STARPU_INSTALL_DIR "${cfg[@]}"
> %--------
>
> - Chameleon settings:
> . compiler / mkl: Intel 17
> . cmake flags:
> cmake \
> -DCHAMELEON_ENABLE_EXAMPLE=OFF \
> -DBLAS_VERBOSE=ON \
> -DCHAMELEON_USE_CUDA=OFF \
> -DCHAMELEON_USE_MPI=ON \
> -DCHAMELEON_SIMULATION=OFF \
> -DCHAMELEON_SCHED_STARPU=ON \
> -DCMAKE_INSTALL_PREFIX=$HOME/Linalg/install-ch \
> -DCMAKE_C_COMPILER=icc \
> -DCMAKE_CXX_COMPILER=icpc \
> -DCMAKE_Fortran_COMPILER=ifort \
> -DCMAKE_BUILD_TYPE=Release \
> ../chameleon.git
>
> - Launch settings:
> STARPU_NCPU=68 STARPU_SCHED=lws numactl -m 1 \
> ./install-ch/lib/chameleon/timing/time_dpotrf_tile -N 2000:20000:2000 \
> -b 448 --niter 10
>
> Best regards,
> --
> Olivier
>
>> On 18 Sep 2017, at 23:09, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>
>> Dear Olivier,
>>
>> Thanks for taking the time to run the experiments.
>> It is quite comforting to see your results. They are quite
>> close to the numbers I got when using OpenMP.
>>
>> I re-ran the tests using the same nb as you,
>> but my performance didn't improve.
>>
>> I have noticed two main differences:
>>
>> • I am using "numactl -m 1" while you are allocating the memory via
>> hbw_malloc(). From my experiments, both ways are equivalent in terms of
>> performance.
>> • I have just noticed that my KNL is currently configured in hybrid
>> mode: 8 GB allocatable and 8 GB in cache mode. I will ask for the machine
>> to be set to flat mode and then run the experiment again.
>>
>> Did you use any specific installation flag when installing StarPU on KNL?
>>
>> Best regards,
>> --Mawussi
>>
>> From: Olivier Aumage [olivier.aumage@inria.fr]
>> Sent: Monday, September 18, 2017 4:43 PM
>> To: Mawussi Zounon
>> Cc: Samuel Thibault; Jakub Sistek; Negin Bagherpour;
>> starpu-devel@lists.gforge.inria.fr
>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>
>> Dear Mawussi,
>>
>> Thanks for the tarball and the Google sheet. I will run it to try to
>> understand what is going on.
>>
>> Today I managed to run a dpotrf test from Chameleon with native StarPU. I
>> modified Chameleon to use hbw_malloc() for the matrix. The test was run on
>> the machine Frioul from CINES
>> (https://www.cines.fr/le-supercalculateur-frioul/), with the following
>> specs:
>> - Intel KNL 7250 68-core 1.4GHz, 16GB MCDRAM (mode quad, flat)
>> - Intel icc/ifort 17 + mkl 17
>> - StarPU scheduler: lws
>>
>> The best result I got is the following one, using a blocksize of 424,
>> reaching about 1.3TFlop/s for 20000x20000:
>> #---------------
>> #
>> # CHAMELEON 0.9.1,
>> /home/cvtoauma/Linalg/install-ch/lib/chameleon/timing/time_dpotrf_tile
>> # Nb threads: 68
>> # Nb GPUs: 0
>> # Nb mpi: 1
>> # PxQ: 1x1
>> # NB: 424
>> # IB: 32
>> # eps: 1.110223e-16
>> #
>> # M N K/NRHS seconds Gflop/s Deviation
>> 2000 2000 1 0.045 59.15 4.64
>> 4000 4000 1 0.093 229.03 3.86
>> 6000 6000 1 0.152 472.40 5.69
>> 8000 8000 1 0.230 742.36 9.10
>> 10000 10000 1 0.351 950.89 5.99
>> 12000 12000 1 0.537 1072.95 11.30
>> 14000 14000 1 0.783 1167.68 10.11
>> 16000 16000 1 1.108 1232.57 6.83
>> 18000 18000 1 1.543 1259.95 7.27
>> 20000 20000 1 2.037 1309.28 6.66
>> #---------------
>>
>> The results are highly sensitive to the block size. I also attach a plot
>> showing the performance for various block sizes. It seems I used smaller
>> blocks than in your tests. I do not know at this time whether this is the
>> main explanation for the performance difference we see or not.
>>
>> I will study the data you sent me and come back to you asap.
>>
>> Thanks again.
>> Best regards,
>> --
>> Olivier
>>
>>
>>
>>> On 18 Sep 2017, at 17:15, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>
>>> Dear Olivier,
>>> Please find attached the tarball of the StarPU version of PLASMA.
>>> The compilation should be straightforward, but the make.inc can be
>>> customized.
>>> The tests are in the directory "test".
>>> To test dgemm for example, you can run:
>>> STARPU_SCHED=strategy numactl -m 1 ./test dgemm --dim=size
>>> --nb=block_size --test=n
>>>
>>> "numactl -m 1" allocates the memory in the MCDRAM
>>> "--dim" specifies the size of the problem
>>> "--nb" specifies the block size
>>> "--test=n" disables testing, to save benchmark time.
>>>
>>> In general ./test "routine_name" --help will give you more details.
>>>
>>> I simply downloaded Netlib LAPACK and linked it to MKL17 BLAS.
>>>
>>> I also shared a Google sheet with you to give you an idea of the optimal NBs.
>>>
>>> Best regards,
>>>
>>> --Mawussi
>>>
>>> ________________________________________
>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>> Sent: Monday, September 18, 2017 9:55 AM
>>> To: Mawussi Zounon
>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>
>>> Hi Mawussi,
>>>
>>> I would like to try to reproduce your results with native StarPU and with
>>> LAPACK on KNL, to hopefully narrow down the search space for
>>> possible explanations. Is it possible to have a tar of your current
>>> native StarPU port, with the 'configure' options you use, the environment
>>> variables (if any), and the testing program?
>>>
>>> Regarding the 'LAPACK' test case on the dpotrf.KNL plot, did you use the
>>> current version on Netlib or is it a modified version? Could you give me
>>> the Makefile settings and environment variables if any?
>>>
>>> For allocating the matrix in MCDRAM, do you use hbw_malloc() from the
>>> MemKind library, or do you use some other means? Do you get similar,
>>> better or worse results when the MCDRAM is in 'cache' mode?
>>>
>>> Thanks in advance.
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>> On 15 Sep 2017, at 11:49, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>>
>>>> Dear Olivier,
>>>>
>>>> Thanks for pointing out the difference between KSTAR and native StarPU.
>>>> I thought KSTAR performed a source-to-source compilation, completely
>>>> replacing all OpenMP features with StarPU equivalents.
>>>> But from your explanation, I have the impression that some OpenMP
>>>> features remain in the executable produced by KSTAR, and that this
>>>> impacts the behaviour of StarPU.
>>>> Could you please provide us with a relevant reference on KSTAR for a
>>>> better understanding of how it works?
>>>>
>>>> The behaviour of the main thread seems a reasonable option to
>>>> investigate in order to reduce the performance penalty
>>>> of the native StarPU code for small matrices.
>>>>
>>>> On KNL, when using the MCDRAM, we used to observe some performance drop
>>>> even for MKL beyond certain matrix sizes. We have some potential
>>>> explanations but we need further experiments for confirmation.
>>>> However, even for reasonably small matrices, both the native StarPU
>>>> and the KSTAR-generated code fail
>>>> to exploit the 68 cores efficiently, and their performance can even be
>>>> worse than LAPACK.
>>>> I think we should pay close attention to the behaviour of StarPU on
>>>> the Intel self-hosted KNL.
>>>>
>>>> Regarding the question on the choice of the block size, I reported only
>>>> the auto-tuned results.
>>>> For each executable (OMP, STARPU, KSTAR) and each matrix size, we
>>>> perform a sweep over a large "block size" space. In general, for a given
>>>> matrix size, OMP, STARPU, and KSTAR achieve the highest performance for
>>>> almost the same "block size".
>>>> But for some routines, larger "block sizes" (fewer tasks) seem to benefit
>>>> the native StarPU.
>>>>
>>>> If it can help, find attached some results on an Intel Broadwell, a
>>>> 28-core NUMA node (14 cores per socket).
>>>> These results are very similar to those obtained on the 20-core
>>>> Haswell. OMP, KSTAR and STARPU
>>>> have a similar asymptotic performance, while the native StarPU is
>>>> penalized for small matrices.
>>>> To some extent, this confirms that the results on KNL have some serious
>>>> issues that are worth
>>>> investigating.
>>>>
>>>> Best regards,
>>>> --Mawussi
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________
>>>> From: Olivier Aumage [olivier.aumage@inria.fr]
>>>> Sent: Thursday, September 14, 2017 2:51 PM
>>>> To: Mawussi Zounon
>>>> Cc: starpu-devel@lists.gforge.inria.fr; Samuel Thibault; Jakub Sistek;
>>>> Negin Bagherpour
>>>> Subject: Re: Performance issue of StarPU on Intel self-hosted KNL
>>>>
>>>> Hi Mawussi,
>>>>
>>>> The main difference between StarPU used alone and StarPU used through
>>>> the OpenMP compatibility layer concerns the 'main' thread of the
>>>> application:
>>>>
>>>> - When StarPU is used alone on an N-core machine, it launches N worker
>>>> threads bound to the N cores. (Thus there is a total of N+1 threads: 1
>>>> main application thread + N StarPU workers.) The main application thread
>>>> is only used for submitting tasks to StarPU; it does not execute tasks
>>>> and it is not bound to a core unless the application binds it explicitly.
>>>>
>>>> - When StarPU is used through the OpenMP compatibility layer on an
>>>> N-core machine, it launches N-1 worker threads bound to N-1 cores. (Thus
>>>> there is a total of N threads: 1 main application thread + N-1 StarPU
>>>> workers.) The main application thread is used for submitting tasks _and_
>>>> for participating in task execution while blocked on some barrier (e.g.
>>>> omp taskwait, implicit barriers at the end of parallel regions, ...). This
>>>> behaviour is required for compliance with the OpenMP execution model.
>>>>
>>>> I am not sure whether this difference is the sole cause of the
>>>> performance mismatch you observed, but it probably accounts for a
>>>> significant part at least, generally in favor of StarPU+OpenMP. This
>>>> may be the main factor for the difference on small matrices, where the
>>>> main thread of the StarPU+OpenMP version can quickly lend a hand with the
>>>> computation, while the main thread for the native StarPU version must
>>>> first be de-scheduled by the OS kernel to leave its core to a worker
>>>> thread.
>>>>
>>>> On the other hand, the management of tasks for StarPU+OpenMP is more
>>>> expensive than the management of StarPU native tasks, due to the fact
>>>> that StarPU+OpenMP tasks may block while native StarPU tasks never
>>>> block. This management difference is therefore in favor of the native
>>>> StarPU version. This additional management cost is perhaps more
>>>> significant on the KNL, where the cores are much less advanced than
>>>> regular Intel Xeon cores.
>>>>
>>>> I do not know why the GFLOPS drop sharply for dgemm.KNL for matrix sizes
>>>> of 14000 and above. I could only think of some NUMA issue, but this should
>>>> not be the case since you say that the matrix is allocated in MCDRAM.
>>>>
>>>> I do not know either why the StarPU+OpenMP plot and the native StarPU
>>>> plot cross for the dpotrf.KNL test case.
>>>>
>>>> How do you choose the block size for each test sample? Is it fixed for
>>>> all matrix sizes or is it computed from the matrix size? Do you observe
>>>> very different behaviour for other block sizes (e.g. fewer tasks on
>>>> large blocks, more tasks on small blocks, ...)?
>>>>
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>>> On 14 Sep 2017, at 11:00, Mawussi Zounon <mawussi.zounon@manchester.ac.uk> wrote:
>>>>>
>>>>>
>>>>> Dear all,
>>>>>
>>>>> Recently we developed a new version of the PLASMA library fully based
>>>>> on the OpenMP task-based
>>>>> runtime system. Our benchmarks on both regular Intel Xeon (Haswell and
>>>>> Broadwell in the experiment) and Intel KNL showed that the new OpenMP
>>>>> PLASMA has a performance comparable to the old version based on QUARK.
>>>>> This motivated us to extend the experiment to StarPU.
>>>>> To this end, on the one hand we used KSTAR to generate a StarPU version of
>>>>> PLASMA. On the other hand we developed another version of PLASMA
>>>>> (restricted to a few routines) based on StarPU.
>>>>> It is important to note that the algorithms are the same; we simply
>>>>> replaced the task-based runtime system. Below are our findings:
>>>>>
>>>>> • On regular Intel Xeon architectures, PLASMA_OpenMP (OMP),
>>>>> PLASMA_KSTAR (KSTAR), and PLASMA_HAND_WRITTEN_STARPU (STARPU) have
>>>>> comparable performance, except for very small matrices, where our
>>>>> hand-written StarPU version of PLASMA is outperformed by the generic
>>>>> KSTAR.
>>>>> • On the Intel self-hosted KNL (68 cores), both our own STARPU
>>>>> version and KSTAR are significantly slower than OMP. But again,
>>>>> KSTAR and our StarPU version exhibited different performance
>>>>> behaviour.
>>>>>
>>>>> I am wondering whether you can provide us with some hints or guidance
>>>>> to improve the performance of StarPU on the Intel KNL architecture.
>>>>> There might be some configuration options I missed. In addition, I would
>>>>> be happy if you could help us understand why our StarPU version seems
>>>>> more penalized for small matrices while KSTAR seems to be doing
>>>>> relatively better.
>>>>>
>>>>> Below are some performance charts of dgemm and Cholesky (dpotrf) to
>>>>> illustrate our observations:
>>>>> <dgemm_haswell_rutimes.png>
>>>>>
>>>>>
>>>>> <dpotrf_haswell_rutimes.png>
>>>>>
>>>>> <dgemm_knl_rutimes.png>
>>>>>
>>>>> <dpotrf_knl_rutimes.png>
>>>>>
>>>>> For the experiments on KNL, the matrices have been allocated in the
>>>>> MCDRAM.
>>>>>
>>>>> I am looking forward to hearing from you.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --Mawussi
>>>>
>>>> <dgemm_broadwell_runtimes.png><dpotrf_broadwell_runtimes.png>
>>>
>>> <plasma_starpu.tar>
>



