Subject: Developers list for StarPU
List archives
- From: Yizhou Qian <adncat@stanford.edu>
- To: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
- Subject: Re: [Starpu-devel] Running Cholesky_implicit on 2 gpu nodes
- Date: Wed, 6 Nov 2019 06:58:38 +0000
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Thanks! I tried using 4 nodes, first with 8 cores per node and then with 24 cores per node, on a matrix of size 20000 (1000*20). However, the results show that using 24 cores per node is slower than using 8 cores per node. Is this normal, or have I missed something when running the code? The exact command is:
STARPU_SCHED=dmdas mpirun ./starpu-1.3.2/mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((1000*20)) -nblocks 20
The outputs are shown below:
Results with 4 nodes and 8 cores per node:
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 24 != (current) 8), recalibrating...
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 24 != (current) 8), recalibrating...
[starpu][compare_value_and_recalibrate] ... done
[starpu][compare_value_and_recalibrate] ... done
[starpu][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][check_bus_config_file] ... done
[starpu][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][check_bus_config_file] ... done
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 9 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_21 on cpu0_impl0 (Comb0): 80135.622000 vs average 47935.153600, 5 such errors against …
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 2473644.139000 vs average 5214670.073000, 1 such errors agai…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 4453137.907000 vs average 1652110.597000, 1 such errors agai…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 4959215.396000 vs average 1844665.254000, 1 such errors agai…
[1572985659.293172] [sh-107-41:205341:0] mpool.c:38 UCX WARN object 0x19c52c0 was not returned to mpool ucp_requests
Computation time (in ms): 31129.52
Synthetic GFlops : 85.66
Results with 4 nodes and 24 cores per node:
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 8 != (current) 24), recalibrating...
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 8 != (current) 24), recalibrating...
[starpu][compare_value_and_recalibrate] ... done
[starpu][compare_value_and_recalibrate] ... done
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 1 m…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 2187839.665000 vs average 209929.500000, 1 such errors again…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[1572987914.829063] [sh-107-45:226943:0] mpool.c:38 UCX WARN object 0x1274b80 was not returned to mpool ucp_requests
Computation time (in ms): 41708.27
Synthetic GFlops : 63.94
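One thing I notice in these logs is the stream of "not calibrated enough" warnings, so the performance models may still have been calibrating during the timed runs. Would it help to force a calibration pass first and only time a later run? For example (a sketch, assuming the STARPU_CALIBRATE environment variable applies here):
STARPU_CALIBRATE=1 STARPU_SCHED=dmdas mpirun ./starpu-1.3.2/mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((1000*20)) -nblocks 20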
Thanks again!
Best,
Yizhou
From: Samuel Thibault <samuel.thibault@inria.fr>
Sent: Sunday, October 27, 2019 8:40 AM
To: Yizhou Qian <adncat@stanford.edu>
Subject: Re: [Starpu-devel] Running Cholesky_implicit on 2 gpu nodes
Hello,
Just FYI, if you want a detailed response, you should mail starpu-devel so Nathalie can also answer; I don't necessarily have the time to delve into details. Also, that would help Nathalie write useful documentation for all users.
Yizhou Qian, on Sat, 26 Oct 2019 23:15:51 +0000, wrote:
> However, somehow the program is in serialized state
> [starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level =
> MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a
> time.
This only means the MPI calls are serialized. Please read about
MPI_THREAD_SERIALIZED; that's a limitation of your MPI implementation.
StarPU uses serialized communications anyway, but it means you can't do
MPI yourself in your application.
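For illustration, here is roughly what the negotiation looks like on the application side (a minimal standalone sketch, not StarPU's actual code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;
    /* Request the most permissive level; the implementation may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_SERIALIZED)
        /* Several threads may make MPI calls, but only one at a time,
         * so the application must not issue its own MPI calls
         * concurrently with StarPU's communication thread. */
        printf("MPI only grants MPI_THREAD_SERIALIZED\n");
    MPI_Finalize();
    return 0;
}

If provided comes back lower than what you requested, any MPI traffic of your own has to be funneled accordingly.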
> and there was some kind of out of memory issue?
It seems so:
> slurmstepd: error: Detected 11 oom-kill event(s) in step 53469501.batch cgroup.
> Some of your processes may have been killed by the cgroup out-of-memory
> handler.
I don't see why that could happen; 57600 is not particularly large. Just
in case, you could try smaller values.
> size: 57600 - nblocks: 60 - dblocksx: 8 - dblocksy: 6
Are you really running on 48 nodes?
Are you perhaps telling MPI to run one process per core? Then it's no
wonder it fills the memory: that version of MPI cholesky allocates the
whole matrix on each node. To make each node only allocate its piece of
the matrix, use mpi_cholesky_distributed. That being said, you really
do not want to run one process per core, that would completely kill the
usefulness of StarPU. Have MPI run one process per machine, so StarPU
can handle intra-machine parallelism.
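For scale: assuming the example's single-precision matrices, a 57600x57600 matrix alone takes 57600^2 * 4 bytes, about 13 GB, so with one process per core the non-distributed version multiplies that by the number of cores on each node. Launching one process per machine would look something like this (a sketch using Open MPI syntax; the exact flags depend on your MPI implementation and batch system):
mpirun -np <number of nodes> --map-by ppr:1:node ./mpi_cholesky_distributed -size 57600 -nblocks 60
so that StarPU can drive all the cores of each machine from within a single process.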
Samuel