Subject: Developers list for StarPU
List archives
- From: Yizhou Qian <adncat@stanford.edu>
- To: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
- Subject: Re: [Starpu-devel] Running Cholesky_implicit on 2 gpu nodes
- Date: Wed, 6 Nov 2019 06:58:38 +0000
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Thanks! I tried using 4 nodes, first with 8 cores per node and then with 24 cores per node, on a matrix of size 20000 (1000*20). However, the results show that using 24 cores per node is slower than using 8 cores per node. Is this normal, or have I missed something when running the code? The exact command is:
STARPU_SCHED=dmdas mpirun ./starpu-1.3.2/mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((1000*20)) -nblocks 20
The outputs are shown below:
Results with 4 nodes and 8 cores per node:
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 24 != (current) 8), recalibrating...
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 24 != (current) 8), recalibrating...
[starpu][compare_value_and_recalibrate] ... done
[starpu][compare_value_and_recalibrate] ... done
[starpu][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][check_bus_config_file] ... done
[starpu][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][check_bus_config_file] ... done
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 9 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_21 on cpu0_impl0 (Comb0): 80135.622000 vs average 47935.153600, 5 such errors against …
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 2473644.139000 vs average 5214670.073000, 1 such errors agai…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 4453137.907000 vs average 1652110.597000, 1 such errors agai…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 4959215.396000 vs average 1844665.254000, 1 such errors agai…
[1572985659.293172] [sh-107-41:205341:0] mpool.c:38 UCX WARN object 0x19c52c0 was not returned to mpool ucp_requests
Computation time (in ms): 31129.52
Synthetic GFlops : 85.66
Results with 4 nodes and 24 cores per node:
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 8 != (current) 24), recalibrating...
[starpu][compare_value_and_recalibrate] Current configuration does not match the bus performance model (CPUS: (stored) 8 != (current) 24), recalibrating...
[starpu][compare_value_and_recalibrate] ... done
[starpu][compare_value_and_recalibrate] ... done
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
size: 20000 - nblocks: 20 - dblocksx: 2 - dblocksy: 2
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 1 m…
[starpu][_starpu_update_perfmodel_history] Too big deviation for model chol_model_11 on cpu0_impl0 (Comb0): 2187839.665000 vs average 209929.500000, 1 such errors again…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_22 is not calibrated enough for cpu0_impl0 (Comb0) size 12000000 footprint 67bed6a9 (only 0 …
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_11 is not calibrated enough for cpu0_impl0 (Comb0) size 4000000 footprint a45f87dc (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[starpu][_starpu_history_based_job_expected_perf] Warning: model chol_model_21 is not calibrated enough for cpu0_impl0 (Comb0) size 8000000 footprint d4b37584 (only 0 m…
[1572987914.829063] [sh-107-45:226943:0] mpool.c:38 UCX WARN object 0x1274b80 was not returned to mpool ucp_requests
Computation time (in ms): 41708.27
Synthetic GFlops : 63.94
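One thing I notice in these logs is the stream of "not calibrated enough" warnings, so the performance models may still have been calibrating during the timed runs. Would it help to force a calibration pass first and only time a later run? For example (a sketch, assuming the STARPU_CALIBRATE environment variable applies here):
STARPU_CALIBRATE=1 STARPU_SCHED=dmdas mpirun ./starpu-1.3.2/mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((1000*20)) -nblocks 20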
Thanks again!
Best,
Yizhou
From: Samuel Thibault <samuel.thibault@inria.fr>
Sent: Sunday, October 27, 2019 8:40 AM
To: Yizhou Qian <adncat@stanford.edu>
Subject: Re: [Starpu-devel] Running Cholesky_implicit on 2 gpu nodes
Hello,
Just FYI, if you want a detailed response, you should mail starpu-devel so Nathalie can also answer; I don't necessarily have the time to delve into details. Also, that would help Nathalie write useful documentation for all users.
Yizhou Qian, on Sat, 26 Oct 2019 23:15:51 +0000, wrote:
> However, somehow the program is in serialized state
> [starpu][_starpu_mpi_print_thread_level_support] MPI_Init_thread level =
> MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a
> time.
This only means the MPI calls are serialized. Please read about
MPI_THREAD_SERIALIZED; that's a limitation of your MPI implementation.
StarPU uses serialized communications anyway, but it means you can't do
MPI yourself in your application.
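For illustration, here is roughly what the negotiation looks like on the application side (a minimal standalone sketch, not StarPU's actual code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;
    /* Request the most permissive level; the implementation may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_SERIALIZED)
        /* Several threads may make MPI calls, but only one at a time,
         * so the application must not issue its own MPI calls
         * concurrently with StarPU's communication thread. */
        printf("MPI only grants MPI_THREAD_SERIALIZED\n");
    MPI_Finalize();
    return 0;
}

If provided comes back lower than what you requested, any MPI traffic of your own has to be funneled accordingly.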
> and there was some kind of out of memory issue?
It seems so:
> slurmstepd: error: Detected 11 oom-kill event(s) in step 53469501.batch cgroup.
> Some of your processes may have been killed by the cgroup out-of-memory
> handler.
I don't see why that could happen; 57600 is not particularly large. Just
in case, you could try smaller values.
> size: 57600 - nblocks: 60 - dblocksx: 8 - dblocksy: 6
Are you really running on 48 nodes?
Are you perhaps telling MPI to run one process per core? Then it's no
wonder it fills the memory: that version of MPI cholesky allocates the
whole matrix on each node. To make each node only allocate its piece of
the matrix, use mpi_cholesky_distributed. That being said, you really
do not want to run one process per core, that would completely kill the
usefulness of StarPU. Have MPI run one process per machine, so StarPU
can handle intra-machine parallelism.
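For scale: assuming the example's single-precision matrices, a 57600x57600 matrix alone takes 57600^2 * 4 bytes, about 13 GB, so with one process per core the non-distributed version multiplies that by the number of cores on each node. Launching one process per machine would look something like this (a sketch using Open MPI syntax; the exact flags depend on your MPI implementation and batch system):
mpirun -np <number of nodes> --map-by ppr:1:node ./mpi_cholesky_distributed -size 57600 -nblocks 60
so that StarPU can drive all the cores of each machine from within a single process.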
Samuel