
starpu-devel - [starpu-devel] Suspicious check in the cuda memory allocation routine

List subject: Developers list for StarPU




  • From: David <dstrelak@cnb.csic.es>
  • To: starpu-devel@inria.fr
  • Subject: [starpu-devel] Suspicious check in the cuda memory allocation routine
  • Date: Wed, 12 May 2021 13:04:43 +0200

Dear Sir or Madam,


We are currently testing StarPU v1.3.7.

We have noticed that when StarPU tries to allocate memory on a CUDA device, it performs the following check:

    if (status == cudaSuccess && cuda_mem_free < (size*2))
    {
        addr = 0;
    }

(see https://gitlab.inria.fr/starpu/starpu/-/blame/starpu-1.3.7/src/drivers/cuda/driver_cuda.c#L1757)


Notice that StarPU checks whether twice as much memory as is being requested is currently available.

In our use case, we tried to allocate a block of roughly 2 GB on a card with 5.9 GB of memory, but less than 4 GB was free at that time, so the condition was true and no memory was allocated.

We then saw the following message:
[starpu][_starpu_memory_reclaim_generic] Not enough memory left on node CUDA 0. Your application data set seems too huge to fit on the device, StarPU will cope by trying to purge 4313 MiB out. This message will not be printed again for further purges

The application then got stuck (it kept polling) in the memory allocation routine (our implementation of starpu_data_interface_ops.allocate_data_on_node, included at the end of this email for reference).

Is the '2' in 'cuda_mem_free < (size*2)' intentional? We did not find anything about it in the documentation.

Thanks in advance for your answer.


Kind regards,

David Strelak


------------------------

#include <cerrno>
#include <starpu.h>

template<typename T>
static starpu_ssize_t payload_allocate_data_on_node(void *data_interface, unsigned node)
{
  auto *interface = reinterpret_cast<umpalumpa::data::Payload<T> *>(data_interface);

  // Allocate a buffer of the payload's size on the given memory node
  starpu_ssize_t requested_memory = interface->dataInfo.bytes;
  void *data = reinterpret_cast<void *>(starpu_malloc_on_node(node, requested_memory));
  // starpu_malloc_on_node() returns 0 on failure; report it to StarPU
  if (nullptr == data) return -ENOMEM;
  /* update the data properly in consequence */
  interface->data = data;
  return requested_memory;
}






