

Subject: Developers list for StarPU

List archives

Re: [starpu-devel] Suspicious behavior of StarPU dmdar scheduler


  • From: David <dstrelak@cnb.csic.es>
  • To: starpu-devel@inria.fr
  • Cc: Jiri Filipovic <fila@ics.muni.cz>
  • Subject: Re: [starpu-devel] Suspicious behavior of StarPU dmdar scheduler
  • Date: Mon, 18 Apr 2022 10:28:41 +0200

Hi,


We are still experiencing some weird behavior with the StarPU schedulers.

We would be interested in showing you our application and the issues we're encountering.

Could we meet online?

We can also provide you with access to our test machine.


We are looking forward to your answer.


Kind regards,


David Strelak


On 23/2/22 10:01, David wrote:
Hi,


We have tried changing the beta (2/100/1000), as well as wrapping the task-submission loops with the starpu_pause/resume functions.

However, we have seen little to no change.


We have further analyzed our program using the task graph and task-details logs, as well as Nsight Systems, and we did not find anything obviously wrong.

With other configurations (namely a single GPU worker, GTX 1070, no CPU worker), we have noticed another strange behavior: it seems that StarPU refuses to release memory from already-finished tasks, which results in memory thrashing later on, and in some cases even a deadlock on memory allocation. It is similar to the oneGPUAllCPU case we have shown you previously.


In case it's related, we're using StarPU 1.3.7; however, when we tried the StarPU master branch (from today), our program crashed with the following message:

../../src/datawizard/coherency.c:326: _starpu_determine_request_path: Assertion `src_node >= 0 && dst_node >= 0' failed.


In any case, we wonder if you would be willing to spend some time on a video call with us, and we would show you our findings.

If yes, please propose a few one- to two-hour time slots.


KR,


David Strelak



On 21/2/22 14:25, Samuel Thibault wrote:
Hello,

David, on Wed, 9 Feb 2022 18:57:33 +0100, wrote:
The program is PCI-e bound, i.e. the data transfer time is one of the
crucial parameters which has to be taken into account by the scheduler.
Indeed, it seems that most of the time is spent just transferring
data. See the attached snapshot of the noGPUallCPU.trace file: the GPU is
almost always in the purple FetchingInput state, so it is just waiting
for data to arrive.

1. use case: single GPU, no CPU (oneGPUnoCPU):

execution time: 7.2s

2. use case: use both GPUs, no CPU (twoGPUnoCPU):

execution time: 7.6s
I am not really surprised. Depending on the details of the actual
motherboard, data transfers may not be achievable in parallel at all,
and then no gain is to be expected :/

Our questions are:

1. Do you think that the dmdar scheduler is not able to correctly take into
account memory transfers, and thus under-utilizes GPU 2 in case 2 and case
3, and causes prolonged computation in case 5?
dmdar guesstimates the amount of time that data transfers will take; the
actual time depends a lot on the CUDA state, transfer interference,
etc., so it's really only a heuristic. Here the transfers are really
the key problem, so I'm not surprised that dmdar is the best-performing
scheduler, but I'm not that surprised either that it does not manage to
find a better way of scheduling the tasks.

You can try to tune the dmdar behavior with the STARPU_SCHED_BETA
environment variable; possibly that would help.
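For instance, something along these lines (a sketch only: ./my_app is a placeholder for the actual application binary, and the beta value shown is just one point to experiment with):

```shell
# Select the dmdar scheduler and raise the weight of the data-transfer
# term in its cost model (the default beta is 1).
export STARPU_SCHED=dmdar
export STARPU_SCHED_BETA=10
# STARPU_CALIBRATE=1 would additionally force recalibration of the
# performance models the estimates are based on.
echo "STARPU_SCHED=$STARPU_SCHED STARPU_SCHED_BETA=$STARPU_SCHED_BETA"
# ./my_app
```

Trying a few beta values (as done above with 2/100/1000) and comparing the traces is usually the quickest way to see whether the transfer-cost term is the limiting factor.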

2. How can we verify that it is the scheduler to be blamed, and not some
problem in our code?
On 1 GPU the ViTE trace shows that the GPU is kept busy transferring
data, so it's not really the way tasks are submitted that is the problem,
but the sheer amount of data to transfer over the PCI bus. For this kind
of application you will really want an NVLink between your main memory
and the GPU. That means moving to the PowerPC architecture, since x86
doesn't allow such a RAM/GPU NVLink.

Now, on the 2-GPU trace we'd expect StarPU to try to execute
FFTStarPU in parallel on the two GPUs, and that's not the case: cuda0
is left idle (red), which is not expected. What I notice is that the
number of submitted tasks (the white line at the top of the trace) is
still increasing during the iteration, i.e. not all tasks have actually
been submitted at this point, and indeed in the UserThread14597 bar at the
bottom there are "Submitting task" states showing up (one has to zoom in
to see them). So possibly StarPU just doesn't know yet that there are
other FFTStarPU tasks to do, since they haven't been submitted yet.

So there is probably a problem in your code: something that seems
to be synchronizing with StarPU, or otherwise taking a lot of time,
thus delaying task submission. For a start, you can try using
starpu_pause()/starpu_resume() around your task submission, to make sure
all submissions are done before StarPU starts executing tasks. Possibly
your program will then hang because it is trying to synchronize with the
execution of tasks, but then you'll see where it gets stuck, which is
where it shouldn't actually be trying to synchronize.

3. Does StarPU try to transfer data between multiple GPUs in parallel?
Yes, it tries to transfer everything it can in parallel.

Samuel


