- From: Kevin Juilly <kevin.juilly@eolen.com>
- To: <starpu-devel@lists.gforge.inria.fr>
- Subject: [Starpu-devel] automatic RAM allocation and CUDA worker issue
- Date: Mon, 13 Nov 2017 14:28:31 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hello,
On a node with a GPU, if a program asks for more memory than the GPU has, with the data registered with -1 as home_node and all workers but the CUDA worker disabled, StarPU 1.2 aborts on an assertion (see assert.log), even though the memory needed by any single task is well under the size of the GPU memory.
The problem doesn't seem to occur when STARPU_PREFETCH=0.
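For reference, here is a minimal sketch of forcing that workaround from inside the application, assuming it is enough to set STARPU_PREFETCH before starpu_init() reads the environment; the rest of the code is only illustrative:

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Workaround sketch: disable scheduler data prefetching before
     * StarPU reads its environment, since the assertion does not seem
     * to trigger with STARPU_PREFETCH=0. */
    setenv("STARPU_PREFETCH", "0", 1);

    if (starpu_init(NULL) != 0)
        return 1;

    /* ... register data and submit tasks as in the reproducer below ... */

    starpu_shutdown();
    return 0;
}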
A reproducer is attached. The code allocates a large number of square matrices and launches tasks on them. The tasks themselves do no work and only provide a cuda_func; their only purpose is to force memory management to happen on the GPU.
The program takes two optional parameters:
- the size of the matrices
- the number of matrices
When no arguments are given, it tries to allocate enough matrices to use 3 times the size of the GPU memory. The assertion does not trigger when less is allocated, even if the total is still larger than the GPU memory.
This reproduces the memory behaviour of the test case that triggered the bug.
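To make the memory footprint reasoning explicit, here is a small sketch of the arithmetic, assuming the matrices hold doubles and that cudaMemGetInfo() is used to query the GPU memory as in the reproducer; the helper name is only illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

/* Rough footprint check: n square matrices of x*x doubles each. */
static void report_footprint(size_t x, size_t n)
{
    size_t gpu_free, gpu_total;
    cudaMemGetInfo(&gpu_free, &gpu_total);

    size_t total_bytes = n * x * x * sizeof(double);
    printf("registered data: %zu MB, GPU memory: %zu MB (ratio %.1f)\n",
           total_bytes >> 20, gpu_total >> 20,
           (double)total_bytes / (double)gpu_total);
}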
Also attached you will find a config.log extract and the list of StarPU environment variables used.
As a note, the same case produced incorrect behaviour with StarPU 1.1: the worker appeared to be stuck in an infinite loop inside _starpu_fetch_task_input (calling functions that try to free memory). I have not been able to reproduce it recently and have no idea why.
Regards,
Kevin Juilly
AS+ groupe Eolen
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_select_src_node+0x1fe)[0x7fa750b732ee]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_create_request_to_fetch_data+0x37e)[0x7fa750b73c4e]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(+0x7bede)[0x7fa750b84ede]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(+0x78890)[0x7fa750b81890]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_memory_reclaim_generic+0x80)[0x7fa750b80fc0]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(+0x7af8a)[0x7fa750b83f8a]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_allocate_memory_on_node+0x5e)[0x7fa750b831ce]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_driver_copy_data_1_to_1+0x91)[0x7fa750b795e1]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(+0x6e15a)[0x7fa750b7715a]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(__starpu_datawizard_progress+0x39)[0x7fa750b78e49]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_wait_data_request_completion+0x4a)[0x7fa750b7690a]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_fetch_data_on_node+0x168)[0x7fa750b74ba8]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_fetch_task_input+0x185)[0x7fa750b75565]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(+0xa9a05)[0x7fa750bb2a05]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_cuda_driver_run_once+0x483)[0x7fa750bb26f3]
/home/kjuilly/test-starpu-alloc/installs/spu-122/lib/libstarpu-1.2.so.2(_starpu_cuda_worker+0x1d)[0x7fa750bb2fed]
/lib64/libpthread.so.0(+0x7dc5)[0x7fa74d402dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fa74ed00ced]
spu-122-test: ../../starpu-1.2.2/src/datawizard/coherency.c:103: int _starpu_select_src_node(starpu_data_handle_t, unsigned int): Assertion `handle->per_node[src_node].initialized' failed.
#include <starpu.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <cuda_runtime.h>

/* Dummy kernels: the tasks do no work, they only force data management
 * to happen on the GPU. */
void useless(void **a, void *b) {}
void noop(void **a, void *b) {}

struct starpu_codelet cl = {
    .cuda_func = useless,
    .nbuffers = 3,
    .modes = { STARPU_RW, STARPU_R, STARPU_R }
};

static struct starpu_codelet fake_init_cl = {
    .cuda_funcs = { noop },
    .nbuffers = 1
};

int main(int argc, char **argv)
{
    size_t x = 100;
    size_t n = 22;

    if (argc > 2) {
        x = atoi(argv[1]);
        n = atoi(argv[2]);
    } else {
        /* No arguments: pick matrix size and count so that the total
         * registered data is roughly 3 times the GPU memory. */
        size_t free, total;
        cudaMemGetInfo(&free, &total);
        total /= 8;
        total /= 22;
        size_t temp_total;
        size_t nb = 1;
        size_t len;
        while (1) {
            temp_total = total * 3 / nb;
            len = sqrt(temp_total);
            if (len < 2000)
                break;
            nb++;
        }
        x = len;
        n = nb * 22;
    }

    starpu_init(NULL);

    /* Register the matrices with -1 as home_node: no initial copy in RAM. */
    starpu_data_handle_t *handles = malloc(n * sizeof *handles);
    for (size_t i = 0; i < n; i++) {
        starpu_matrix_data_register(handles + i, -1, (uintptr_t)0, x, x, x, sizeof(double));
        starpu_data_set_reduction_methods(handles[i], 0, &fake_init_cl);
    }

    /* Submit many tasks on random triples of handles. */
    for (int it = 0; it < 1 << 15; it++) {
        int i, j, k;
        i = rand() % n;
        j = rand() % n;
        k = rand() % n;
        starpu_insert_task(&cl,
                           STARPU_RW, handles[i],
                           STARPU_R, handles[j],
                           STARPU_R, handles[k],
                           0);
    }

    starpu_task_wait_for_all();

    for (size_t i = 0; i < n; i++)
        starpu_data_unregister(handles[i]);

    starpu_shutdown();
}

CC=clang CXX=clang++ ../starpu-1.2.2/configure \
    --prefix=/home/kjuilly/test-starpu-alloc/build-spu-122/../installs/spu-122 \
    --with-cuda-dir=/usr/local/cuda-7.5 --disable-build-doc --without-mpicc \
    --disable-fortran
CUDA enabled: yes
OpenCL enabled: yes
SCC enabled: no
MIC enabled: no
Compile-time limits
(change these with --enable-maxcpus, --enable-maxcudadev,
--enable-maxopencldev, --enable-maxmicdev, --enable-maxnodes,
--enable-maxbuffers)
(Note these numbers do not represent the number of detected
devices, but the maximum number of devices StarPU can manage)
Maximum number of CPUs: 64
Maximum number of CUDA devices: 4
Maximum number of OpenCL devices: 8
Maximum number of SCC devices: 0
Maximum number of MIC threads: 0
Maximum number of memory nodes: 16
Maximum number of task buffers: 8
GPU-GPU transfers: yes
Allocation cache: yes
Magma enabled: no
BLAS library: none
hwloc: yes
FxT trace enabled: no
StarPU-Top: yes
Documentation: no
Examples: yes
StarPU Extensions:
MPI enabled: no
MPI test suite: no
FFT Support: yes
GCC plug-in: no
GCC plug-in test suite (requires GNU Guile): no
OpenMP runtime support enabled: no
SOCL enabled: yes
SOCL test suite: no
Scheduler Hypervisor: no
simgrid enabled: no
ayudame enabled: no
Native fortran support: no
Native MPI fortran support:
STARPU_NCUDA=1
STARPU_OPENCL_NO_CUDA=1
STARPU_OPENCL_ON_CPUS=1
STARPU_WORKERS_CUDAID=0
STARPU_WORKERS_OPENCLID=1
STARPU_WORKER_STATS=1
STARPU_MEMORY_STATS=1
STARPU_BUS_STATS=1
STARPU_PROFILING=1
STARPU_SCHED=dmda
STARPU_NCPU=0
STARPU_NOPENCL=0
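As a side note, here is a hedged sketch of the programmatic equivalent of the worker-related variables above (STARPU_NCPU, STARPU_NCUDA, STARPU_NOPENCL, STARPU_SCHED), assuming the struct starpu_conf fields of StarPU 1.2; the stats and profiling variables are left in the environment:

#include <starpu.h>

int main(void)
{
    struct starpu_conf conf;
    starpu_conf_init(&conf);

    conf.ncpus = 0;                  /* STARPU_NCPU=0     */
    conf.ncuda = 1;                  /* STARPU_NCUDA=1    */
    conf.nopencl = 0;                /* STARPU_NOPENCL=0  */
    conf.sched_policy_name = "dmda"; /* STARPU_SCHED=dmda */

    if (starpu_init(&conf) != 0)
        return 1;

    /* ... */

    starpu_shutdown();
    return 0;
}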