Re: [Starpu-devel] Bug when dealing with a huge number of tiles during distributed executions
- From: Samuel Thibault <samuel.thibault@ens-lyon.org>
- To: Marc Sergent <marc.sergent@inria.fr>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] Bug when dealing with a huge number of tiles during distributed executions
- Date: Tue, 6 May 2014 14:03:54 +0200
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hello,
Marc Sergent wrote on Tue, 06 May 2014 13:32:33 +0200:
> Program received signal SIGABRT, Aborted.
> 0x00007ffff3cfa645 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00007ffff3cfa645 in raise () from /lib64/libc.so.6
> #1 0x00007ffff3cfbc33 in abort () from /lib64/libc.so.6
> #2 0x00007ffff3cf3329 in __assert_fail () from /lib64/libc.so.6
> #3 0x00007ffff39fe3cb in starpu_malloc_flags (A=0x7fffffffa388,
> dim=294912, flags=<optimized out>) at ../../src/datawizard/malloc.c:220
> #4 0x00007ffff39fe592 in _starpu_malloc_on_node (dst_node=0, size=294912)
> at ../../src/datawizard/malloc.c:380
> (gdb) f 3
> #3 0x00007ffff39fe3cb in starpu_malloc_flags (A=0x7fffffffa388,
> dim=294912, flags=<optimized out>) at ../../src/datawizard/malloc.c:220
> 220 STARPU_ASSERT(*A);
> (gdb) p *A
> $1 = (void *) 0x0
You could also check errno, but I guess it is ENOMEM. I also guess
_malloc_align is still sizeof(void*), not something else, so the
simple malloc() path was taken. Actually the code is a bit odd: only
the posix_memalign path would return -ENOMEM instead of just crashing
on the assertion. We will probably want to fix that for a memory-aware
version of StarPU, but we are not there yet.
So malloc failed... 86400*86400*8 is about 55 GiB, which is quite big,
but not that much. You should probably check used_size[0] (in the
memory manager) to see how much we have actually allocated. Apparently
/proc/sys/vm/overcommit_memory is 0 on the fourmi nodes; perhaps you
want to check that it is indeed so on the precise fourmi nodes that
you got. You can also check with strace what the kernel actually
returned to userland.
Now, I guess the actual issue is, as you said, the huge number of tiles
(450*450, i.e. 202500). We have not really optimized the size of the
StarPU handle and replicate structures so much; you can probably try to
reduce some MAX limits at least: STARPU_MAXNODES and STARPU_NMAXWORKERS.
We may want to allocate them through
starpu_malloc_flags(STARPU_MALLOC_COUNT) instead of plain malloc, so
that that memory consumption gets accounted too.
Samuel
Archives managed by MHonArc 2.6.19+.