starpu-devel - Re: [Starpu-devel] [trace interpretation]

From: Maxim Abalenkov <maxim.abalenkov@gmail.com>
To: Samuel Thibault <samuel.thibault@inria.fr>
Cc: starpu-devel@lists.gforge.inria.fr
Subject: Re: [Starpu-devel] [trace interpretation]
Date: Mon, 6 Aug 2018 13:29:32 +0100
List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Dear all,

I have implemented the first version of my LU factorisation code with contexts. Please see the skeleton of the code attached.

It completes the execution, but somewhere towards the end it crashes with the following error:

free(): invalid size
Aborted (core dumped)

I have also run the code through Valgrind. Please see Valgrind's output attached as well.
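As a reading aid for the two "Invalid read of size 1" entries in that output, here is a small arithmetic sketch. Memcheck's phrase "N bytes after a block of size S" places the faulting address at offset S + N from the start of the block. The helper names `offset_in_block` and `block_start` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a faulting address from the start of the heap block, given
 * Memcheck's "N bytes after a block of size S" wording.
 * Hypothetical helper, not part of Valgrind or StarPU. */
static uint64_t offset_in_block(uint64_t bytes_after, uint64_t block_size)
{
    return block_size + bytes_after;
}

/* Start address of the block, recovered from the faulting address. */
static uint64_t block_start(uint64_t addr, uint64_t bytes_after,
                            uint64_t block_size)
{
    return addr - bytes_after - block_size;
}
```

With the reported values (addresses 0x1a74a0e0 and 0x1a74a0f7, 0 and 23 bytes after a block of size 96), both reads resolve to the same block starting at 0x1a74a080, at offsets 96 and 119. If both reads belong to a single registered region (an assumption; the log does not say so), that region would be 24 bytes long and would begin exactly at the end of the 96-byte allocation made at zgetrf.c:171.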

I believe the problem might be with the reduction routine "core_starpu_dcabs1".

    // set methods to define neutral elements, perform reduction operation
    starpu_data_set_reduction_methods(
            hp, &core_starpu_codelet_dcabs1_redux, &core_starpu_codelet_dcabs1_init);

    // @test reset pivot values
    core_starpu_dcabs1_init(hp, sched_ctx, prio);

Since I call the reduction function multiple times throughout the program, I would also like to reset (reinitialise) the pivot value "hp" between calls. Therefore, I explicitly call the "_init" routine, which relies on the same codelet, "core_starpu_codelet_dcabs1_init". Is this a legitimate thing to do? What would be the recommended approach?
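To make the question concrete, here is a minimal standalone model of what an "init" and a "redux" operation mean for a pivot search. It is plain C, not StarPU code, and it is not the attached implementation: the type `pivot_t` and the functions `pivot_init`, `pivot_redux` and `pivot_scan` are hypothetical names for illustration. The magnitudes are taken from a real array; if the routine name refers to the BLAS dcabs1 function, the magnitude of a complex entry would be |Re| + |Im| instead.

```c
#include <assert.h>

/* Hypothetical pivot record: largest magnitude seen so far and its row. */
typedef struct {
    double val;  /* magnitude of the current pivot candidate */
    int    idx;  /* global row index of that candidate, -1 if none */
} pivot_t;

/* "init": write the neutral element, which any real candidate beats. */
static void pivot_init(pivot_t *p)
{
    p->val = 0.0;
    p->idx = -1;
}

/* "redux": fold one partial result into another, keeping the larger one. */
static void pivot_redux(pivot_t *dst, const pivot_t *src)
{
    if (src->val > dst->val)
        *dst = *src;
}

/* One contribution: scan n entries of a column tile starting at row0. */
static void pivot_scan(pivot_t *p, const double *col, int n, int row0)
{
    for (int i = 0; i < n; i++) {
        double a = col[i] < 0.0 ? -col[i] : col[i];
        if (a > p->val) {
            p->val = a;
            p->idx = row0 + i;
        }
    }
}
```

In this model each worker starts from the neutral element, scans its own tile, and the partial results are then folded together with `pivot_redux`. Resetting "hp" between panels corresponds to calling `pivot_init` on the final result once it has been consumed.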

Another question concerns the use of contexts. In my scenario I dedicate a subset of threads, called "mtpf", to the tasks of the LU panel factorisation. The other, larger group of threads handles the remaining parts of the algorithm. Could you please take a look at my "skeleton" code to check that I am using the contexts correctly? The "mtpf" threads factorise the matrix panel by panel. Is it recommended to merge the "mtpf" threads with the other "main" group at the end of each panel factorisation and dedicate them anew at the beginning of the next panel, or to keep the "mtpf" threads separate until the end of the code?
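One detail the split into two groups depends on is that "mtpf" is strictly between zero and the number of CPU workers, since otherwise one of the two teams would be empty. A possible guard, written as plain C with no StarPU calls, is sketched below. The helper `split_workers` is a hypothetical name and is not part of the attached skeleton.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: split ncpu worker ids into a panel team of mtpf
 * ids and a second team holding the remaining ncpu - mtpf ids.
 * Returns 0 on success, -1 if the split would leave a team empty or if
 * an allocation fails. The caller frees *team1 and *team2. */
static int split_workers(const int *ids, int ncpu, int mtpf,
                         int **team1, int **team2)
{
    *team1 = NULL;
    *team2 = NULL;

    if (ids == NULL || mtpf <= 0 || mtpf >= ncpu)
        return -1;

    *team1 = malloc((size_t)mtpf * sizeof **team1);
    *team2 = malloc((size_t)(ncpu - mtpf) * sizeof **team2);
    if (*team1 == NULL || *team2 == NULL) {
        free(*team1);
        free(*team2);
        *team1 = NULL;
        *team2 = NULL;
        return -1;
    }

    for (int i = 0; i < mtpf; i++)
        (*team1)[i] = ids[i];
    for (int j = 0; j < ncpu - mtpf; j++)
        (*team2)[j] = ids[mtpf + j];

    return 0;
}
```

With four worker ids {0, 1, 2, 3} and mtpf = 2, this yields teams {0, 1} and {2, 3}, which matches the worker lists printed for the two contexts in the attached log.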

Thank you and have a good day ahead!

Best wishes,

Maxim

On 31 July 2018 at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Samuel,

Thank you very much for the directions. I have found plenty of examples of scheduling contexts. The simplest one I would like to mimic in my code is “two_cpu_contexts.c”. Do you think it is an appropriate example to start from?

Here is what I would like to achieve. Assume my algorithm is an LU factorisation of a matrix. It has three computational parts: a) pivot search and processing of the first panel of tiles, b) trailing matrix update and c) pivoting to the left. I would like to dedicate a subset of threads to part a). As I see it, I will have to:

1) create two contexts (i) for part a and (ii) for parts b and c.
2) launch context (i) and (ii) simultaneously and let them perform their tasks
3) once context (i) has finished merge it with context (ii)
4) re-create context (i) for the subsequent panel

Maybe this is not optimal and I should not merge at all? Or should I merge only once part a) has been completed for all of the panels? Please let me know if my plan is a valid one. Thank you and have a good day ahead!

—
Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 26 Jul 2018, at 18:08, Samuel Thibault <samuel.thibault@inria.fr> wrote:

Maxim Abalenkov, on Thu 26 Jul 2018 17:58:14 +0100, wrote:

> Yes.
>
> Is there an example of scheduling contexts
> somewhere I could take a look at? Thank you!

There are, yes, see examples/sched_ctx*

Samuel

==22260== Memcheck, a memory error detector
==22260== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22260== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==22260== Command: ./test zgetrf --mtpf=2
==22260==

==22260== Conditional jump or move depends on uninitialised value(s)
==22260== at 0x12D603: print_header (test.c:362)
==22260== by 0x12D2DE: main (test.c:243)
==22260==
Status Error Time Gflop/s m n nb ib padA mtpf zerocol

PLASMA ERROR at 37 of plasma_tuning_init() in control/tuning.c:
PLASMA_TUNING_FILENAME not set
[starpu][starpu_initialize] Warning: StarPU was configured with --with-fxt, which slows down a bit, limits scalability and makes worker initialization sequential
[sched_ctx 1]: 2 workers
CPU 0
CPU 1
[sched_ctx 2]: 2 workers
CPU 2
CPU 3
==22260== Invalid read of size 1
==22260== at 0x562A160: starpu_variable_data_register (variable_interface.c:120)
==22260== by 0x4EDCCDC: plasma_zgetrf (zgetrf.c:186)
==22260== by 0x135293: test_zgetrf (test_zgetrf.c:120)
==22260== by 0x12DD2C: run_routine (test.c:501)
==22260== by 0x12D7A9: test_routine (test.c:397)
==22260== by 0x12D32D: main (test.c:250)
==22260== Address 0x1a74a0e0 is 0 bytes after a block of size 96 alloc'd
==22260== at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
==22260== by 0x4EDCB75: plasma_zgetrf (zgetrf.c:171)
==22260== by 0x135293: test_zgetrf (test_zgetrf.c:120)
==22260== by 0x12DD2C: run_routine (test.c:501)
==22260== by 0x12D7A9: test_routine (test.c:397)
==22260== by 0x12D32D: main (test.c:250)
==22260==
==22260== Invalid read of size 1
==22260== at 0x562A167: starpu_variable_data_register (variable_interface.c:121)
==22260== by 0x4EDCCDC: plasma_zgetrf (zgetrf.c:186)
==22260== by 0x135293: test_zgetrf (test_zgetrf.c:120)
==22260== by 0x12DD2C: run_routine (test.c:501)
==22260== by 0x12D7A9: test_routine (test.c:397)
==22260== by 0x12D32D: main (test.c:250)
==22260== Address 0x1a74a0f7 is 23 bytes after a block of size 96 alloc'd
==22260== at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
==22260== by 0x4EDCB75: plasma_zgetrf (zgetrf.c:171)
==22260== by 0x135293: test_zgetrf (test_zgetrf.c:120)
==22260== by 0x12DD2C: run_routine (test.c:501)
==22260== by 0x12D7A9: test_routine (test.c:397)
==22260== by 0x12D32D: main (test.c:250)
==22260==
==22260== Thread 2 CPU 0:
==22260== Invalid write of size 4
==22260== at 0x527220F: core_starpu_cpu_dcabs1_init (core_dcabs1.c:107)
==22260== by 0x5676025: execute_job_on_cpu (driver_cpu.c:102)
==22260== by 0x5676025: _starpu_cpu_driver_execute_task (driver_cpu.c:235)
==22260== by 0x56767C5: _starpu_cpu_driver_run_once (driver_cpu.c:295)
==22260== by 0x5676A9C: _starpu_cpu_worker (driver_cpu.c:407)
==22260== by 0xC1FB074: start_thread (in /usr/lib/libpthread-2.27.so)
==22260== by 0xC50A53E: clone (in /usr/lib/libc-2.27.so)
==22260== Address 0x1a74a338 is 24 bytes after a block of size 512 in arena "client"
==22260==

valgrind: m_mallocfree.c:307 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 576, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.

host stacktrace:
==22260== at 0x580441BA: show_sched_status_wrk (m_libcassert.c:355)
==22260== by 0x580442D4: report_and_quit (m_libcassert.c:426)
==22260== by 0x58044459: vgPlain_assert_fail (m_libcassert.c:492)
==22260== by 0x58052FC0: get_bszB_as_is (m_mallocfree.c:305)
==22260== by 0x58052FC0: get_bszB (m_mallocfree.c:315)
==22260== by 0x58052FC0: get_pszB (m_mallocfree.c:389)
==22260== by 0x58052FC0: vgPlain_describe_arena_addr (m_mallocfree.c:1592)
==22260== by 0x5803CECA: vgPlain_describe_addr (m_addrinfo.c:186)
==22260== by 0x5803B5D3: vgMemCheck_update_Error_extra (mc_errors.c:1186)
==22260== by 0x5803FC9D: vgPlain_maybe_record_error (m_errormgr.c:812)
==22260== by 0x5803A8CB: vgMemCheck_record_address_error (mc_errors.c:767)
==22260== by 0x1003BE9C5D: ???
==22260== by 0x10097CEF1F: ???
==22260== by 0x381F: ???
==22260== by 0x1002009F6F: ???
==22260== by 0x10097CEF07: ???
==22260== by 0x10097CEF1F: ???
==22260== by 0x205: ???
==22260== by 0x58FB6C7: ???
==22260== by 0x102C7: ???

sched status:
running_tid=2

Thread 1: status = VgTs_WaitSys (lwpid 22260)
==22260== at 0xC200FFC: pthread_cond_wait@@GLIBC_2.3.2 (in /usr/lib/libpthread-2.27.so)
==22260== by 0x558EE2D: _starpu_barrier_counter_wait_for_empty_counter (barrier_counter.c:46)
==22260== by 0x55D6A63: _starpu_wait_for_all_tasks_of_sched_ctx (sched_ctx.c:1457)
==22260== by 0x55A028B: _starpu_task_wait_for_all_in_ctx_and_return_nb_waited_tasks (task.c:932)
==22260== by 0x55A02C8: starpu_task_wait_for_all_in_ctx (task.c:941)
==22260== by 0x55A039D: _starpu_task_wait_for_all_and_return_nb_waited_tasks (task.c:909)
==22260== by 0x55A0468: starpu_task_wait_for_all (task.c:925)
==22260== by 0x4EDCE79: plasma_zgetrf (zgetrf.c:272)
==22260== by 0x135293: test_zgetrf (test_zgetrf.c:120)
==22260== by 0x12DD2C: run_routine (test.c:501)
==22260== by 0x12D7A9: test_routine (test.c:397)
==22260== by 0x12D32D: main (test.c:250)

Thread 2: status = VgTs_Runnable (lwpid 22263)
==22260== at 0x5272221: core_starpu_cpu_dcabs1_init (core_dcabs1.c:108)
==22260== by 0x5676025: execute_job_on_cpu (driver_cpu.c:102)
==22260== by 0x5676025: _starpu_cpu_driver_execute_task (driver_cpu.c:235)
==22260== by 0x56767C5: _starpu_cpu_driver_run_once (driver_cpu.c:295)
==22260== by 0x5676A9C: _starpu_cpu_worker (driver_cpu.c:407)
==22260== by 0xC1FB074: start_thread (in /usr/lib/libpthread-2.27.so)
==22260== by 0xC50A53E: clone (in /usr/lib/libc-2.27.so)

Thread 3: status = VgTs_WaitSys (lwpid 22264)
==22260== at 0xC4F1E47: sched_yield (in /usr/lib/libc-2.27.so)
==22260== by 0x560A15C: ___starpu_datawizard_progress (datawizard.c:39)
==22260== by 0x560A274: __starpu_datawizard_progress (datawizard.c:98)
==22260== by 0x56766D1: _starpu_cpu_driver_run_once (driver_cpu.c:300)
==22260== by 0x5676A9C: _starpu_cpu_worker (driver_cpu.c:407)
==22260== by 0xC1FB074: start_thread (in /usr/lib/libpthread-2.27.so)
==22260== by 0xC50A53E: clone (in /usr/lib/libc-2.27.so)

Thread 4: status = VgTs_WaitSys (lwpid 22265)
==22260== at 0xC4F1E47: sched_yield (in /usr/lib/libc-2.27.so)
==22260== by 0x55FE854: _starpu_exponential_backoff (driver_common.c:341)
==22260== by 0x55FE854: _starpu_get_worker_task (driver_common.c:451)
==22260== by 0x56766DE: _starpu_cpu_driver_run_once (driver_cpu.c:303)
==22260== by 0x5676A9C: _starpu_cpu_worker (driver_cpu.c:407)
==22260== by 0xC1FB074: start_thread (in /usr/lib/libpthread-2.27.so)
==22260== by 0xC50A53E: clone (in /usr/lib/libc-2.27.so)

Thread 5: status = VgTs_WaitSys (lwpid 22266)
==22260== at 0xC4F1E47: sched_yield (in /usr/lib/libc-2.27.so)
==22260== by 0x55FE854: _starpu_exponential_backoff (driver_common.c:341)
==22260== by 0x55FE854: _starpu_get_worker_task (driver_common.c:451)
==22260== by 0x56766DE: _starpu_cpu_driver_run_once (driver_cpu.c:303)
==22260== by 0x5676A9C: _starpu_cpu_worker (driver_cpu.c:407)
==22260== by 0xC1FB074: start_thread (in /usr/lib/libpthread-2.27.so)
==22260== by 0xC50A53E: clone (in /usr/lib/libc-2.27.so)

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.

int ncpu = starpu_cpu_worker_get_count();

int *ids = calloc(ncpu, sizeof(int));
int *team1 = calloc(mtpf, sizeof(int));
int *team2 = calloc(ncpu-mtpf, sizeof(int));

starpu_worker_get_ids_by_type(STARPU_CPU_WORKER, ids, ncpu);

// Create teams of processors
for (int i = 0; i < mtpf; i++) {
    team1[i] = ids[i];
}

for (int j = 0; j < ncpu-mtpf; j++) {
    team2[j] = ids[j+mtpf];
}

// Create scheduling context 1 with default policy
unsigned sched_ctx1 = starpu_sched_ctx_create(team1, mtpf, "ctx1",
        STARPU_SCHED_CTX_POLICY_NAME, "", 0);

// Create scheduling context 2 with default policy
unsigned sched_ctx2 = starpu_sched_ctx_create(team2, ncpu-mtpf, "ctx2",
        STARPU_SCHED_CTX_POLICY_NAME, "", 0);

// Set inheritor
starpu_sched_ctx_set_inheritor(sched_ctx2, sched_ctx1);

// @test Display workers
starpu_sched_ctx_display_workers(sched_ctx1, stderr);
starpu_sched_ctx_display_workers(sched_ctx2, stderr);

// Call the panel async function
plasma_starpu_zgetrf(A, dPivVec, dPivSeg, hp,
                     sched_ctx1, sched_ctx2,
                     &sequence, &request);

// Synchronize
starpu_task_wait_for_all();

// Destroy scheduling contexts
starpu_sched_ctx_delete(sched_ctx1);
starpu_sched_ctx_delete(sched_ctx2);

free(ids); free(team1); free(team2);

//----------------------------------------------------------------------

// StarPU task inserter (diagonal tile)
void core_starpu_dcabs1_diag(starpu_data_handle_t hp,
                             starpu_data_handle_t a0,
                             plasma_desc_t A, int j,
                             unsigned sched_ctx, int prio) {

    int retval = starpu_task_insert(
            &core_starpu_codelet_dcabs1_diag,
            STARPU_REDUX,     hp,
            STARPU_R,         a0,
            STARPU_VALUE,     &A, sizeof(plasma_desc_t),
            STARPU_VALUE,     &j, sizeof(int),
            STARPU_SCHED_CTX, sched_ctx,
            STARPU_PRIORITY,  prio,
            STARPU_NAME,      "dcabs1_diag",
            0);

    STARPU_CHECK_RETURN_VALUE(retval,
            "core_starpu_dcabs1_diag: starpu_task_insert() failed");
}
