- From: Olivier Aumage <olivier.aumage@inria.fr>
- To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
- Date: Sat, 10 Feb 2018 12:17:58 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi Maxim,
I am not familiar with the peager implementation of StarPU (nor, I believe,
is Samuel). I have had a quick look at the peager policy code, and there
seems to be an issue with the initialization phase of the policy, or perhaps
I am simply missing its rationale.
Could you check whether the quick patch attached improves the scalability of
your code? You can apply it with the following command:
$ patch -p1 <../peager.patch
This is only meant to be a temporary fix, however. I need to check with
people who wrote the code about what the initial intent was.
Hope this helps.
Best regards,
--
Olivier
Attachment:
peager.patch
Description: Binary data
> On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>
> Dear all,
>
> I have implemented the parallel panel factorisation in LU with StarPU's
> SPMD capability. Here are a few answers to my own questions:
>
>> 1) Am I passing the barrier structure correctly, so that it is "shared"
>> amongst all the threads and each thread "knows" about the status of the
>> others? To achieve this I pass the barrier structure by reference.
>
> Yes, it is passed correctly. All other threads "know about" and share the
> values inside the barrier structure.
>
>> 2) Maybe it is the tile descriptors that "block" the execution of the
>> threads inside the panel? Maybe the threads with ranks 1 and 2 cannot
>> proceed, since all the tiles are blocked by rank 0? Should I conclude
>> that "blocking" the tiles the way I do is incorrect?
>
> Tile "blocking" is correct. The problem did not lie in the tile "blocking",
> but in the use of a non-parallel StarPU scheduler. According to the StarPU
> handbook, only two schedulers, "pheft" and "peager", support the SPMD mode
> of execution.
>
>> 3) Is there a way to pass a variable to the codelet to set the
>> “max_parallelism” value instead of hard-coding it?
>
> Since the codelet is a static structure, I set the maximum number of
> threads by assigning the "max_parallelism" field as follows, right before
> inserting the SPMD task:
>
> // Set maximum no. of threads per panel factorisation
> core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
>
> Please find attached a performance plot of the LU factorisation (with and
> without the SPMD functionality) executed on a 20-core Haswell machine. I
> believe something is going terribly wrong, since the SPMD performance
> numbers are so low. I used the following commands to run the tests:
>
> export MKL_NUM_THREADS=20
> export OMP_NUM_THREADS=20
> export OMP_PROC_BIND=true
> export STARPU_NCPU=20
> export STARPU_SCHED=peager
> export PLASMA_TUNING_FILENAME=...
>
> numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=... --mtpf=... --iter=...
>
> Any insight and help in recovering the performance numbers would be greatly
> appreciated. Thank you and have a good day!
>
> --
> Best wishes,
> Maxim
>
> <haswell_dgetrf2_starpu.pdf>
>
> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>
>> On 5 Feb 2018, at 12:16, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>
>> Dear all,
>>
>> I'm on a mission to apply the SPMD capability of StarPU
>> (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks)
>> to the panel factorisation stage of the LU algorithm. Please see the
>> attached figure for an example of my scenario.
>>
>> The matrix is viewed as a set of tiles (rectangular or square matrix
>> blocks). A column of tiles is called a panel.
>>
>> In the first stage of the LU algorithm I would like to take a panel, find
>> the pivots, swap the necessary rows, and scale and update the underlying
>> matrix elements. To track the dependencies I created tile descriptors that
>> keep the information about the access mode and the tile handle.
>> Essentially, the tile descriptors are used to "lock" the entire panel; all
>> the operations inside are parallelised manually, using a custom barrier
>> and auxiliary arrays to store the maximum values and their indices. To
>> assign a particular part of the work to a thread processing the panel
>> factorisation I use ranks: depending on its rank, each thread gets its
>> portion of the data to work on. Inside the panel the threads are
>> synchronised manually and wait for each other at the custom barrier.
>>
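>> A custom barrier of this kind is typically a small busy-wait
>> (sense-reversing) barrier. A minimal sketch of that idea, assuming the GCC
>> atomic builtins and not reproducing the actual plasma_barrier_t
>> implementation, looks like this:
>>
>> typedef struct {
>>     volatile int count;   // threads that still have to arrive
>>     volatile int sense;   // flips every time the barrier opens
>>     int          size;    // number of participating threads
>> } spin_barrier_t;         // initialise with count = size, sense = 0
>>
>> void spin_barrier_wait(spin_barrier_t *b) {
>>     int my_sense = !b->sense;
>>     if (__sync_sub_and_fetch(&b->count, 1) == 0) {
>>         b->count = b->size;            // last thread resets the counter...
>>         b->sense = my_sense;           // ...and releases the waiting threads
>>     } else {
>>         while (b->sense != my_sense)   // the others spin until released
>>             ;
>>     }
>> }
>>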
>> Please pay attention to the attached figure. A panel consisting of five
>> tiles is passed to a StarPU task. Imagine we have three threads processing
>> the panel. To find the first pivot we assign the first column of each tile
>> to a thread in a round-robin manner (0, 1, 2, 0, 1). Once each thread has
>> found the maximum over its tiles, the master thread (with rank 0) selects
>> the global maximum. I would like to apply the SPMD capability of StarPU to
>> process the panel and use a custom barrier inside.
>>
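>> In pseudo-C, the pivot search follows roughly this pattern (a sketch only:
>> nthr, ntiles and the tile-scan body are placeholders rather than the
>> actual PLASMA kernel, cabs() comes from <complex.h>, and
>> spin_barrier_wait() stands for the custom barrier sketched above):
>>
>> for (int i = rank; i < ntiles; i += nthr) {
>>     // scan the current column of tile i and record this thread's local
>>     // maximum in max_val[rank] / max_idx[rank]
>> }
>> spin_barrier_wait(barrier);        // all local maxima are now available
>> if (rank == 0) {
>>     int best = 0;                  // rank 0 reduces the per-thread maxima
>>     for (int t = 1; t < nthr; t++)
>>         if (cabs(max_val[t]) > cabs(max_val[best]))
>>             best = t;
>>     max_idx[0] = max_idx[best];    // publish the global pivot (slot 0 assumed)
>>     max_val[0] = max_val[best];
>> }
>> spin_barrier_wait(barrier);        // every thread now sees the selected pivot
>>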
>> Please consider the C code below. The code runs, but the threads wait
>> indefinitely at the first barrier. My questions are:
>>
>> 1) Am I passing the barrier structure correctly, so that it is "shared"
>> amongst all the threads and each thread "knows" about the status of the
>> others? To achieve this I pass the barrier structure by reference.
>> 2) Maybe it is the tile descriptors that "block" the execution of the
>> threads inside the panel? Maybe the threads with ranks 1 and 2 cannot
>> proceed, since all the tiles are blocked by rank 0? Should I conclude
>> that "blocking" the tiles the way I do is incorrect?
>> 3) Is there a way to pass a variable to the codelet to set the
>> “max_parallelism” value instead of hard-coding it?
>>
>> 4) If I may, I would like to make a general comment. I like StarPU very
>> much, and I think you have invested a great deal of time and effort into
>> it. Thank you. But to my mind the weakest point (from my user experience)
>> is passing values to StarPU while inserting a task: there is no type
>> checking of the variables there. The same applies to the routine
>> "starpu_codelet_unpack_args()" when you want to obtain the values on the
>> other side. Sometimes it becomes a nightmare and a trial-and-error
>> exercise. If type checks could be enforced there, it would make a user's
>> life much easier.
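>>
>> To make the point concrete, a toy illustration (assuming a codelet cl and
>> made-up argument names, not the PLASMA code): the STARPU_VALUE triples and
>> the unpack call have to agree in order, type and size, and the compiler
>> cannot verify any of it.
>>
>> #include <starpu.h>
>>
>> static void toy_kernel(void *buffers[], void *cl_arg) {
>>     int    ib;
>>     double tol;
>>     // Must unpack in the same order and into the same types as packed;
>>     // any mismatch silently corrupts the values.
>>     starpu_codelet_unpack_args(cl_arg, &ib, &tol);
>> }
>>
>> static void toy_submit(struct starpu_codelet *cl) {
>>     int    ib  = 64;
>>     double tol = 1e-12;
>>     starpu_task_insert(cl,
>>                        STARPU_VALUE, &ib,  sizeof(ib),
>>                        STARPU_VALUE, &tol, sizeof(tol),
>>                        0);
>> }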
>>
>> // StarPU LU panel factorisation function
>> /******************************************************************************/
>> void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
>>                  volatile int *max_idx, volatile plasma_complex64_t *max_val,
>>                  int ib, int rank, int mtpf, volatile int *info,
>>                  plasma_barrier_t *barrier)
>> {
>>     …
>> }
>>
>> /******************************************************************************/
>> // StarPU ZGETRF SPMD CPU kernel
>> static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {
>>
>>     plasma_desc_t A;
>>     int ib, mtpf, k, *piv;
>>     volatile int *max_idx, *info;
>>     volatile plasma_complex64_t *max_val;
>>     plasma_barrier_t *barrier;
>>
>>     // Unpack scalar arguments
>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
>>                                &k, &info, &barrier);
>>
>>     int rank = starpu_combined_worker_get_rank();
>>
>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>     plasma_complex64_t **pnlK =
>>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>     assert(pnlK != NULL);
>>
>>     printf("Panel: %d\n", k);
>>
>>     // Unpack tile data
>>     for (int i = 0; i < A.mt; i++) {
>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>     }
>>
>>     // Unpack pivots vector
>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>
>>     // Call computation kernel
>>     core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>                 ib, rank, mtpf, info, barrier);
>>
>>     // Deallocate container panel
>>     free(pnlK);
>> }
>>
>> /******************************************************************************/
>> // StarPU SPMD codelet
>> static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
>> {
>> .type = STARPU_SPMD,
>> .max_parallelism = 2,
>> .cpu_funcs = { core_starpu_cpu_zgetrf_spmd },
>> .cpu_funcs_name = { "zgetrf_spmd" },
>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>> };
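>>
>> As a side note, inside an SPMD kernel such as core_starpu_cpu_zgetrf_spmd()
>> above, StarPU can also report the size of the parallel team it actually
>> allocated to the task (which may be smaller than max_parallelism), next to
>> the rank:
>>
>>     int rank = starpu_combined_worker_get_rank();   // this worker's rank in the team
>>     int size = starpu_combined_worker_get_size();   // actual team size for this task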
>>
>> /******************************************************************************/
>> // StarPU task inserter
>> void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
>>                              volatile int *max_idx,
>>                              volatile plasma_complex64_t *max_val,
>>                              int ib, int mtpf, int k,
>>                              volatile int *info,
>>                              plasma_barrier_t *barrier) {
>>
>>     // Pointer to first (top) tile in panel k
>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>
>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>     for (int i = 0; i < A.mt; i++) {
>>         (pk+i)->mode = STARPU_RW;
>>     }
>>
>>     int retval = starpu_task_insert(
>>         &core_starpu_codelet_zgetrf_spmd,
>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>         STARPU_RW, hPiv,
>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>         STARPU_VALUE, &ib, sizeof(int),
>>         STARPU_VALUE, &mtpf, sizeof(int),
>>         STARPU_VALUE, &k, sizeof(int),
>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>         STARPU_NAME, "zgetrf",
>>         0);
>>
>>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
>> }
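>>
>> For reference, a possible call site in the panel loop (the surrounding
>> loop and variable names are assumptions, not the actual PLASMA driver)
>> would set the SPMD width and then insert the panel task:
>>
>>     core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
>>     core_starpu_zgetrf_spmd(A, hPiv, max_idx, max_val,
>>                             ib, mtpf, k, info, barrier);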
>>
>> --
>> Best wishes,
>> Maxim
>>
>> <lu_panel_fact.jpg>
>>
>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>
>>> On 24 Jan 2018, at 17:52, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>> wrote:
>>>
>>> Hello Samuel,
>>>
>>> Thank you very much! Yes, in this particular use case "STARPU_NONE" would
>>> come in handy and make the source code much more "elegant".
>>>
>>> --
>>> Best wishes,
>>> Maxim
>>>
>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>
>>>> On 24 Jan 2018, at 17:47, Samuel Thibault <samuel.thibault@inria.fr>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Maxim Abalenkov, on Mon, 15 Jan 2018 18:04:48 +0000, wrote:
>>>>> I have a very simple question. What is the overhead of using the
>>>>> STARPU_NONE
>>>>> access mode for some handles in the STARPU_DATA_MODE_ARRAY?
>>>>
>>>> It is not implemented; we hadn't thought it could be useful. I have now
>>>> added it to the TODO list (but that list is very long and does not tend
>>>> to progress quickly).
>>>>
>>>> The overhead would be quite small: StarPU would just record it in the
>>>> array of data to fetch and simply not process that element. Of course,
>>>> the theoretical complexity would still be O(number of data).
>>>>
>>>>> In order to avoid using complicated offsets in my computation routines
>>>>> I would like to pass them a column of matrix tiles, while setting the
>>>>> “unused” tiles to “STARPU_NONE”.
>>>>
>>>> I see.
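>>>>
>>>> In the meantime, a possible workaround is to build a compacted
>>>> descriptor array that contains only the tiles the task really touches
>>>> and to pass that array to STARPU_DATA_MODE_ARRAY. A rough sketch, where
>>>> NT, tile_is_used() and tile_handle() are hypothetical placeholders:
>>>>
>>>>     struct starpu_data_descr used[NT];
>>>>     int n = 0;
>>>>     for (int i = 0; i < NT; i++) {
>>>>         if (tile_is_used(i)) {                 // hypothetical predicate
>>>>             used[n].handle = tile_handle(i);   // hypothetical accessor
>>>>             used[n].mode   = STARPU_RW;
>>>>             n++;
>>>>         }
>>>>     }
>>>>     // then pass "used" and "n" with STARPU_DATA_MODE_ARRAY as usual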
>>>>
>>>> Samuel