Subject: Developers list for StarPU
List archives
- From: Olivier Aumage <olivier.aumage@inria.fr>
- To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
- Date: Mon, 12 Feb 2018 23:01:03 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi Maxim,
Regarding the issue with 'peager' scalability, the unpatched master branch
should be similar to the 1.2.3 version. However, since 'peager' is still
considered experimental, it is probably better to switch to the master
branch, as fixes will likely arrive there first.
Best regards,
--
Olivier
> On 12 Feb 2018, at 22:50, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>
> Hello Olivier,
>
> I’m using version 1.2.3, downloaded from the INRIA website. Would it be
> better to use the “rolling” edition? I will install it tomorrow morning!
>
> —
> Best wishes,
> Maxim
>
> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>
>> On 12 Feb 2018, at 21:46, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>
>> Hi Maxim,
>>
>> My patch was against StarPU's master branch as of Saturday morning.
>> Which version of StarPU are you currently using?
>>
>> Best regards,
>> --
>> Olivier
>>
>>> On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>
>>> Hello Olivier,
>>>
>>> Thank you very much for your reply and the patch. I have applied the
>>> patch to the code and will re-run the experiments. I will get back to you
>>> with the results. I think one of the changes in the patch did not apply
>>> cleanly. Please find below the output of the patch command and the
>>> file with the rejects. Thank you and have a good day ahead!
>>>
>>> —
>>> Best wishes,
>>> Maksims
>>>
>>> <log>
>>> <parallel_eager.c.rej>
>>>
>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>
>>>> On 10 Feb 2018, at 11:17, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>>>
>>>> Hi Maxim,
>>>>
>>>> I am not familiar with the peager implementation of StarPU (nor, I
>>>> believe, is Samuel). I have had a quick look at the peager policy code,
>>>> and there seems to be an issue with the initialization phase of the
>>>> policy. Or perhaps I am missing its rationale...
>>>>
>>>> Could you check whether the quick patch attached improves the scalability
>>>> of your code? You can apply it with the following command:
>>>> $ patch -p1 < ../peager.patch
>>>>
>>>> This is only meant to be a temporary fix, however. I need to check with
>>>> people who wrote the code about what the initial intent was.
>>>>
>>>> Hope this helps.
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>> <peager.patch>
>>>>
>>>>> On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I have implemented the parallel panel factorisation in LU with the
>>>>> StarPU’s SPMD capability. Here are a few answers to my own questions:
>>>>>
>>>>>> 1) Am I passing the barrier structure correctly, so that it is
>>>>>> “shared” amongst all the threads and the threads “know” about the
>>>>>> status of the other threads? To achieve this I pass the barrier
>>>>>> structure by reference.
>>>>>
>>>>> Yes, it is passed correctly. All other threads “know about” and share
>>>>> the values inside the barrier structure.
>>>>>
>>>>>> 2) Maybe it is the tile descriptors that “block” the execution of the
>>>>>> threads inside the panel? Maybe the threads with ranks 1, 2 cannot
>>>>>> proceed, since all the tiles are blocked by rank 0? May I therefore
>>>>>> conclude that “blocking” the tiles like I do is incorrect?
>>>>>
>>>>> Tile “blocking” is correct. The problem did not lie in the tile
>>>>> “blocking”, but rather in the use of a non-parallel StarPU
>>>>> scheduler. According to the StarPU handbook, only two schedulers,
>>>>> “pheft” and “peager”, support the SPMD mode of execution.
>>>>>
>>>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>>>> “max_parallelism” value instead of hard-coding it?
>>>>>
>>>>> Since the codelet is a static structure, I set the maximum number
>>>>> of threads by assigning to its “max_parallelism” field, right
>>>>> before inserting the SPMD task:
>>>>>
>>>>> // Set maximum no. of threads per panel factorisation
>>>>> core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
>>>>>
>>>>> Please find attached a performance plot of LU factorisation (with and
>>>>> without SPMD functionality) executed on a 20-core Haswell machine. I
>>>>> believe something is going terribly wrong, since the SPMD performance
>>>>> numbers are so low. I have used the following command to execute the
>>>>> tests:
>>>>>
>>>>> export MKL_NUM_THREADS=20
>>>>> export OMP_NUM_THREADS=20
>>>>> export OMP_PROC_BIND=true
>>>>> export STARPU_NCPU=20
>>>>> export STARPU_SCHED=peager
>>>>> export PLASMA_TUNING_FILENAME=...
>>>>>
>>>>> numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=...
>>>>> --mtpf=... --iter=...
>>>>>
>>>>> Any insight and help in recovering the performance numbers would be
>>>>> greatly appreciated. Thank you and have a good day!
>>>>>
>>>>> —
>>>>> Best wishes,
>>>>> Maxim
>>>>>
>>>>> <haswell_dgetrf2_starpu.pdf>
>>>>>
>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>
>>>>>> On 5 Feb 2018, at 12:16, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I’m on a mission to apply the SPMD capability of StarPU
>>>>>> (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks)
>>>>>> to the panel factorisation stage of the LU algorithm. Please see the
>>>>>> attached figure for an example of my scenario.
>>>>>>
>>>>>> The matrix is viewed as a set of tiles (rectangular or square matrix
>>>>>> blocks). A column of tiles is called a panel.
>>>>>>
>>>>>> In the first stage of the LU algorithm I would like to take a panel,
>>>>>> find the pivots, swap the necessary rows, and scale and update the
>>>>>> underlying matrix elements. To track the dependencies I created tile
>>>>>> descriptors that keep the information about the access mode and the
>>>>>> tile handle. Essentially, the tile descriptors are used to “lock” the
>>>>>> entire panel; all the operations inside are parallelised manually,
>>>>>> using a custom barrier and auxiliary arrays to store the maximum
>>>>>> values and their indices. To be able to assign a particular task to a
>>>>>> thread (processing the panel factorisation) I use ranks. Depending on
>>>>>> its rank, each thread gets its portion of the data to work on. Inside
>>>>>> the panel the threads are synchronised manually and wait for each
>>>>>> other at the custom barrier.
>>>>>>
>>>>>> Please refer to the attached figure. A panel consisting of
>>>>>> five tiles is passed to the StarPU task. Imagine we have three threads
>>>>>> processing the panel. To find the first pivot we assign the first
>>>>>> column of each tile to a thread in a round-robin manner
>>>>>> (0,1,2,0,1). Once each thread has found the maximum of its tiles, the
>>>>>> master thread (with rank 0) selects the global maximum. I would
>>>>>> like to apply the SPMD capability of StarPU to process the panel and
>>>>>> use a custom barrier inside.
>>>>>>
>>>>>> Please consider the C code below. The code runs, but the threads wait
>>>>>> indefinitely at the first barrier. My questions are:
>>>>>>
>>>>>> 1) Am I passing the barrier structure correctly, so that it is
>>>>>> “shared” amongst all the threads and the threads “know” about the
>>>>>> status of the other threads? To achieve this I pass the barrier
>>>>>> structure by reference.
>>>>>> 2) Maybe it is the tile descriptors that “block” the execution of the
>>>>>> threads inside the panel? Maybe the threads with ranks 1, 2 cannot
>>>>>> proceed, since all the tiles are blocked by rank 0? May I therefore
>>>>>> conclude that “blocking” the tiles like I do is incorrect?
>>>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>>>> “max_parallelism” value instead of hard-coding it?
>>>>>>
>>>>>> 4) If I may, I would like to make a general comment. I like
>>>>>> StarPU very much. I think you have invested a great deal of time and
>>>>>> effort into it. Thank you. But to my mind the weakest point (from my
>>>>>> user experience) is passing values to StarPU while inserting a
>>>>>> task: there is no type checking of the variables there. The same
>>>>>> applies to the routine “starpu_codelet_unpack_args()”, when you want
>>>>>> to obtain the values “on the other side”. Sometimes it becomes a
>>>>>> nightmare and a trial-and-error exercise. If type checks could be
>>>>>> enforced there, it would make a user’s life much easier.
>>>>>>
>>>>>> // StarPU LU panel factorisation function
>>>>>> /******************************************************************************/
>>>>>> void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
>>>>>>                  volatile int *max_idx,
>>>>>>                  volatile plasma_complex64_t *max_val,
>>>>>>                  int ib, int rank, int mtpf, volatile int *info,
>>>>>>                  plasma_barrier_t *barrier)
>>>>>> {
>>>>>>     …
>>>>>> }
>>>>>>
>>>>>> /******************************************************************************/
>>>>>> // StarPU ZGETRF SPMD CPU kernel
>>>>>> static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {
>>>>>>
>>>>>>     plasma_desc_t A;
>>>>>>     int ib, mtpf, k, *piv;
>>>>>>     volatile int *max_idx, *info;
>>>>>>     volatile plasma_complex64_t *max_val;
>>>>>>     plasma_barrier_t *barrier;
>>>>>>
>>>>>>     // Unpack scalar arguments
>>>>>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val,
>>>>>>                                &ib, &mtpf, &k, &info, &barrier);
>>>>>>
>>>>>>     int rank = starpu_combined_worker_get_rank();
>>>>>>
>>>>>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>     plasma_complex64_t **pnlK = (plasma_complex64_t**)
>>>>>>         malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>>>>>     assert(pnlK != NULL);
>>>>>>
>>>>>>     printf("Panel: %d\n", k);
>>>>>>
>>>>>>     // Unpack tile data
>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>>>>>     }
>>>>>>
>>>>>>     // Unpack pivots vector
>>>>>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>>>>>
>>>>>>     // Call computation kernel
>>>>>>     core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>>>>>                 ib, rank, mtpf, info, barrier);
>>>>>>
>>>>>>     // Deallocate container panel
>>>>>>     free(pnlK);
>>>>>> }
>>>>>>
>>>>>> /******************************************************************************/
>>>>>> // StarPU SPMD codelet
>>>>>> static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
>>>>>> {
>>>>>> .type = STARPU_SPMD,
>>>>>> .max_parallelism = 2,
>>>>>> .cpu_funcs = { core_starpu_cpu_zgetrf_spmd },
>>>>>> .cpu_funcs_name = { "zgetrf_spmd" },
>>>>>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>>>>>> };
>>>>>>
>>>>>> /******************************************************************************/
>>>>>> // StarPU task inserter
>>>>>> void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
>>>>>>                              volatile int *max_idx,
>>>>>>                              volatile plasma_complex64_t *max_val,
>>>>>>                              int ib, int mtpf, int k,
>>>>>>                              volatile int *info,
>>>>>>                              plasma_barrier_t *barrier) {
>>>>>>
>>>>>>     // Pointer to first (top) tile in panel k
>>>>>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>>>>>
>>>>>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>         (pk+i)->mode = STARPU_RW;
>>>>>>     }
>>>>>>
>>>>>>     int retval = starpu_task_insert(
>>>>>>         &core_starpu_codelet_zgetrf_spmd,
>>>>>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>>>>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>>>         STARPU_RW, hPiv,
>>>>>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>>>>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>>>>>         STARPU_VALUE, &ib, sizeof(int),
>>>>>>         STARPU_VALUE, &mtpf, sizeof(int),
>>>>>>         STARPU_VALUE, &k, sizeof(int),
>>>>>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>>>>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>>>>>         STARPU_NAME, "zgetrf",
>>>>>>         0);
>>>>>>
>>>>>>     STARPU_CHECK_RETURN_VALUE(retval,
>>>>>>         "core_starpu_zgetrf: starpu_task_insert() failed");
>>>>>> }
>>>>>>
>>>>>> —
>>>>>> Best wishes,
>>>>>> Maxim
>>>>>>
>>>>>> <lu_panel_fact.jpg>
>>>>>>
>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>
>>>>>>> On 24 Jan 2018, at 17:52, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hello Samuel,
>>>>>>>
>>>>>>> Thank you very much! Yes, in this particular use-case “STARPU_NONE”
>>>>>>> would come in handy and make the source code much more “elegant”.
>>>>>>>
>>>>>>> —
>>>>>>> Best wishes,
>>>>>>> Maxim
>>>>>>>
>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>
>>>>>>>> On 24 Jan 2018, at 17:47, Samuel Thibault <samuel.thibault@inria.fr>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Maxim Abalenkov, on Mon, 15 Jan 2018 18:04:48 +0000, wrote:
>>>>>>>>> I have a very simple question. What is the overhead of using the
>>>>>>>>> STARPU_NONE
>>>>>>>>> access mode for some handles in the STARPU_DATA_MODE_ARRAY?
>>>>>>>>
>>>>>>>> It is not implemented; we hadn't thought it could be useful. I have
>>>>>>>> now added it to the TODO list (but that list is very long and
>>>>>>>> doesn't tend to progress quickly).
>>>>>>>>
>>>>>>>> The overhead would be quite small: StarPU would just write it down in
>>>>>>>> the array of data to fetch, and just not process that element. Of
>>>>>>>> course
>>>>>>>> the theoretical complexity will be O(number of data).
>>>>>>>>
>>>>>>>>> In order to avoid using complicated offsets in my computation
>>>>>>>>> routines
>>>>>>>>> I would like to pass them a column of matrix tiles, while setting
>>>>>>>>> the
>>>>>>>>> “unused” tiles to “STARPU_NONE”.
>>>>>>>>
>>>>>>>> I see.
>>>>>>>>
>>>>>>>> Samuel
>>>>
>>>
>>
>