- From: Olivier Aumage <olivier.aumage@inria.fr>
- To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
- Date: Mon, 12 Feb 2018 22:46:19 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi Maxim,
My patch was against StarPU's master branch as of Saturday morning. Which
version of StarPU are you currently using?
Best regards,
--
Olivier
> On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com>
> wrote:
>
> Hello Olivier,
>
> Thank you very much for your reply and the patch. I have applied the patch
> to the code and will re-run the experiments. I will get back to you with
> the results. I think one of the changes in the patch did not apply cleanly.
> Please find below the output of the patch command and the file with the
> rejects. Thank you and have a good day ahead!
>
> --
> Best wishes,
> Maksims
>
> <log>
> <parallel_eager.c.rej>
>
> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>
>> On 10 Feb 2018, at 11:17, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>
>> Hi Maxim,
>>
>> I am not familiar with the peager implementation of StarPU (nor, I
>> believe, is Samuel). I have had a quick look at the peager policy code,
>> and there seems to be an issue with the initialization phase of the
>> policy. Or perhaps I am missing its rationale...
>>
>> Can you check whether the attached quick patch improves the scalability
>> of your code? You can apply it with the following command:
>> $ patch -p1 <../peager.patch
>>
>> This is only meant to be a temporary fix, however. I need to check with
>> the people who wrote the code about what the initial intent was.
>>
>> Hope this helps.
>> Best regards,
>> --
>> Olivier
>>
>> <peager.patch>
>>
>>> On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>> wrote:
>>>
>>> Dear all,
>>>
>>> I have implemented the parallel panel factorisation in LU with StarPU’s
>>> SPMD capability. Here are a few answers to my own questions:
>>>
>>>> 1) Am I passing the barrier structure correctly, so that it is “shared”
>>>> amongst all the threads and the threads “know” about the status of the
>>>> other threads? To achieve this I pass the barrier structure by reference.
>>>
>>> Yes, it is passed correctly. All threads share the barrier structure and
>>> “know about” the values inside it.
>>>
>>>> 2) Maybe it is the tile descriptors that “block” the execution of the
>>>> threads inside the panel? Maybe the threads with ranks 1 and 2 cannot
>>>> proceed, since all the tiles are blocked by rank 0? Should I conclude
>>>> that “blocking” the tiles like I do is incorrect?
>>>
>>> Tile “blocking” is correct. The problem did not lie in the tile
>>> “blocking”, but rather in the use of a non-parallel StarPU scheduler.
>>> According to the StarPU handbook, only two schedulers, “pheft” and
>>> “peager”, support the SPMD mode of execution.
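>>>
>>> For illustration, the scheduler can also be selected programmatically at
>>> initialization time, instead of via the STARPU_SCHED environment variable.
>>> A minimal sketch (error handling reduced to the essentials):
>>>
>>> #include <starpu.h>
>>>
>>> int main(void)
>>> {
>>>     struct starpu_conf conf;
>>>     starpu_conf_init(&conf);
>>>     // Request a parallel-task-aware scheduler; "pheft" would work too.
>>>     conf.sched_policy_name = "peager";
>>>     if (starpu_init(&conf) != 0)
>>>         return 1;
>>>     // ... insert SPMD tasks here ...
>>>     starpu_shutdown();
>>>     return 0;
>>> }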
>>>
>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>> “max_parallelism” value instead of hard-coding it?
>>>
>>> Since the codelet is a static structure, I set the maximum number of
>>> threads by assigning to its “max_parallelism” field right before
>>> inserting the SPMD task:
>>>
>>> // Set maximum no. of threads per panel factorisation
>>> core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
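>>>
>>> Inside the SPMD kernel, the team size actually granted can then be
>>> queried at run time rather than assumed to equal “max_parallelism”. A
>>> small sketch using StarPU's combined-worker query functions:
>>>
>>> // Inside the SPMD CPU kernel:
>>> int team_size = starpu_combined_worker_get_size(); // <= max_parallelism
>>> int rank      = starpu_combined_worker_get_rank(); // 0 .. team_size-1
>>> // Partition the panel rows among team_size workers according to rank.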
>>>
>>> Please find attached a performance plot of LU factorisation (with and
>>> without SPMD functionality) executed on a 20-core Haswell machine. I
>>> believe something goes terribly wrong, since the SPMD performance numbers
>>> are so low. I used the following commands to run the tests:
>>>
>>> export MKL_NUM_THREADS=20
>>> export OMP_NUM_THREADS=20
>>> export OMP_PROC_BIND=true
>>> export STARPU_NCPU=20
>>> export STARPU_SCHED=peager
>>> export PLASMA_TUNING_FILENAME=...
>>>
>>> numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=... --mtpf=... --iter=...
>>>
>>> Any insight and help in recovering the performance numbers would be
>>> greatly appreciated. Thank you and have a good day!
>>>
>>> --
>>> Best wishes,
>>> Maxim
>>>
>>> <haswell_dgetrf2_starpu.pdf>
>>>
>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>
>>>> On 5 Feb 2018, at 12:16, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I’m on a mission to apply the SPMD capability of StarPU
>>>> (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks)
>>>> to the panel factorisation stage of the LU algorithm. Please see the
>>>> attached figure for an example of my scenario.
>>>>
>>>> The matrix is viewed as a set of tiles (rectangular or square matrix
>>>> blocks). A column of tiles is called a panel.
>>>>
>>>> In the first stage of the LU algorithm I would like to take a panel,
>>>> find the pivots, swap the necessary rows, and scale and update the
>>>> underlying matrix elements. To track the dependencies I created tile
>>>> descriptors that keep the information about the access mode and the
>>>> tile handle. Essentially, the tile descriptors are used to “lock” the
>>>> entire panel; all the operations inside are parallelised manually using
>>>> a custom barrier and auxiliary arrays that store the maximum values and
>>>> their indices. To be able to assign a particular task to a thread
>>>> (processing the panel factorisation) I use ranks. Depending on its rank,
>>>> each thread gets its portion of the data to work on. Inside the panel
>>>> the threads are synchronised manually and wait for each other at the
>>>> custom barrier (a sketch of such a barrier follows below).
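>>>>
>>>> For reference, such a custom barrier is typically a sense-reversing spin
>>>> barrier. Below is a minimal sketch; it is an illustration only (with an
>>>> illustrative name), not the actual plasma_barrier_t implementation:
>>>>
>>>> #include <stdatomic.h>
>>>>
>>>> typedef struct {
>>>>     atomic_int count;  // threads still to arrive
>>>>     atomic_int sense;  // flips each time the barrier opens
>>>>     int        size;   // number of participating threads
>>>> } sketch_barrier_t;
>>>>
>>>> void sketch_barrier_wait(sketch_barrier_t *b)
>>>> {
>>>>     int my_sense = atomic_load(&b->sense);
>>>>     if (atomic_fetch_sub(&b->count, 1) == 1) {
>>>>         // Last thread to arrive: reset the count and release the others.
>>>>         atomic_store(&b->count, b->size);
>>>>         atomic_store(&b->sense, !my_sense);
>>>>     }
>>>>     else {
>>>>         while (atomic_load(&b->sense) == my_sense)
>>>>             ;  // spin until the barrier opens
>>>>     }
>>>> }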
>>>>
>>>> Please pay attention to the attached figure. A panel consisting of five
>>>> tiles is passed to the StarPU task. Imagine we have three threads
>>>> processing the panel. To find the first pivot we assign the first column
>>>> of each tile to a thread in a round-robin manner (0, 1, 2, 0, 1). Once
>>>> each thread has found the maximum over its tiles, the master thread
>>>> (with rank 0) selects the global maximum (a sketch of this reduction
>>>> follows below). I would like to apply the SPMD capability of StarPU to
>>>> process the panel and use a custom barrier inside.
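>>>>
>>>> A minimal sketch of that round-robin pivot search, using the names from
>>>> the code below; plasma_barrier_wait() is a hypothetical stand-in for the
>>>> custom barrier's wait routine, and cabs() from <complex.h> compares
>>>> magnitudes:
>>>>
>>>> // Each rank scans the first column of its tiles, assigned round-robin,
>>>> // and records its local winner in max_val[rank] / max_idx[rank].
>>>> for (int t = rank; t < A.mt; t += mtpf) {
>>>>     // ... scan column 0 of tile t, update max_val[rank], max_idx[rank] ...
>>>> }
>>>> plasma_barrier_wait(barrier);
>>>>
>>>> if (rank == 0) {
>>>>     // Master reduces the per-thread winners to the global pivot.
>>>>     int best = 0;
>>>>     for (int r = 1; r < mtpf; r++)
>>>>         if (cabs(max_val[r]) > cabs(max_val[best]))
>>>>             best = r;
>>>>     // max_idx[best] now identifies the global pivot row.
>>>> }
>>>> plasma_barrier_wait(barrier);  // all ranks observe the pivot after this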
>>>>
>>>> Please consider the C code below. The code runs, but the threads wait
>>>> indefinitely at the first barrier. My questions are:
>>>>
>>>> 1) Am I passing the barrier structure correctly, so that it is “shared”
>>>> amongst all the threads and the threads “know” about the status of the
>>>> other threads? To achieve this I pass the barrier structure by reference.
>>>> 2) Maybe it is the tile descriptors that “block” the execution of the
>>>> threads inside the panel? Maybe the threads with ranks 1 and 2 cannot
>>>> proceed, since all the tiles are blocked by rank 0? Should I conclude
>>>> that “blocking” the tiles like I do is incorrect?
>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>> “max_parallelism” value instead of hard-coding it?
>>>>
>>>> 4) If I may, I would like to make a general comment. I like StarPU very
>>>> much. I think you have invested a great deal of time and effort into it.
>>>> Thank you. But to my mind the weakest point (from my experience as a
>>>> user) is passing values to StarPU while inserting a task. There is no
>>>> type checking of the variables there. The same applies to the routine
>>>> “starpu_codelet_unpack_args()”, when you want to obtain the values “on
>>>> the other side”. Sometimes it becomes a nightmare and a trial-and-error
>>>> exercise. If type checks could be enforced there, it would make a user’s
>>>> life much easier. One workaround I am considering is sketched below.
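>>>>
>>>> The idea (a sketch, not a StarPU feature; the struct and its name are
>>>> illustrative) is to pack a single, explicitly typed argument struct, so
>>>> that only one STARPU_VALUE pair has to be kept in sync on both sides:
>>>>
>>>> // All scalar task arguments gathered in one typed struct.
>>>> typedef struct {
>>>>     plasma_desc_t A;
>>>>     int ib, mtpf, k;
>>>>     volatile int *max_idx;
>>>>     volatile plasma_complex64_t *max_val;
>>>>     volatile int *info;
>>>>     plasma_barrier_t *barrier;
>>>> } zgetrf_args_t;
>>>>
>>>> // Insertion side: a single STARPU_VALUE pair.
>>>> zgetrf_args_t args = { .A = A, .ib = ib, .mtpf = mtpf, .k = k,
>>>>                        .max_idx = max_idx, .max_val = max_val,
>>>>                        .info = info, .barrier = barrier };
>>>> starpu_task_insert(&core_starpu_codelet_zgetrf_spmd,
>>>>                    STARPU_VALUE, &args, sizeof(args),
>>>>                    STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>                    STARPU_RW, hPiv,
>>>>                    0);
>>>>
>>>> // Kernel side: a single unpack call.
>>>> zgetrf_args_t a;
>>>> starpu_codelet_unpack_args(cl_arg, &a);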
>>>>
>>>> // StarPU LU panel factorisation function
>>>> /******************************************************************************/
>>>> void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
>>>>                  volatile int *max_idx, volatile plasma_complex64_t *max_val,
>>>>                  int ib, int rank, int mtpf, volatile int *info,
>>>>                  plasma_barrier_t *barrier)
>>>> {
>>>>     …
>>>> }
>>>>
>>>> /******************************************************************************/
>>>> // StarPU ZGETRF SPMD CPU kernel
>>>> static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {
>>>>
>>>>     plasma_desc_t A;
>>>>     int ib, mtpf, k, *piv;
>>>>     volatile int *max_idx, *info;
>>>>     volatile plasma_complex64_t *max_val;
>>>>     plasma_barrier_t *barrier;
>>>>
>>>>     // Unpack scalar arguments
>>>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
>>>>                                &k, &info, &barrier);
>>>>
>>>>     // Rank of this worker within the combined (parallel) worker
>>>>     int rank = starpu_combined_worker_get_rank();
>>>>
>>>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>     plasma_complex64_t **pnlK =
>>>>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>>>     assert(pnlK != NULL);
>>>>
>>>>     printf("Panel: %d\n", k);
>>>>
>>>>     // Unpack tile data
>>>>     for (int i = 0; i < A.mt; i++) {
>>>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>>>     }
>>>>
>>>>     // Unpack pivots vector
>>>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>>>
>>>>     // Call computation kernel
>>>>     core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>>>                 ib, rank, mtpf, info, barrier);
>>>>
>>>>     // Deallocate container panel
>>>>     free(pnlK);
>>>> }
>>>>
>>>> /******************************************************************************/
>>>> // StarPU SPMD codelet
>>>> static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
>>>> {
>>>> .type = STARPU_SPMD,
>>>> .max_parallelism = 2,
>>>> .cpu_funcs = { core_starpu_cpu_zgetrf_spmd },
>>>> .cpu_funcs_name = { "zgetrf_spmd" },
>>>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>>>> };
>>>>
>>>> /******************************************************************************/
>>>> // StarPU task inserter
>>>> void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
>>>>                              volatile int *max_idx,
>>>>                              volatile plasma_complex64_t *max_val,
>>>>                              int ib, int mtpf, int k,
>>>>                              volatile int *info, plasma_barrier_t *barrier) {
>>>>
>>>>     // Pointer to first (top) tile in panel k
>>>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>>>
>>>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>     for (int i = 0; i < A.mt; i++) {
>>>>         (pk+i)->mode = STARPU_RW;
>>>>     }
>>>>
>>>>     int retval = starpu_task_insert(
>>>>         &core_starpu_codelet_zgetrf_spmd,
>>>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>         STARPU_RW, hPiv,
>>>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>>>         STARPU_VALUE, &ib, sizeof(int),
>>>>         STARPU_VALUE, &mtpf, sizeof(int),
>>>>         STARPU_VALUE, &k, sizeof(int),
>>>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>>>         STARPU_NAME, "zgetrf",
>>>>         0);
>>>>
>>>>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
>>>> }
>>>>
>>>> --
>>>> Best wishes,
>>>> Maxim
>>>>
>>>> <lu_panel_fact.jpg>
>>>>
>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>
>>>>> On 24 Jan 2018, at 17:52, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hello Samuel,
>>>>>
>>>>> Thank you very much! Yes, in this particular use case “STARPU_NONE”
>>>>> would come in handy and make the source code much more “elegant”.
>>>>>
>>>>> --
>>>>> Best wishes,
>>>>> Maxim
>>>>>
>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>
>>>>>> On 24 Jan 2018, at 17:47, Samuel Thibault <samuel.thibault@inria.fr>
>>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Maxim Abalenkov, on Mon, 15 Jan 2018 18:04:48 +0000, wrote:
>>>>>>> I have a very simple question. What is the overhead of using the
>>>>>>> STARPU_NONE
>>>>>>> access mode for some handles in the STARPU_DATA_MODE_ARRAY?
>>>>>>
>>>>>> It is not implemented; we hadn't thought it could be useful. I have now
>>>>>> added it to the TODO list (but that list is very long and doesn't tend
>>>>>> to progress quickly).
>>>>>>
>>>>>> The overhead would be quite small: StarPU would just write the entry
>>>>>> down in the array of data to fetch and simply not process that element.
>>>>>> Of course, the theoretical complexity will still be O(number of data).
>>>>>>
>>>>>>> In order to avoid using complicated offsets in my computation routines
>>>>>>> I would like to pass them a column of matrix tiles, while setting the
>>>>>>> “unused” tiles to “STARPU_NONE”.
>>>>>>
>>>>>> I see.
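>>>>>>
>>>>>> In the meantime, a possible workaround (an untested sketch; ntiles,
>>>>>> all_descrs, cl, MAX_TILES and tile_is_used() are all illustrative
>>>>>> names) is to compact the descriptor array so it contains only the
>>>>>> tiles actually used, and to pass their original positions by value:
>>>>>>
>>>>>> // Keep only the used tiles in the descriptor array.
>>>>>> struct starpu_data_descr cdesc[MAX_TILES];
>>>>>> int idx_map[MAX_TILES];  // original positions in the panel
>>>>>> int used = 0;
>>>>>> for (int i = 0; i < ntiles; i++) {
>>>>>>     if (tile_is_used(i)) {
>>>>>>         cdesc[used]   = all_descrs[i];
>>>>>>         idx_map[used] = i;
>>>>>>         used++;
>>>>>>     }
>>>>>> }
>>>>>> starpu_task_insert(&cl,
>>>>>>                    STARPU_DATA_MODE_ARRAY, cdesc, used,
>>>>>>                    STARPU_VALUE, idx_map, sizeof(int) * used,
>>>>>>                    0);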
>>>>>>
>>>>>> Samuel
>>
>