- From: Olivier Aumage <olivier.aumage@inria.fr>
- To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
- Date: Fri, 23 Feb 2018 16:12:03 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi Maxim,
What do you obtain when you try to run the "vector_scal_spmd" example program
of StarPU on your machine with the following two command lines?
STARPU_SINGLE_COMBINED_WORKER=0 STARPU_SCHED=peager ./install-trunk-cpu/lib/starpu/examples/vector_scal_spmd
STARPU_SINGLE_COMBINED_WORKER=1 STARPU_SCHED=peager ./install-trunk-cpu/lib/starpu/examples/vector_scal_spmd
The first one should run parallel tasks of varying sizes, while the second
one should run tasks at the maximum parallel size only. That is what I
currently obtain on our machine.
Best regards,
--
Olivier
> On 23 Feb 2018, at 15:16, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>
> Hello Olivier,
>
> I have been testing my code with the new “peager” for a while now. I think
> there might be a mistake in the policy implementation. My LU code “hangs”
> when using peager with multiple coworker threads. Thank you.
>
> —
> Best wishes,
> Maxim
>
> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>
>> On 23 Feb 2018, at 12:29, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>> wrote:
>>
>> Hello Olivier,
>>
>> Thank you very much for your email. I have “un-applied” your patch and
>> “pulled” the new version of the “peager” implementation. Unfortunately, it
>> seems that setting the “max_parallelism” attribute of the static codelet
>> struct no longer works, i.e. prior to inserting a new task I specify this
>> value as:
>>
>> codelet_spmd.max_parallelism = N;
>>
>> Then in the “CPU” function I call:
>>
>> int n = starpu_combined_worker_get_size();
>>
>> However, the value returned is always 1.
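>>
>> For completeness, here is the shape of what I am doing, reduced to a minimal
>> standalone sketch (illustrative names, not my actual PLASMA code):
>>
>> #include <stdio.h>
>> #include <starpu.h>
>>
>> // Minimal SPMD codelet: every worker of the combined team runs this function
>> static void spmd_func(void *buffers[], void *cl_arg)
>> {
>>     (void) buffers;
>>     (void) cl_arg;
>>     int size = starpu_combined_worker_get_size(); // expected > 1 under peager
>>     int rank = starpu_combined_worker_get_rank(); // 0 <= rank < size
>>     printf("SPMD instance %d of %d\n", rank, size);
>> }
>>
>> static struct starpu_codelet codelet_spmd =
>> {
>>     .type      = STARPU_SPMD,
>>     .cpu_funcs = { spmd_func },
>>     .nbuffers  = 0,
>> };
>>
>> // ... and right before submission:
>> // codelet_spmd.max_parallelism = N;      // N = desired team size
>> // starpu_task_insert(&codelet_spmd, 0);
>>
>> With this reduced example I would expect size to be greater than 1 whenever
>> the scheduler builds a team, which is not what I observe.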
>>
>> —
>> Best wishes,
>> Maxim
>>
>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>
>>> On 22 Feb 2018, at 14:54, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>>
>>> Hi Maxim,
>>>
>>> Earlier this week I implemented a new algorithm for the Parallel Eager
>>> scheduling policy. It is available in the master/ branch and the
>>> starpu-1.3/ branch, and it replaces the previous "peager" implementation.
>>>
>>> It is not meant to be especially clever, since it is still a greedy
>>> scheduler, but it is able to schedule parallel tasks while also running
>>> several tasks in parallel.
>>>
>>> You can perhaps use it as a basis for a scheduler more customized to your
>>> needs, either by acting on some parameters (the combined workers' min and
>>> max sizes, the codelet's max_parallelism) or by copying and extending it.
>>>
>>> There is indeed a very application-dependent trade-off to find between
>>> favoring intra-task parallelism and favoring inter-task parallelism.
>>> Favoring intra-task parallelism means that the scheduler will try harder
>>> to build a parallel team of workers to work on a single task (at the cost
>>> of potentially increased idle time while building such a team). On the
>>> contrary, favoring inter-task parallelism means that the scheduler will
>>> try to schedule tasks ASAP, even if only small parallel teams are
>>> available.
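>>>
>>> For instance, without touching the scheduler itself you can bound the size
>>> of the combined worker teams that peager may build. A minimal sketch, if I
>>> recall the environment variable names correctly (STARPU_MIN_WORKERSIZE /
>>> STARPU_MAX_WORKERSIZE, read at initialization time):
>>>
>>> #include <stdlib.h>
>>> #include <starpu.h>
>>>
>>> int main(void)
>>> {
>>>     /* Bias the trade-off: teams of 2 to 4 workers instead of one big team */
>>>     setenv("STARPU_SCHED", "peager", 1);
>>>     setenv("STARPU_MIN_WORKERSIZE", "2", 1);
>>>     setenv("STARPU_MAX_WORKERSIZE", "4", 1);
>>>
>>>     if (starpu_init(NULL) != 0)
>>>         return 1;
>>>
>>>     /* ... submit tasks; each codelet can further cap its own team size
>>>      * through its max_parallelism field ... */
>>>
>>>     starpu_shutdown();
>>>     return 0;
>>> }
>>>
>>> In practice you would of course set these variables in your job script
>>> rather than in the code; the point is only which knobs exist.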
>>>
>>> Note: you should remove the temporary patch I sent earlier, before
>>> pulling the new implementation.
>>>
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>> On 13 Feb 2018, at 23:14, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>>>
>>>> Hi Maxim,
>>>>
>>>> In the Fork-Join mode, StarPU indeed launches the task only on the
>>>> master thread of the team. This is the intended behavior of this mode,
>>>> because it is designed for tasks that launch their own computing
>>>> threads. This is the case, for instance, when the parallel task is a
>>>> kernel from a "black-box" parallel library, or a kernel written with a
>>>> third-party OpenMP runtime. If you have a look at Section "6.10.1
>>>> Fork-mode Parallel Tasks" of the StarPU manual, you will see that the
>>>> example given is an OpenMP kernel.
>>>>
>>>> Moreover, in this Fork-Join mode, StarPU sets a CPU mask just before
>>>> starting the parallel task on the master thread. The parallel task can
>>>> then use this mask to know how many computing threads to launch and to
>>>> which CPU cores they should be bound.
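>>>>
>>>> For illustration, a fork-mode kernel in the spirit of the manual's OpenMP
>>>> example looks roughly as follows (a generic sketch, not adapted to your
>>>> GETRF kernel):
>>>>
>>>> #include <limits.h>
>>>> #include <starpu.h>
>>>>
>>>> /* Fork-mode kernel: StarPU calls it on the master worker only, with the
>>>>  * CPU mask of the whole team already set; the kernel spawns its own
>>>>  * OpenMP threads inside that mask. */
>>>> static void scal_cpu_func(void *buffers[], void *cl_arg)
>>>> {
>>>>     float *factor = (float *) cl_arg;
>>>>     struct starpu_vector_interface *v = buffers[0];
>>>>     unsigned n = STARPU_VECTOR_GET_NX(v);
>>>>     float *val = (float *) STARPU_VECTOR_GET_PTR(v);
>>>>
>>>> #pragma omp parallel for num_threads(starpu_combined_worker_get_size())
>>>>     for (unsigned i = 0; i < n; i++)
>>>>         val[i] *= *factor;
>>>> }
>>>>
>>>> static struct starpu_codelet scal_cl =
>>>> {
>>>>     .where           = STARPU_CPU,
>>>>     .type            = STARPU_FORKJOIN,
>>>>     .max_parallelism = INT_MAX,
>>>>     .cpu_funcs       = { scal_cpu_func },
>>>>     .nbuffers        = 1,
>>>>     .modes           = { STARPU_RW },
>>>> };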
>>>>
>>>> Best regards,
>>>> --
>>>> Olivier
>>>>
>>>>> On 13 Feb 2018, at 22:44, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>
>>>>> Dear Olivier,
>>>>>
>>>>> Thank you very much for your help and explanation. Please find attached
>>>>> an updated performance plot. I have included the data from applying the
>>>>> “lws” scheduling policy with 1 thread per panel factorisation. It is
>>>>> interesting to note that the performance of the algorithm under “lws”
>>>>> almost reaches the performance of “peager”, which shows that “peager” is
>>>>> potentially suboptimal.
>>>>>
>>>>> I have also tried to experiment with the Fork-Join approach and
>>>>> “peager”, but unfortunately without any success. The code hangs and does
>>>>> not proceed. From my own debugging it seems that at the first iteration,
>>>>> for the first panel factorisation, only the master thread is launched
>>>>> instead of 4 threads. For the subsequent panels all 4 threads enter the
>>>>> panel factorisation kernel, the “core” routine, but then they hang. The
>>>>> same happens when I use the other parallel scheduling policy, “pheft”.
>>>>>
>>>>> The StarPU functions that I use to test the Fork-Join approach are given
>>>>> below:
>>>>>
>>>>> /******************************************************************************/
>>>>> // StarPU ZGETRF CPU kernel (Fork-Join)
>>>>> static void core_starpu_cpu_zgetrf_fj(void *desc[], void *cl_arg) {
>>>>>
>>>>>     plasma_desc_t A;
>>>>>     int ib, k, *piv;
>>>>>     volatile int *max_idx, *info;
>>>>>     volatile plasma_complex64_t *max_val;
>>>>>     plasma_barrier_t *barrier;
>>>>>
>>>>>     // Unpack scalar arguments
>>>>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &k,
>>>>>                                &info, &barrier);
>>>>>
>>>>>     int mtpf = starpu_combined_worker_get_size();
>>>>>
>>>>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>     plasma_complex64_t **pnlK =
>>>>>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>>>>     assert(pnlK != NULL);
>>>>>
>>>>>     // Unpack tile data
>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>>>>     }
>>>>>
>>>>>     // Unpack pivots vector
>>>>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>>>>
>>>>>     // Call computation kernel
>>>>>     #pragma omp parallel
>>>>>     #pragma omp master
>>>>>     {
>>>>>         #pragma omp taskloop untied shared(barrier) num_tasks(mtpf) priority(2)
>>>>>         for (int rank = 0; rank < mtpf; rank++) {
>>>>>             core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>>>>                         ib, rank, mtpf, info, barrier);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     // Deallocate container panel
>>>>>     free(pnlK);
>>>>> }
>>>>>
>>>>> /******************************************************************************/
>>>>> // StarPU codelet (Fork-Join)
>>>>> static struct starpu_codelet core_starpu_codelet_zgetrf_fj =
>>>>> {
>>>>> .where = STARPU_CPU,
>>>>> .type = STARPU_FORKJOIN,
>>>>> .cpu_funcs = { core_starpu_cpu_zgetrf_fj },
>>>>> .cpu_funcs_name = { "zgetrf_fj" },
>>>>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>>>>> .name = "zgetrf_cl_fj"
>>>>> };
>>>>>
>>>>> /******************************************************************************/
>>>>> // StarPU task inserter (Fork-Join)
>>>>> void core_starpu_zgetrf_fj(plasma_desc_t A, starpu_data_handle_t hPiv,
>>>>>                            volatile int *max_idx,
>>>>>                            volatile plasma_complex64_t *max_val,
>>>>>                            int ib, int mtpf, int k, int prio,
>>>>>                            volatile int *info, plasma_barrier_t *barrier)
>>>>> {
>>>>>     // Set maximum no. of threads per panel factorisation
>>>>>     core_starpu_codelet_zgetrf_fj.max_parallelism = mtpf;
>>>>>
>>>>>     // Pointer to first (top) tile in panel k
>>>>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>>>>
>>>>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>         (pk+i)->mode = STARPU_RW;
>>>>>     }
>>>>>
>>>>>     int retval = starpu_task_insert(
>>>>>         &core_starpu_codelet_zgetrf_fj,
>>>>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>>>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>>         STARPU_RW, hPiv,
>>>>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>>>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>>>>         STARPU_VALUE, &ib, sizeof(int),
>>>>>         STARPU_VALUE, &k, sizeof(int),
>>>>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>>>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>>>>         STARPU_NAME, "zgetrf_fj",
>>>>>         0);
>>>>>
>>>>>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf_fj: starpu_task_insert() failed");
>>>>> }
>>>>>
>>>>> —
>>>>> Best wishes,
>>>>> Maxim
>>>>>
>>>>> <haswell_dgetrf_starpu_spmd_lws.pdf>
>>>>>
>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>
>>>>>> On 13 Feb 2018, at 18:09, Olivier Aumage <olivier.aumage@inria.fr>
>>>>>> wrote:
>>>>>>
>>>>>> [missing forward to the list]
>>>>>>
>>>>>>> Begin forwarded message:
>>>>>>>
>>>>>>> From: Olivier Aumage <olivier.aumage@inria.fr>
>>>>>>> Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
>>>>>>> Date: 13 February 2018 at 19:06:01 UTC+1
>>>>>>> To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>>>>
>>>>>>> Hi Maxim,
>>>>>>>
>>>>>>> It is actually expected that the patch benefit is low.
>>>>>>>
>>>>>>> The main issue with 'peager' is that the initialization phase builds a
>>>>>>> table indicating, for each worker, which worker is the master of its
>>>>>>> parallel team. However, this table is filled iteratively, for teams of
>>>>>>> increasing sizes up to the team containing all workers. Thus, every
>>>>>>> worker ends up being assigned worker 0 as its master. The result is
>>>>>>> that only worker 0 fetches parallel tasks in the unpatched version, and
>>>>>>> the tasks are therefore serialized with respect to each other. This is
>>>>>>> why you obtained a flat scalability plot with that version.
>>>>>>>
>>>>>>> The small patch I sent simply limits the size of the worker teams, to
>>>>>>> avoid having every worker under the control of worker 0. I put an
>>>>>>> arbitrary limit of 4 workers per group in the patch.
>>>>>>>
>>>>>>> Of course, this is only temporary. I am away from the office this
>>>>>>> week, and I need to check with some colleagues why the code was
>>>>>>> organized this way, and whether the peager implementation made
>>>>>>> assumptions about other parts of the StarPU core that no longer hold.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> --
>>>>>>> Olivier
>>>>>>>
>>>>>>>> On 13 Feb 2018, at 17:36, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Dear Olivier,
>>>>>>>>
>>>>>>>> Please find attached a plot of my experiments with various numbers of
>>>>>>>> SPMD threads working on the LU panel factorisation. Using more threads
>>>>>>>> is beneficial, but unfortunately the benefit is minuscule. I have also
>>>>>>>> implemented the Fork-Join approach, wrapping the panel factorisation
>>>>>>>> done with OpenMP. I will show you the results soon. Thank you very
>>>>>>>> much for your help!
>>>>>>>>
>>>>>>>> —
>>>>>>>> Best wishes,
>>>>>>>> Maxim
>>>>>>>>
>>>>>>>> <haswell_dgetrf_starpu_spmd.pdf>
>>>>>>>>
>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>
>>>>>>>>> On 12 Feb 2018, at 22:01, Olivier Aumage <olivier.aumage@inria.fr>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Maxim,
>>>>>>>>>
>>>>>>>>> Regarding the issue with 'peager' scalability, the unpatched master
>>>>>>>>> branch should be similar to the 1.2.3 version. However, since
>>>>>>>>> 'peager' is still considered experimental, it is probably better to
>>>>>>>>> switch to the master branch, as fixes will likely arrive there
>>>>>>>>> first.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> --
>>>>>>>>> Olivier
>>>>>>>>>
>>>>>>>>>> On 12 Feb 2018, at 22:50, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Olivier,
>>>>>>>>>>
>>>>>>>>>> I’m using version 1.2.3, downloaded from the INRIA website. Would it
>>>>>>>>>> be better to use the “rolling” edition? I will install it tomorrow
>>>>>>>>>> morning!
>>>>>>>>>>
>>>>>>>>>> —
>>>>>>>>>> Best wishes,
>>>>>>>>>> Maxim
>>>>>>>>>>
>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>
>>>>>>>>>>> On 12 Feb 2018, at 21:46, Olivier Aumage
>>>>>>>>>>> <olivier.aumage@inria.fr> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Maxim,
>>>>>>>>>>>
>>>>>>>>>>> My patch was against StarPU's master branch as of Saturday morning.
>>>>>>>>>>> Which version of StarPU are you currently using?
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> --
>>>>>>>>>>> Olivier
>>>>>>>>>>>
>>>>>>>>>>>> On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Olivier,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much for your reply and the patch. I have applied
>>>>>>>>>>>> the patch to the code and will re-run the experiments. I will
>>>>>>>>>>>> get back to you with the results. I think one of the changes in
>>>>>>>>>>>> the patch wasn’t successful. Please find below the output of the
>>>>>>>>>>>> patch command and the file with the rejects. Thank you and have
>>>>>>>>>>>> a good day ahead!
>>>>>>>>>>>>
>>>>>>>>>>>> —
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>> Maksims
>>>>>>>>>>>>
>>>>>>>>>>>> <log>
>>>>>>>>>>>> <parallel_eager.c.rej>
>>>>>>>>>>>>
>>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10 Feb 2018, at 11:17, Olivier Aumage
>>>>>>>>>>>>> <olivier.aumage@inria.fr> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Maxim,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not familiar with the peager implementation of StarPU (nor,
>>>>>>>>>>>>> I believe, is Samuel). I have had a quick look at the peager
>>>>>>>>>>>>> policy code, and there seems to be an issue with the
>>>>>>>>>>>>> initialization phase of the policy. Or perhaps I do not get the
>>>>>>>>>>>>> rationale of it...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you check whether the attached quick patch improves the
>>>>>>>>>>>>> scalability of your code? You can apply it with the following
>>>>>>>>>>>>> command:
>>>>>>>>>>>>> $ patch -p1 <../peager.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is only meant to be a temporary fix, however. I need to
>>>>>>>>>>>>> check with people who wrote the code about what the initial
>>>>>>>>>>>>> intent was.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>>
>>>>>>>>>>>>> <peager.patch>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have implemented the parallel panel factorisation in LU with
>>>>>>>>>>>>>> StarPU's SPMD capability. Here are a few answers to my own
>>>>>>>>>>>>>> questions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Am I passing the barrier structure correctly, so that it
>>>>>>>>>>>>>>> is “shared" amongst all the threads and the threads “know”
>>>>>>>>>>>>>>> about the status of the other threads. To achieve this I pass
>>>>>>>>>>>>>>> the barrier structure by reference.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, it is passed correctly. All other threads "know about”
>>>>>>>>>>>>>> and share the values inside the barrier structure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) Maybe it is the tile descriptors that “block” the
>>>>>>>>>>>>>>> execution of the threads inside the panel? Maybe the threads
>>>>>>>>>>>>>>> with ranks 1, 2 can not proceed, since all the tiles are
>>>>>>>>>>>>>>> blocked by rank 0? Therefore, I can make a conclusion that
>>>>>>>>>>>>>>> “blocking” the tiles like I do is incorrect?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tile “blocking” is correct. The problem did not lie in the tile
>>>>>>>>>>>>>> “blocking”, but rather in the use of a non-parallel StarPU
>>>>>>>>>>>>>> scheduler. According to the StarPU handbook, only two schedulers,
>>>>>>>>>>>>>> “pheft” and “peager”, support the SPMD mode of execution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) Is there a way to pass a variable to the codelet to set
>>>>>>>>>>>>>>> the “max_parallelism” value instead of hard-coding it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since the codelet is a static structure, I set the maximum number
>>>>>>>>>>>>>> of threads by assigning the “max_parallelism” field as follows,
>>>>>>>>>>>>>> right before inserting the SPMD task:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // Set maximum no. of threads per panel factorisation
>>>>>>>>>>>>>> core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please find attached a performance plot of LU factorisation (with
>>>>>>>>>>>>>> and without SPMD functionality) executed on a 20-core Haswell
>>>>>>>>>>>>>> machine. I believe something is going terribly wrong, since the
>>>>>>>>>>>>>> SPMD performance numbers are so low. I have used the following
>>>>>>>>>>>>>> commands to execute the tests:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> export MKL_NUM_THREADS=20
>>>>>>>>>>>>>> export OMP_NUM_THREADS=20
>>>>>>>>>>>>>> export OMP_PROC_BIND=true
>>>>>>>>>>>>>> export STARPU_NCPU=20
>>>>>>>>>>>>>> export STARPU_SCHED=peager
>>>>>>>>>>>>>> export PLASMA_TUNING_FILENAME=...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=... --mtpf=... --iter=...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any insight and help in recovering the performance numbers
>>>>>>>>>>>>>> would be greatly appreciated. Thank you and have a good day!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> —
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Maxim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <haswell_dgetrf2_starpu.pdf>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 5 Feb 2018, at 12:16, Maxim Abalenkov
>>>>>>>>>>>>>>> <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m on a mission to apply the SPMD capability of StarPU
>>>>>>>>>>>>>>> (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks)
>>>>>>>>>>>>>>> to the panel factorisation stage of the LU algorithm. Please see
>>>>>>>>>>>>>>> the attached figure for an example of my scenario.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The matrix is viewed as a set of tiles (rectangular or square
>>>>>>>>>>>>>>> matrix blocks). A column of tiles is called a panel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the first stage of the LU algorithm I would like to take a
>>>>>>>>>>>>>>> panel, find the pivots, swap the necessary rows, and scale and
>>>>>>>>>>>>>>> update the underlying matrix elements. To track the dependencies
>>>>>>>>>>>>>>> I created tile descriptors that keep the information about the
>>>>>>>>>>>>>>> access mode and the tile handle. Essentially, the tile
>>>>>>>>>>>>>>> descriptors are used to “lock” the entire panel; all the
>>>>>>>>>>>>>>> operations inside are parallelised manually using a custom
>>>>>>>>>>>>>>> barrier and auxiliary arrays that store the maximum values and
>>>>>>>>>>>>>>> their indices. To assign a particular part of the work to a
>>>>>>>>>>>>>>> thread (processing the panel factorisation) I use ranks.
>>>>>>>>>>>>>>> Depending on its rank, each thread gets its portion of the data
>>>>>>>>>>>>>>> to work on. Inside the panel the threads are synchronised
>>>>>>>>>>>>>>> manually and wait for each other at the custom barrier.
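>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Conceptually, such a custom barrier can be as simple as a
>>>>>>>>>>>>>>> counter-based spin barrier. A simplified sketch for illustration
>>>>>>>>>>>>>>> only (not the exact implementation I use):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> typedef struct {
>>>>>>>>>>>>>>>     volatile int count;  // threads still to arrive in this phase
>>>>>>>>>>>>>>>     volatile int phase;  // flipped by the last thread to arrive
>>>>>>>>>>>>>>>     int size;            // number of participating threads
>>>>>>>>>>>>>>> } spin_barrier_t;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> static void spin_barrier_init(spin_barrier_t *b, int size) {
>>>>>>>>>>>>>>>     b->count = size;
>>>>>>>>>>>>>>>     b->phase = 0;
>>>>>>>>>>>>>>>     b->size  = size;
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> static void spin_barrier_wait(spin_barrier_t *b) {
>>>>>>>>>>>>>>>     int my_phase = b->phase;
>>>>>>>>>>>>>>>     if (__sync_sub_and_fetch(&b->count, 1) == 0) {
>>>>>>>>>>>>>>>         b->count = b->size;           // last thread resets the counter
>>>>>>>>>>>>>>>         __sync_synchronize();
>>>>>>>>>>>>>>>         b->phase = !my_phase;         // and releases the others
>>>>>>>>>>>>>>>     } else {
>>>>>>>>>>>>>>>         while (b->phase == my_phase)  // spin until the phase flips
>>>>>>>>>>>>>>>             ;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>> }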
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please pay attention to the attached figure. A panel consisting
>>>>>>>>>>>>>>> of five tiles is passed to the StarPU task. Imagine we have three
>>>>>>>>>>>>>>> threads processing the panel. To find the first pivot we assign
>>>>>>>>>>>>>>> the first column of each tile to a certain thread in a
>>>>>>>>>>>>>>> round-robin manner (0,1,2,0,1). Once the maximum per tile is
>>>>>>>>>>>>>>> found by each thread, the master thread (with rank 0) selects
>>>>>>>>>>>>>>> the global maximum. I would like to apply the SPMD capability of
>>>>>>>>>>>>>>> StarPU to process the panel and use a custom barrier inside.
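>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In other words, each thread covers the tiles assigned to it with
>>>>>>>>>>>>>>> a stride equal to the team size. A simplified sketch of that
>>>>>>>>>>>>>>> split (illustrative names, contiguous column storage assumed):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #include <math.h>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // Thread 'rank' of 'mtpf' scans the current column of tiles
>>>>>>>>>>>>>>> // rank, rank+mtpf, ... and records its local maximum; rank 0
>>>>>>>>>>>>>>> // later reduces max_val[0..mtpf-1] to the global pivot.
>>>>>>>>>>>>>>> static void local_colmax(const double *col, int nt, int mb,
>>>>>>>>>>>>>>>                          int rank, int mtpf,
>>>>>>>>>>>>>>>                          int *max_idx, double *max_val)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     max_val[rank] = -1.0;
>>>>>>>>>>>>>>>     max_idx[rank] = -1;
>>>>>>>>>>>>>>>     for (int t = rank; t < nt; t += mtpf)      // tiles in round-robin
>>>>>>>>>>>>>>>         for (int r = 0; r < mb; r++) {         // rows within tile t
>>>>>>>>>>>>>>>             int i = t*mb + r;
>>>>>>>>>>>>>>>             if (fabs(col[i]) > max_val[rank]) {
>>>>>>>>>>>>>>>                 max_val[rank] = fabs(col[i]);
>>>>>>>>>>>>>>>                 max_idx[rank] = i;
>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     // ... all threads then meet at the custom barrier before
>>>>>>>>>>>>>>>     // rank 0 picks the overall winner ...
>>>>>>>>>>>>>>> }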
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please consider the C code below. The code works, but the
>>>>>>>>>>>>>>> threads wait indefinitely at the first barrier. My questions
>>>>>>>>>>>>>>> are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Am I passing the barrier structure correctly, so that it
>>>>>>>>>>>>>>> is “shared" amongst all the threads and the threads “know”
>>>>>>>>>>>>>>> about the status of the other threads. To achieve this I pass
>>>>>>>>>>>>>>> the barrier structure by reference.
>>>>>>>>>>>>>>> 2) Maybe it is the tile descriptors that “block” the
>>>>>>>>>>>>>>> execution of the threads inside the panel? Maybe the threads
>>>>>>>>>>>>>>> with ranks 1, 2 can not proceed, since all the tiles are
>>>>>>>>>>>>>>> blocked by rank 0? Therefore, I can make a conclusion that
>>>>>>>>>>>>>>> “blocking” the tiles like I do is incorrect?
>>>>>>>>>>>>>>> 3) Is there a way to pass a variable to the codelet to set
>>>>>>>>>>>>>>> the “max_parallelism” value instead of hard-coding it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4) If I may, I would like to make a general comment. I like
>>>>>>>>>>>>>>> StarPU very much. I think you have invested a great deal of time
>>>>>>>>>>>>>>> and effort into it. Thank you. But to my mind the weakest point
>>>>>>>>>>>>>>> (from my user experience) is passing values to StarPU while
>>>>>>>>>>>>>>> inserting a task: there is no type checking of the variables.
>>>>>>>>>>>>>>> The same applies to the routine “starpu_codelet_unpack_args()”,
>>>>>>>>>>>>>>> when you want to obtain the values “on the other side”.
>>>>>>>>>>>>>>> Sometimes it becomes a nightmare and a trial-and-error exercise.
>>>>>>>>>>>>>>> If type checks could be enforced there, it would make a user's
>>>>>>>>>>>>>>> life much easier.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // StarPU LU panel factorisation function
>>>>>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>>>>>> void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
>>>>>>>>>>>>>>>                  volatile int *max_idx, volatile plasma_complex64_t *max_val,
>>>>>>>>>>>>>>>                  int ib, int rank, int mtpf, volatile int *info,
>>>>>>>>>>>>>>>                  plasma_barrier_t *barrier)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     ...
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>>>>>> // StarPU ZGETRF SPMD CPU kernel
>>>>>>>>>>>>>>> static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     plasma_desc_t A;
>>>>>>>>>>>>>>>     int ib, mtpf, k, *piv;
>>>>>>>>>>>>>>>     volatile int *max_idx, *info;
>>>>>>>>>>>>>>>     volatile plasma_complex64_t *max_val;
>>>>>>>>>>>>>>>     plasma_barrier_t *barrier;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Unpack scalar arguments
>>>>>>>>>>>>>>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
>>>>>>>>>>>>>>>                                &k, &info, &barrier);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     int rank = starpu_combined_worker_get_rank();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>>>>>>>>>>     plasma_complex64_t **pnlK =
>>>>>>>>>>>>>>>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>>>>>>>>>>>>>>     assert(pnlK != NULL);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     printf("Panel: %d\n", k);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Unpack tile data
>>>>>>>>>>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>>>>>>>>>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Unpack pivots vector
>>>>>>>>>>>>>>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Call computation kernel
>>>>>>>>>>>>>>>     core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>>>>>>>>>>>>>>                 ib, rank, mtpf, info, barrier);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Deallocate container panel
>>>>>>>>>>>>>>>     free(pnlK);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>>>>>> // StarPU SPMD codelet
>>>>>>>>>>>>>>> static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> .type = STARPU_SPMD,
>>>>>>>>>>>>>>> .max_parallelism = 2,
>>>>>>>>>>>>>>> .cpu_funcs = { core_starpu_cpu_zgetrf_spmd },
>>>>>>>>>>>>>>> .cpu_funcs_name = { "zgetrf_spmd" },
>>>>>>>>>>>>>>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>>>>>> // StarPU task inserter
>>>>>>>>>>>>>>> void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
>>>>>>>>>>>>>>>                              volatile int *max_idx,
>>>>>>>>>>>>>>>                              volatile plasma_complex64_t *max_val,
>>>>>>>>>>>>>>>                              int ib, int mtpf, int k,
>>>>>>>>>>>>>>>                              volatile int *info, plasma_barrier_t *barrier) {
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Pointer to first (top) tile in panel k
>>>>>>>>>>>>>>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>>>>>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>>>>>>>>>>         (pk+i)->mode = STARPU_RW;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     int retval = starpu_task_insert(
>>>>>>>>>>>>>>>         &core_starpu_codelet_zgetrf_spmd,
>>>>>>>>>>>>>>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>>>>>>>>>>>>>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>>>>>>>>>>>>         STARPU_RW, hPiv,
>>>>>>>>>>>>>>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>>>>>>>>>>>>>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>>>>>>>>>>>>>>         STARPU_VALUE, &ib, sizeof(int),
>>>>>>>>>>>>>>>         STARPU_VALUE, &mtpf, sizeof(int),
>>>>>>>>>>>>>>>         STARPU_VALUE, &k, sizeof(int),
>>>>>>>>>>>>>>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>>>>>>>>>>>>>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>>>>>>>>>>>>>>         STARPU_NAME, "zgetrf",
>>>>>>>>>>>>>>>         0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>> Maxim
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <lu_panel_fact.jpg>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 24 Jan 2018, at 17:52, Maxim Abalenkov
>>>>>>>>>>>>>>>> <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Samuel,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you very much! Yes, in this particular use case
>>>>>>>>>>>>>>>> “STARPU_NONE” would come in handy and make the source code much
>>>>>>>>>>>>>>>> more “elegant”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>> Maxim
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 24 Jan 2018, at 17:47, Samuel Thibault
>>>>>>>>>>>>>>>>> <samuel.thibault@inria.fr> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maxim Abalenkov, on Mon 15 Jan 2018 18:04:48 +0000, wrote:
>>>>>>>>>>>>>>>>>> I have a very simple question. What is the overhead of
>>>>>>>>>>>>>>>>>> using the STARPU_NONE
>>>>>>>>>>>>>>>>>> access mode for some handles in the STARPU_DATA_MODE_ARRAY?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is not implemented; we hadn't thought it could be useful. I
>>>>>>>>>>>>>>>>> have now added it to the TODO list (but that list is very long
>>>>>>>>>>>>>>>>> and doesn't tend to progress quickly).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The overhead would be quite small: StarPU would just write
>>>>>>>>>>>>>>>>> it down in
>>>>>>>>>>>>>>>>> the array of data to fetch, and just not process that
>>>>>>>>>>>>>>>>> element. Of course
>>>>>>>>>>>>>>>>> the theoretical complexity will be O(number of data).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In order to avoid using complicated offsets in my
>>>>>>>>>>>>>>>>>> computation routines
>>>>>>>>>>>>>>>>>> I would like to pass them a column of matrix tiles, while
>>>>>>>>>>>>>>>>>> setting the
>>>>>>>>>>>>>>>>>> “unused” tiles to “STARPU_NONE”.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Samuel
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Starpu-devel mailing list
>>>> Starpu-devel@lists.gforge.inria.fr
>>>> https://lists.gforge.inria.fr/mailman/listinfo/starpu-devel
>>>
>>
>