
Re: [Starpu-devel] [LU factorisation: gdb debug output]


  • From: Olivier Aumage <olivier.aumage@inria.fr>
  • To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
  • Date: Tue, 13 Feb 2018 23:14:18 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi Maxim,

In the Fork-Join mode, StarPU indeed only launches the task on the master
thread of the team. This is the intended behavior of this mode, because it is
designed for tasks that launch their own computing threads. This is the case,
for instance, when the parallel task is a kernel from a "black-box" parallel
library, or a kernel written with a third-party OpenMP runtime. If you have a
look at Section "6.10.1 Fork-mode Parallel Tasks" of the StarPU manual, you
will see that the example given is an OpenMP kernel.

Moreover, in this Fork-Join mode, StarPU sets a CPU mask just before starting
the parallel task on the master thread. The parallel task can then use this
mask to know how many computing threads to launch, and which CPU cores they
should be bound to.
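
For illustration, here is a minimal sketch of such a fork-mode kernel, in the
spirit of the manual's example (a simple vector scaling; the names and the
codelet details below are only illustrative):

#include <limits.h>
#include <starpu.h>

// Fork-mode CPU kernel: StarPU calls it only on the master thread of the
// combined worker; the kernel then spawns its own OpenMP threads, sized to
// the team that StarPU reserved for this task (compile with -fopenmp).
static void scal_cpu_func(void *buffers[], void *cl_arg)
{
    float *factor = (float *) cl_arg;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *) STARPU_VECTOR_GET_PTR(vector);

    #pragma omp parallel for num_threads(starpu_combined_worker_get_size())
    for (unsigned i = 0; i < n; i++)
        val[i] *= *factor;
}

static struct starpu_codelet scal_cl =
{
    .where = STARPU_CPU,
    .type = STARPU_FORKJOIN,     // fork-mode parallel task
    .max_parallelism = INT_MAX,  // let the scheduler choose the team size
    .cpu_funcs = { scal_cpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

A task using this codelet is inserted like any other; when a parallel-aware
scheduler hands it a combined worker, only the master thread enters
scal_cpu_func and the OpenMP runtime does the fork.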

Best regards,
--
Olivier

> On 13 Feb 2018, at 22:44, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>
> Dear Olivier,
>
> Thank you very much for your help and explanation. Please find attached an
> updated performance plot. I have included the data obtained with the “lws”
> scheduling policy and 1 thread per panel factorisation. It is interesting to
> note that the performance of the algorithm under “lws” almost reaches the
> performance of “peager”, which suggests that “peager” is potentially
> suboptimal.
>
> I have also tried to experiment with the Fork-Join approach and “peager”,
> but unfortunately without any success. The code hangs and does not proceed.
> From my own debugging it seems that at the first iteration, for the first
> panel factorisation, only the master thread is launched instead of 4 threads.
> For the subsequent panels all 4 threads enter the panel factorisation kernel,
> the “core” routine, but then they hang. The same happens when I use the other
> parallel scheduling policy, “pheft”.
>
> The StarPU functions that I use to test the Fork-Join approach are given
> below:
>
> /******************************************************************************/
> // StarPU ZGETRF CPU kernel (Fork-Join)
> static void core_starpu_cpu_zgetrf_fj(void *desc[], void *cl_arg) {
>
>     plasma_desc_t A;
>     int ib, k, *piv;
>     volatile int *max_idx, *info;
>     volatile plasma_complex64_t *max_val;
>     plasma_barrier_t *barrier;
>
>     // Unpack scalar arguments
>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &k,
>                                &info, &barrier);
>
>     int mtpf = starpu_combined_worker_get_size();
>
>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>     plasma_complex64_t **pnlK =
>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>     assert(pnlK != NULL);
>
>     // Unpack tile data
>     for (int i = 0; i < A.mt; i++) {
>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>     }
>
>     // Unpack pivots vector
>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>
>     // Call computation kernel
>     #pragma omp parallel
>     #pragma omp master
>     {
>         #pragma omp taskloop untied shared(barrier) num_tasks(mtpf) priority(2)
>         for (int rank = 0; rank < mtpf; rank++) {
>             core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>                         ib, rank, mtpf, info, barrier);
>         }
>     }
>
>     // Deallocate container panel
>     free(pnlK);
> }
>
> /******************************************************************************/
> // StarPU codelet (Fork-Join)
> static struct starpu_codelet core_starpu_codelet_zgetrf_fj =
> {
> .where = STARPU_CPU,
> .type = STARPU_FORKJOIN,
> .cpu_funcs = { core_starpu_cpu_zgetrf_fj },
> .cpu_funcs_name = { "zgetrf_fj" },
> .nbuffers = STARPU_VARIABLE_NBUFFERS,
> .name = "zgetrf_cl_fj"
> };
>
> /******************************************************************************/
> // StarPU task inserter (Fork-Join)
> void core_starpu_zgetrf_fj(plasma_desc_t A, starpu_data_handle_t hPiv,
>                            volatile int *max_idx,
>                            volatile plasma_complex64_t *max_val,
>                            int ib, int mtpf, int k, int prio,
>                            volatile int *info, plasma_barrier_t *barrier) {
>
>     // Set maximum no. of threads per panel factorisation
>     core_starpu_codelet_zgetrf_fj.max_parallelism = mtpf;
>
>     // Pointer to first (top) tile in panel k
>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>
>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>     for (int i = 0; i < A.mt; i++) {
>         (pk+i)->mode = STARPU_RW;
>     }
>
>     int retval = starpu_task_insert(
>         &core_starpu_codelet_zgetrf_fj,
>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>         STARPU_RW, hPiv,
>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>         STARPU_VALUE, &ib, sizeof(int),
>         STARPU_VALUE, &k, sizeof(int),
>         STARPU_VALUE, &info, sizeof(volatile int*),
>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>         STARPU_NAME, "zgetrf_fj",
>         0);
>
>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf_fj: starpu_task_insert() failed");
> }
>
>
> Best wishes,
> Maxim
>
> <haswell_dgetrf_starpu_spmd_lws.pdf>
>
> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>
>> On 13 Feb 2018, at 18:09, Olivier Aumage <olivier.aumage@inria.fr> wrote:
>>
>> [missing forward to the list]
>>
>>> Begin forwarded message:
>>>
>>> From: Olivier Aumage <olivier.aumage@inria.fr>
>>> Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
>>> Date: 13 February 2018 at 19:06:01 UTC+1
>>> To: Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>
>>> Hi Maxim,
>>>
>>> It is actually expected that the benefit of the patch is low.
>>>
>>> The main issue with 'peager' is that the initialization phase builds a
>>> table indicating, for each worker, who is the master of its parallel team.
>>> However, this is iterated for teams of increasing size, up to the team
>>> containing all workers. Thus, every worker ends up being assigned worker 0
>>> as its master. The result is that only worker 0 fetches parallel tasks in
>>> the unpatched version, and the tasks are therefore serialized with respect
>>> to each other. This is why you obtained a flat scalability plot with that
>>> version.
>>>
>>> The small patch I sent simply limits the size of the worker teams, to avoid
>>> having every worker under the control of worker 0. I put an arbitrary limit
>>> of 4 workers per group in the patch.
>>>
>>> Of course, this is only temporary. I am away from the office this week,
>>> and I need to check with some colleagues about why the code was organized
>>> this way, and whether, perhaps, the peager implementation made some
>>> assumptions about other parts of the StarPU core that are no longer true.
>>>
>>> Best regards,
>>> --
>>> Olivier
>>>
>>>> On 13 Feb 2018, at 17:36, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>
>>>> Dear Olivier,
>>>>
>>>> Please find attached a plot of my experiments with various numbers of
>>>> SPMD threads working on the LU panel factorisation. Using more threads is
>>>> beneficial, but unfortunately the benefit is minuscule. I have also
>>>> implemented the Fork-Join approach, wrapping the panel factorisation done
>>>> with OpenMP. I will show you the results soon. Thank you very much for
>>>> your help!
>>>>
>>>> —
>>>> Best wishes,
>>>> Maxim
>>>>
>>>> <haswell_dgetrf_starpu_spmd.pdf>
>>>>
>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>
>>>>> On 12 Feb 2018, at 22:01, Olivier Aumage <olivier.aumage@inria.fr>
>>>>> wrote:
>>>>>
>>>>> Hi Maxim,
>>>>>
>>>>> Regarding the issue with 'peager' scalability, the unpatched master
>>>>> branch should be similar to the 1.2.3 version. However, since 'peager'
>>>>> is still considered experimental, it is probably better to switch to
>>>>> the master branch, as fixes will likely arrive there first.
>>>>>
>>>>> Best regards,
>>>>> --
>>>>> Olivier
>>>>>
>>>>>> On 12 Feb 2018, at 22:50, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:
>>>>>>
>>>>>> Hello Olivier,
>>>>>>
>>>>>> I’m using version 1.2.3, downloaded from the INRIA website. Would it be
>>>>>> better to use the “rolling” edition? I will install it tomorrow morning!
>>>>>>
>>>>>> —
>>>>>> Best wishes,
>>>>>> Maxim
>>>>>>
>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>
>>>>>>> On 12 Feb 2018, at 21:46, Olivier Aumage <olivier.aumage@inria.fr>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Maxim,
>>>>>>>
>>>>>>> My patch was against StarPU's master branch as of Saturday morning.
>>>>>>> Which version of StarPU are you currently using?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> --
>>>>>>> Olivier
>>>>>>>
>>>>>>>> On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Olivier,
>>>>>>>>
>>>>>>>> Thank you very much for your reply and the patch. I have applied the
>>>>>>>> patch to the code and will re-run the experiments. I will get back
>>>>>>>> to you with the results. I think one of the changes in the patch
>>>>>>>> wasn’t successful. Please find below the output of the patch command
>>>>>>>> and the file with the rejects. Thank you and have a good day ahead!
>>>>>>>>
>>>>>>>> —
>>>>>>>> Best wishes,
>>>>>>>> Maksims
>>>>>>>>
>>>>>>>> <log>
>>>>>>>> <parallel_eager.c.rej>
>>>>>>>>
>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>
>>>>>>>>> On 10 Feb 2018, at 11:17, Olivier Aumage <olivier.aumage@inria.fr>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Maxim,
>>>>>>>>>
>>>>>>>>> I am not familiar with the peager implementation of StarPU (nor, I
>>>>>>>>> believe, is Samuel). I have had a quick look at the peager policy
>>>>>>>>> code, and there seems to be an issue with the initialization phase
>>>>>>>>> of the policy. Or perhaps I do not get the rationale behind it...
>>>>>>>>>
>>>>>>>>> Can you check whether the attached quick patch improves the
>>>>>>>>> scalability of your code? You can apply it with the following
>>>>>>>>> command:
>>>>>>>>> $ patch -p1 < ../peager.patch
>>>>>>>>>
>>>>>>>>> This is only meant to be a temporary fix, however. I need to check
>>>>>>>>> with the people who wrote the code about what the initial intent was.
>>>>>>>>>
>>>>>>>>> Hope this helps.
>>>>>>>>> Best regards,
>>>>>>>>> --
>>>>>>>>> Olivier
>>>>>>>>>
>>>>>>>>> <peager.patch>
>>>>>>>>>
>>>>>>>>>> On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear all,
>>>>>>>>>>
>>>>>>>>>> I have implemented the parallel panel factorisation in LU with
>>>>>>>>>> StarPU’s SPMD capability. Here are a few answers to my own
>>>>>>>>>> questions:
>>>>>>>>>>
>>>>>>>>>>> 1) Am I passing the barrier structure correctly, so that it is
>>>>>>>>>>> “shared” amongst all the threads and the threads “know” about the
>>>>>>>>>>> status of the other threads? To achieve this I pass the barrier
>>>>>>>>>>> structure by reference.
>>>>>>>>>>
>>>>>>>>>> Yes, it is passed correctly. All other threads "know about” and
>>>>>>>>>> share the values inside the barrier structure.
>>>>>>>>>>
>>>>>>>>>>> 2) Maybe it is the tile descriptors that “block” the execution of
>>>>>>>>>>> the threads inside the panel? Maybe the threads with ranks 1, 2
>>>>>>>>>>> cannot proceed, since all the tiles are blocked by rank 0? Should I
>>>>>>>>>>> therefore conclude that “blocking” the tiles the way I do is
>>>>>>>>>>> incorrect?
>>>>>>>>>>
>>>>>>>>>> Tile “blocking” is correct. The problem did not lie in the tile
>>>>>>>>>> “blocking”, but rather in the use of a non-parallel StarPU
>>>>>>>>>> scheduler. According to the StarPU handbook, only two schedulers,
>>>>>>>>>> “pheft” and “peager”, support the SPMD mode of execution.
>>>>>>>>>>
>>>>>>>>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>>>>>>>>> “max_parallelism” value instead of hard-coding it?
>>>>>>>>>>
>>>>>>>>>> Since the codelet is a static structure, I am setting the maximum
>>>>>>>>>> number of threads by accessing the “max_parallelism" value as
>>>>>>>>>> follows. It is set right before inserting the SPMD task:
>>>>>>>>>>
>>>>>>>>>> // Set maximum no. of threads per panel factorisation
>>>>>>>>>> core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;
>>>>>>>>>>
>>>>>>>>>> Please find attached a performance plot of LU factorisation (with
>>>>>>>>>> and without SPMD functionality) executed on a 20-core Haswell
>>>>>>>>>> machine. I believe something goes terribly wrong, since the SPMD
>>>>>>>>>> performance numbers are so low. I have used the following command
>>>>>>>>>> to execute the tests:
>>>>>>>>>>
>>>>>>>>>> export MKL_NUM_THREADS=20
>>>>>>>>>> export OMP_NUM_THREADS=20
>>>>>>>>>> export OMP_PROC_BIND=true
>>>>>>>>>> export STARPU_NCPU=20
>>>>>>>>>> export STARPU_SCHED=peager
>>>>>>>>>> export PLASMA_TUNING_FILENAME=...
>>>>>>>>>>
>>>>>>>>>> numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=...
>>>>>>>>>> --mtpf=... --iter=...
>>>>>>>>>>
>>>>>>>>>> Any insight and help in recovering the performance numbers would
>>>>>>>>>> be greatly appreciated. Thank you and have a good day!
>>>>>>>>>>
>>>>>>>>>> —
>>>>>>>>>> Best wishes,
>>>>>>>>>> Maxim
>>>>>>>>>>
>>>>>>>>>> <haswell_dgetrf2_starpu.pdf>
>>>>>>>>>>
>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>
>>>>>>>>>>> On 5 Feb 2018, at 12:16, Maxim Abalenkov
>>>>>>>>>>> <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> I’m on a mission to apply the SPMD capability of StarPU
>>>>>>>>>>> (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks)
>>>>>>>>>>> to the panel factorisation stage of the LU algorithm. Please see
>>>>>>>>>>> the attached figure for an example of my scenario.
>>>>>>>>>>>
>>>>>>>>>>> The matrix is viewed as a set of tiles (rectangular or square
>>>>>>>>>>> matrix blocks). A column of tiles is called a panel.
>>>>>>>>>>>
>>>>>>>>>>> In the first stage of the LU algorithm I would like to take a
>>>>>>>>>>> panel, find the pivots, swap the necessary rows, and scale and
>>>>>>>>>>> update the underlying matrix elements. To track the dependencies I
>>>>>>>>>>> created tile descriptors that keep the information about the
>>>>>>>>>>> access mode and the tile handle. Essentially, the tile descriptors
>>>>>>>>>>> are used to “lock” the entire panel; all the operations inside are
>>>>>>>>>>> parallelised manually using a custom barrier and auxiliary arrays
>>>>>>>>>>> that store the maximum values and their indices. To assign a
>>>>>>>>>>> particular piece of work to a thread (processing the panel
>>>>>>>>>>> factorisation) I use ranks. Depending on its rank, each thread gets
>>>>>>>>>>> its portion of the data to work on. Inside the panel the threads
>>>>>>>>>>> are synchronised manually and wait for each other at the custom
>>>>>>>>>>> barrier.
>>>>>>>>>>>
>>>>>>>>>>> Please pay attention to the attached figure. A panel consisting
>>>>>>>>>>> of five tiles is passed to the StarPU task. Imagine we have three
>>>>>>>>>>> threads processing the panel. To find the first pivot we assign
>>>>>>>>>>> the first column of each tile to a certain thread in a
>>>>>>>>>>> round-robin manner (0,1,2,0,1). Once the maximum per tile is
>>>>>>>>>>> found by each thread, the master thread (with rank 0) will select
>>>>>>>>>>> the global maximum. I would like to apply the SPMD capability of
>>>>>>>>>>> StarPU to process the panel and use a custom barrier inside.
>>>>>>>>>>>
>>>>>>>>>>> Please consider the C code below. The code runs, but the threads
>>>>>>>>>>> wait indefinitely at the first barrier. My questions are:
>>>>>>>>>>>
>>>>>>>>>>> 1) Am I passing the barrier structure correctly, so that it is
>>>>>>>>>>> “shared” amongst all the threads and the threads “know” about the
>>>>>>>>>>> status of the other threads? To achieve this I pass the barrier
>>>>>>>>>>> structure by reference.
>>>>>>>>>>> 2) Maybe it is the tile descriptors that “block” the execution of
>>>>>>>>>>> the threads inside the panel? Maybe the threads with ranks 1, 2
>>>>>>>>>>> cannot proceed, since all the tiles are blocked by rank 0? Should I
>>>>>>>>>>> therefore conclude that “blocking” the tiles the way I do is
>>>>>>>>>>> incorrect?
>>>>>>>>>>> 3) Is there a way to pass a variable to the codelet to set the
>>>>>>>>>>> “max_parallelism” value instead of hard-coding it?
>>>>>>>>>>>
>>>>>>>>>>> 4) If I may, I would like to make a general comment. I like
>>>>>>>>>>> StarPU very much. I think you have invested a great deal of time
>>>>>>>>>>> and effort into it. Thank you. But to my mind the weakest point
>>>>>>>>>>> (from my user experience) is passing values to StarPU while
>>>>>>>>>>> inserting a task. There is no type checking of the variables
>>>>>>>>>>> there. The same applies to the routine
>>>>>>>>>>> “starpu_codelet_unpack_args()”, when you want to obtain the values
>>>>>>>>>>> “on the other side”. Sometimes it becomes a nightmare and a
>>>>>>>>>>> trial-and-error exercise, as illustrated below. If type checks
>>>>>>>>>>> could be enforced there, it would make a user’s life much easier.
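>>>>>>>>>>>
>>>>>>>>>>> As a small, hypothetical illustration of the pitfall (the names
>>>>>>>>>>> below are made up and are not part of my code): the packing and
>>>>>>>>>>> unpacking sides must agree exactly on order and sizes, and nothing
>>>>>>>>>>> warns you when they do not:
>>>>>>>>>>>
>>>>>>>>>>> #include <starpu.h>
>>>>>>>>>>>
>>>>>>>>>>> // Unpacking must mirror the packing order and sizes exactly;
>>>>>>>>>>> // a mismatch compiles silently and corrupts the unpacked values.
>>>>>>>>>>> static void example_cpu_func(void *buffers[], void *cl_arg) {
>>>>>>>>>>>     (void)buffers;
>>>>>>>>>>>     int n;
>>>>>>>>>>>     double alpha;
>>>>>>>>>>>     starpu_codelet_unpack_args(cl_arg, &n, &alpha);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> static struct starpu_codelet example_cl =
>>>>>>>>>>> {
>>>>>>>>>>>     .where = STARPU_CPU,
>>>>>>>>>>>     .cpu_funcs = { example_cpu_func },
>>>>>>>>>>>     .nbuffers = 0,
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> // Assumes starpu_init() has already been called
>>>>>>>>>>> void example_submit(void) {
>>>>>>>>>>>     int n = 42;
>>>>>>>>>>>     double alpha = 1.5;
>>>>>>>>>>>     starpu_task_insert(&example_cl,
>>>>>>>>>>>         STARPU_VALUE, &n,     sizeof(int),
>>>>>>>>>>>         STARPU_VALUE, &alpha, sizeof(double),
>>>>>>>>>>>         0);
>>>>>>>>>>> }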
>>>>>>>>>>>
>>>>>>>>>>> // StarPU LU panel factorisation function
>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>> void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
>>>>>>>>>>>                  volatile int *max_idx, volatile plasma_complex64_t *max_val,
>>>>>>>>>>>                  int ib, int rank, int mtpf, volatile int *info,
>>>>>>>>>>>                  plasma_barrier_t *barrier)
>>>>>>>>>>> {
>>>>>>>>>>>     …
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>> // StarPU ZGETRF SPMD CPU kernel
>>>>>>>>>>> static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {
>>>>>>>>>>>
>>>>>>>>>>>     plasma_desc_t A;
>>>>>>>>>>>     int ib, mtpf, k, *piv;
>>>>>>>>>>>     volatile int *max_idx, *info;
>>>>>>>>>>>     volatile plasma_complex64_t *max_val;
>>>>>>>>>>>     plasma_barrier_t *barrier;
>>>>>>>>>>>
>>>>>>>>>>>     // Unpack scalar arguments
>>>>>>>>>>>     starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
>>>>>>>>>>>                                &k, &info, &barrier);
>>>>>>>>>>>
>>>>>>>>>>>     int rank = starpu_combined_worker_get_rank();
>>>>>>>>>>>
>>>>>>>>>>>     // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>>>>>>     plasma_complex64_t **pnlK =
>>>>>>>>>>>         (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
>>>>>>>>>>>     assert(pnlK != NULL);
>>>>>>>>>>>
>>>>>>>>>>>     printf("Panel: %d\n", k);
>>>>>>>>>>>
>>>>>>>>>>>     // Unpack tile data
>>>>>>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>>>>>>         pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>>     // Unpack pivots vector
>>>>>>>>>>>     piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);
>>>>>>>>>>>
>>>>>>>>>>>     // Call computation kernel
>>>>>>>>>>>     core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
>>>>>>>>>>>                 ib, rank, mtpf, info, barrier);
>>>>>>>>>>>
>>>>>>>>>>>     // Deallocate container panel
>>>>>>>>>>>     free(pnlK);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>> // StarPU SPMD codelet
>>>>>>>>>>> static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
>>>>>>>>>>> {
>>>>>>>>>>> .type = STARPU_SPMD,
>>>>>>>>>>> .max_parallelism = 2,
>>>>>>>>>>> .cpu_funcs = { core_starpu_cpu_zgetrf_spmd },
>>>>>>>>>>> .cpu_funcs_name = { "zgetrf_spmd" },
>>>>>>>>>>> .nbuffers = STARPU_VARIABLE_NBUFFERS,
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> /******************************************************************************/
>>>>>>>>>>> // StarPU task inserter
>>>>>>>>>>> void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
>>>>>>>>>>>                              volatile int *max_idx,
>>>>>>>>>>>                              volatile plasma_complex64_t *max_val,
>>>>>>>>>>>                              int ib, int mtpf, int k,
>>>>>>>>>>>                              volatile int *info, plasma_barrier_t *barrier) {
>>>>>>>>>>>
>>>>>>>>>>>     // Pointer to first (top) tile in panel k
>>>>>>>>>>>     struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);
>>>>>>>>>>>
>>>>>>>>>>>     // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
>>>>>>>>>>>     for (int i = 0; i < A.mt; i++) {
>>>>>>>>>>>         (pk+i)->mode = STARPU_RW;
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>>     int retval = starpu_task_insert(
>>>>>>>>>>>         &core_starpu_codelet_zgetrf_spmd,
>>>>>>>>>>>         STARPU_VALUE, &A, sizeof(plasma_desc_t),
>>>>>>>>>>>         STARPU_DATA_MODE_ARRAY, pk, A.mt,
>>>>>>>>>>>         STARPU_RW, hPiv,
>>>>>>>>>>>         STARPU_VALUE, &max_idx, sizeof(volatile int*),
>>>>>>>>>>>         STARPU_VALUE, &max_val, sizeof(volatile plasma_complex64_t*),
>>>>>>>>>>>         STARPU_VALUE, &ib, sizeof(int),
>>>>>>>>>>>         STARPU_VALUE, &mtpf, sizeof(int),
>>>>>>>>>>>         STARPU_VALUE, &k, sizeof(int),
>>>>>>>>>>>         STARPU_VALUE, &info, sizeof(volatile int*),
>>>>>>>>>>>         STARPU_VALUE, &barrier, sizeof(plasma_barrier_t*),
>>>>>>>>>>>         STARPU_NAME, "zgetrf",
>>>>>>>>>>>         0);
>>>>>>>>>>>
>>>>>>>>>>>     STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> —
>>>>>>>>>>> Best wishes,
>>>>>>>>>>> Maxim
>>>>>>>>>>>
>>>>>>>>>>> <lu_panel_fact.jpg>
>>>>>>>>>>>
>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>
>>>>>>>>>>>> On 24 Jan 2018, at 17:52, Maxim Abalenkov
>>>>>>>>>>>> <maxim.abalenkov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Samuel,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much! Yes, in this particular use case
>>>>>>>>>>>> “STARPU_NONE” would come in handy and make the source code much
>>>>>>>>>>>> more “elegant”.
>>>>>>>>>>>>
>>>>>>>>>>>> —
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>> Maxim
>>>>>>>>>>>>
>>>>>>>>>>>> Maxim Abalenkov \\ maxim.abalenkov@gmail.com
>>>>>>>>>>>> +44 7 486 486 505 \\ http://mabalenk.gitlab.io
>>>>>>>>>>>>
>>>>>>>>>>>>> On 24 Jan 2018, at 17:47, Samuel Thibault
>>>>>>>>>>>>> <samuel.thibault@inria.fr> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maxim Abalenkov, on Mon 15 Jan 2018 18:04:48 +0000, wrote:
>>>>>>>>>>>>>> I have a very simple question. What is the overhead of using
>>>>>>>>>>>>>> the STARPU_NONE
>>>>>>>>>>>>>> access mode for some handles in the STARPU_DATA_MODE_ARRAY?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is not implemented; we hadn't thought it could be useful. I
>>>>>>>>>>>>> have now added it to the TODO list (but that list is very long
>>>>>>>>>>>>> and doesn't tend to progress quickly).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The overhead would be quite small: StarPU would just write it
>>>>>>>>>>>>> down in
>>>>>>>>>>>>> the array of data to fetch, and just not process that element.
>>>>>>>>>>>>> Of course
>>>>>>>>>>>>> the theoretical complexity will be O(number of data).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to avoid using complicated offsets in my computation
>>>>>>>>>>>>>> routines
>>>>>>>>>>>>>> I would like to pass them a column of matrix tiles, while
>>>>>>>>>>>>>> setting the
>>>>>>>>>>>>>> “unused” tiles to “STARPU_NONE”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Samuel
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>




