
starpu-devel - Re: [Starpu-devel] [LU factorisation: gdb debug output]

Subject: Developers list for StarPU

List archives

Re: [Starpu-devel] [LU factorisation: gdb debug output]


  • From: Maxim Abalenkov <maxim.abalenkov@gmail.com>
  • To: Olivier Aumage <olivier.aumage@inria.fr>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
  • Date: Fri, 23 Feb 2018 15:33:32 +0000
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello Olivier,

After running the example with the first command I obtain:

STARPU_SINGLE_COMBINED_WORKER=0 STARPU_SCHED=peager ./vector_scal_spmd
BEFORE: First element was 1.000000
BEFORE: Last element was 204800.000000
running task with 2 CPUs.
running task with 4 CPUs.
running task with 2 CPUs.
...
running task with 4 CPUs.
AFTER: First element is 21926.833984
AFTER: Last element is 4490649600.000000

After running it with the second command I see:

STARPU_SINGLE_COMBINED_WORKER=1 STARPU_SCHED=peager ./vector_scal_spmd
BEFORE: First element was 1.000000
BEFORE: Last element was 204800.000000
running task with 4 CPUs.
running task with 4 CPUs.
running task with 4 CPUs.
...
running task with 4 CPUs.
AFTER: First element is 21926.833984
AFTER: Last element is 4490649600.000000

Obviously, it works for the vector scaling example. May I ask what the STARPU_SINGLE_COMBINED_WORKER environment variable controls? I haven’t been using it for my test case; I only set STARPU_NCPU and STARPU_SCHED.

Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 23 Feb 2018, at 15:12, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

What do you obtain when you run the "vector_scal_spmd" example program of StarPU on your machine with the following two command lines?

STARPU_SINGLE_COMBINED_WORKER=0 STARPU_SCHED=peager ./install-trunk-cpu/lib/starpu/examples/vector_scal_spmd

STARPU_SINGLE_COMBINED_WORKER=1 STARPU_SCHED=peager ./install-trunk-cpu/lib/starpu/examples/vector_scal_spmd

The first one should run parallel tasks of varying sizes, while the second one should run tasks at max parallel size only. This is what I obtain at this time on our machine.

Best regards,
--
Olivier

On 23 Feb 2018, at 15:16, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

I have been testing my code with the new “peager” for a while now. I think there might be a mistake in the policy implementation: my LU code “hangs” when using peager and multiple co-worker threads. Thank you.


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 23 Feb 2018, at 12:29, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

Thank you very much for your email. I have “un-applied” your patch and “pulled” the new version of the “peager” implementation. Unfortunately, it seems that setting the “max_parallelism” attribute of the static codelet struct no longer works, i.e. prior to inserting a new task I specify this value as:

 codelet_spmd.max_parallelism = N;

Then in the “CPU” function I call:

 int n = starpu_combined_worker_get_size();

However, the value returned is always 1.
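
For reference, here is a minimal, self-contained sketch of the pattern I am using (illustrative names, not my actual PLASMA code; needs <starpu.h> and <stdio.h>): the codelet is a static structure, so max_parallelism is patched just before the task is inserted, and the kernel asks StarPU how wide a team it actually received:

static void spmd_probe(void *buffers[], void *cl_arg)
{
    (void) buffers; (void) cl_arg;
    int size = starpu_combined_worker_get_size();   // team width chosen by the scheduler
    int rank = starpu_combined_worker_get_rank();   // this worker's rank within the team
    printf("worker %d of %d\n", rank, size);
}

static struct starpu_codelet spmd_probe_cl =
{
    .type      = STARPU_SPMD,
    .cpu_funcs = { spmd_probe },
    .nbuffers  = 0,
};

void submit_probe(int N)
{
    spmd_probe_cl.max_parallelism = N;   // expected to bound the team size
    starpu_task_insert(&spmd_probe_cl, 0);
}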


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 22 Feb 2018, at 14:54, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

I have implemented a new algorithm for the Parallel Eager scheduling policy earlier this week. It is available in the master/ branch and the starpu-1.3/ branch. It replaces the previous "peager" implementation.

It is not meant to be very clever, since this is still a greedy scheduler, but it is able to schedule parallel tasks while also scheduling tasks in parallel.

You can perhaps use it as a basis for a scheduler more customized to your needs, either by acting on some parameters (the minimum and maximum combined worker sizes, the codelet's max_parallelism) or even by copying and extending it.

There is indeed a very application-dependent trade-off to find between favoring intra-task parallelism and inter-task parallelism. Favoring intra-task parallelism means that the scheduler will try harder to build a parallel team of workers to work on a single task (potentially increasing idle time while such a team is assembled). On the contrary, favoring inter-task parallelism means that the scheduler will try to schedule tasks as soon as possible, even if only small parallel teams are available.
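
As an illustration only (a sketch, and the right values are application-dependent), the combined worker sizes can be constrained from the environment before starpu_init(), using the STARPU_MIN_WORKERSIZE / STARPU_MAX_WORKERSIZE variables:

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    // Sketch: ask the parallel schedulers to build teams of exactly 4 workers.
    setenv("STARPU_SCHED", "peager", 1);
    setenv("STARPU_MIN_WORKERSIZE", "4", 1);   // no combined worker smaller than 4
    setenv("STARPU_MAX_WORKERSIZE", "4", 1);   // no combined worker larger than 4

    if (starpu_init(NULL) != 0)
        return 1;

    // ... insert tasks here; each codelet's max_parallelism still caps its own team ...

    starpu_shutdown();
    return 0;
}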

Note: you should remove the temporary patch I sent earlier, before pulling the new implementation.

Best regards,
--
Olivier

On 13 Feb 2018, at 23:14, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

In the Fork-Join mode, StarPU indeed only launches the task on the master thread of the team. This is the intended behavior of this mode, because it is designed to be used for tasks that launch their own computing threads. This is the case, for instance, when the parallel task is a kernel from a "black-box" parallel library, or a kernel written with some third-party OpenMP runtime. If you have a look at Section "6.10.1 Fork-mode Parallel Tasks" of the StarPU manual, you will see that the example given is an OpenMP kernel.

Moreover, in this Fork-Join mode, StarPU sets a CPU mask just before starting the parallel task on the master thread. The parallel task can then use this mask to know how many computing threads to launch, and to which CPU cores they should be bound.
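
For illustration only, a fork-mode kernel along the lines of the manual's example looks roughly like the sketch below (the vector data layout is just an example, not your code; compile with -fopenmp and include <starpu.h>):

static void scal_cpu_func_fork(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *vec = (struct starpu_vector_interface *) buffers[0];
    unsigned n   = STARPU_VECTOR_GET_NX(vec);
    float   *val = (float *) STARPU_VECTOR_GET_PTR(vec);
    float factor = *(float *) cl_arg;

    // StarPU has already restricted the CPU mask of the master thread to the
    // whole team, so the OpenMP threads spawned here stay on those cores.
    int team = starpu_combined_worker_get_size();

    #pragma omp parallel for num_threads(team)
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}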

Best regards,
--
Olivier

On 13 Feb 2018, at 22:44, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear Olivier,

Thank you very much for your help and explanation. Please find attached an updated performance plot. I have included the data from applying the “lws” scheduling policy with 1 thread per panel factorisation. It is interesting to note that the performance of the algorithm under “lws” almost reaches the performance of “peager”, which shows that “peager” is potentially suboptimal.

I have also tried to experiment with the Fork-Join approach and “peager”, but unfortunately without any success. The code hangs and does not proceed. From my own debugging it seems that at the first iteration, for the first panel factorisation, only the master thread is launched instead of 4 threads. For the subsequent panels all 4 threads enter the panel factorisation kernel, the “core” routine, but then they hang. The same happens when I use the other parallel scheduling policy, “pheft”.

The StarPU functions that I use to test the Fork-Join approach are given below:

/******************************************************************************/
// StarPU ZGETRF CPU kernel (Fork-Join)
static void core_starpu_cpu_zgetrf_fj(void *desc[], void *cl_arg) {

    plasma_desc_t A;
    int ib, k, *piv;
    volatile int *max_idx, *info;
    volatile plasma_complex64_t *max_val;
    plasma_barrier_t *barrier;

    // Unpack scalar arguments
    starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &k,
                               &info, &barrier);

    int mtpf = starpu_combined_worker_get_size();

    // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
    plasma_complex64_t **pnlK =
        (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
    assert(pnlK != NULL);

    // Unpack tile data
    for (int i = 0; i < A.mt; i++) {
        pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
    }

    // Unpack pivots vector
    piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);

    // Call computation kernel
    #pragma omp parallel
    #pragma omp master
    {
        #pragma omp taskloop untied shared(barrier) num_tasks(mtpf) priority(2)
        for (int rank = 0; rank < mtpf; rank++) {
            core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
                        ib, rank, mtpf, info, barrier);
        }
    }

    // Deallocate container panel
    free(pnlK);
}

/******************************************************************************/
// StarPU codelet (Fork-Join)
static struct starpu_codelet core_starpu_codelet_zgetrf_fj =
{
    .where          = STARPU_CPU,
    .type           = STARPU_FORKJOIN,
    .cpu_funcs      = { core_starpu_cpu_zgetrf_fj },
    .cpu_funcs_name = { "zgetrf_fj" },
    .nbuffers       = STARPU_VARIABLE_NBUFFERS,
    .name           = "zgetrf_cl_fj"
};

/******************************************************************************/
// StarPU task inserter (Fork-Join)
void core_starpu_zgetrf_fj(plasma_desc_t A, starpu_data_handle_t hPiv,
                           volatile int *max_idx, volatile plasma_complex64_t *max_val,
                           int ib, int mtpf, int k, int prio,
                           volatile int *info, plasma_barrier_t *barrier) {

    // Set maximum no. of threads per panel factorisation
    core_starpu_codelet_zgetrf_fj.max_parallelism = mtpf;

    // Pointer to first (top) tile in panel k
    struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);

    // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
    for (int i = 0; i < A.mt; i++) {
        (pk+i)->mode = STARPU_RW;
    }

    int retval = starpu_task_insert(
        &core_starpu_codelet_zgetrf_fj,
        STARPU_VALUE,            &A,        sizeof(plasma_desc_t),
        STARPU_DATA_MODE_ARRAY,   pk,       A.mt,
        STARPU_RW,                hPiv,
        STARPU_VALUE,            &max_idx,  sizeof(volatile int*),
        STARPU_VALUE,            &max_val,  sizeof(volatile plasma_complex64_t*),
        STARPU_VALUE,            &ib,       sizeof(int),
        STARPU_VALUE,            &k,        sizeof(int),
        STARPU_VALUE,            &info,     sizeof(volatile int*),
        STARPU_VALUE,            &barrier,  sizeof(plasma_barrier_t*),
        STARPU_NAME,             "zgetrf_fj",
        0);

    STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf_fj: starpu_task_insert() failed");
}


Best wishes,
Maxim

<haswell_dgetrf_starpu_spmd_lws.pdf>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 13 Feb 2018, at 18:09, Olivier Aumage <olivier.aumage@inria.fr> wrote:

[missing forward to the list]

Begin forwarded message:

From: Olivier Aumage <olivier.aumage@inria.fr>
Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
Date: 13 February 2018 at 19:06:01 UTC+1
To: Maxim Abalenkov <maxim.abalenkov@gmail.com>

Hi Maxim,

It is actually expected that the patch benefit is low.

The main issue with 'peager' is that the initialization phase builds a table indicating, for each worker, who is the master of its parallel team. However, this is iterated for teams of increasing sizes, up to the team containing all workers. Thus, every worker ends up being assigned worker 0 as its master. The result is that, in the unpatched version, only worker 0 fetches parallel tasks, and the tasks are therefore serialized with respect to each other. This is why you obtained a flat scalability plot with that version.

The small patch I sent simply limits the size of the worker teams, to avoid having every worker under the control of worker 0. I put an arbitrary limit of 4 workers per group in the patch.

Of course, this is only temporary. I am away from the office this week, and I need to check with some colleagues why the code was organized this way, and whether the peager implementation perhaps made assumptions about other parts of the StarPU core that are no longer true.

Best regards,
--
Olivier

On 13 Feb 2018, at 17:36, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear Olivier,

Please find attached a plot of my experiments with various numbers of SPMD threads working on the LU panel factorisation. Using more threads is beneficial, but unfortunately the benefit is minuscule. I have also implemented the Fork-Join approach, wrapping the panel factorisation done with OpenMP. I will show you the results soon. Thank you very much for your help!


Best wishes,
Maxim

<haswell_dgetrf_starpu_spmd.pdf>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 12 Feb 2018, at 22:01, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

Regarding the issue with 'peager' scalability, the unpatched master branch should be similar to the 1.2.3 version. However, since 'peager' is still considered experimental, it is probably better to switch to the master branch, as fixes will likely arrive there first.

Best regards,
--
Olivier

On 12 Feb 2018, at 22:50, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

I’m using version 1.2.3, downloaded from the INRIA website. Would it be better to use the “rolling” edition? I will install it tomorrow morning!


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 12 Feb 2018, at 21:46, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

My patch was against StarPU's master branch as of Saturday morning. Which version of StarPU are you currently using?

Best regards,
--
Olivier

On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

Thank you very much for your reply and the patch. I have applied the patch to the code and will re-run the experiments. I will get back to you with the results. I think one of the changes in the patch wasn’t successful. Please find below the output of the patch command and the file with the rejects. Thank you and have a good day ahead!


Best wishes,
Maksims

<log>
<parallel_eager.c.rej>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 10 Feb 2018, at 11:17, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

I am not familiar with the peager implementation of StarPU (nor, I believe, is Samuel). I have had a quick look at the peager policy code, and there seems to be an issue with the initialization phase of the policy, or perhaps I do not get the rationale behind it...

Can you check whether the attached quick patch improves the scalability of your code? You can apply it with the following command:
$ patch -p1 <../peager.patch

This is only meant to be a temporary fix, however. I need to check with the people who wrote the code about what the initial intent was.

Hope this helps.
Best regards,
--
Olivier

<peager.patch>

On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear all,

I have implemented the parallel panel factorisation in LU with StarPU’s SPMD capability. Here are a few answers to my own questions:

1) Am I passing the barrier structure correctly, so that it is “shared” amongst all the threads and the threads “know” about the status of the other threads? To achieve this I pass the barrier structure by reference.

Yes, it is passed correctly. All the threads “know about” and share the values inside the barrier structure.

2) Maybe it is the tile descriptors that “block” the execution of the threads inside the panel? Maybe the threads with ranks 1 and 2 cannot proceed, since all the tiles are blocked by rank 0? Can I therefore conclude that “blocking” the tiles the way I do is incorrect?

Tile “blocking” is correct. The problem did not lie in the tile “blocking”, but rather in the use of a non-parallel StarPU scheduler. According to the StarPU handbook, only two schedulers, “pheft” and “peager”, support the SPMD mode of execution.

3) Is there a way to pass a variable to the codelet to set the “max_parallelism” value instead of hard-coding it?

Since the codelet is a static structure, I am setting the maximum number of threads by assigning the “max_parallelism” field as follows, right before inserting the SPMD task:

// Set maximum no. of threads per panel factorisation
core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;

Please find attached a performance plot of the LU factorisation (with and without SPMD functionality) executed on a 20-core Haswell machine. I believe something goes terribly wrong, since the SPMD performance numbers are so low. I have used the following commands to execute the tests:

export MKL_NUM_THREADS=20
export OMP_NUM_THREADS=20
export OMP_PROC_BIND=true
export STARPU_NCPU=20
export STARPU_SCHED=peager
export PLASMA_TUNING_FILENAME=...

numactl --interleave=all ./test dgetrf --dim=… --nb=… --ib=... --mtpf=... --iter=...

Any insight and help in recovering the performance numbers would be greatly appreciated. Thank you and have a good day!


Best wishes,
Maxim

<haswell_dgetrf2_starpu.pdf>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 5 Feb 2018, at 12:16, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear all,

I’m on a mission to apply the SPMD capability of StarPU (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks) to the panel factorisation stage of the LU algorithm. Please see the attached figure for an example of my scenario.

The matrix is viewed as a set of tiles (rectangular or square matrix blocks). A column of tiles is called a panel.

In the first stage of the LU algorithm I would like to take a panel, find the pivots, swap the necessary rows, and scale and update the underlying matrix elements. To track the dependencies I created tile descriptors that keep the information about the access mode and the tile handle. Essentially, the tile descriptors are used to “lock” the entire panel; all the operations inside are parallelised manually, using a custom barrier and auxiliary arrays to store the maximum values and their indices. To assign a particular piece of work to a thread processing the panel factorisation I use ranks: depending on its rank, each thread gets its own portion of the data to work on. Inside the panel the threads are synchronised manually and wait for each other at the custom barrier.

Please refer to the attached figure. A panel consisting of five tiles is passed to the StarPU task. Imagine we have three threads processing the panel. To find the first pivot we assign the first column of each tile to a thread in a round-robin manner (0,1,2,0,1). Once each thread has found its per-tile maximum, the master thread (with rank 0) selects the global maximum. I would like to apply the SPMD capability of StarPU to process the panel and use a custom barrier inside.
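
To make the intent concrete, here is a rough sketch of that round-robin ownership and the rank-0 reduction (local_pivot_search() and plasma_barrier_wait() are placeholders for the actual PLASMA helpers, and <complex.h> is assumed for cabs()):

void panel_pivot_step(plasma_desc_t A, plasma_complex64_t **pnl,
                      volatile int *max_idx, volatile plasma_complex64_t *max_val,
                      int rank, int mtpf, plasma_barrier_t *barrier)
{
    // Each thread searches the tiles it owns: rank, rank+mtpf, rank+2*mtpf, ...
    for (int i = rank; i < A.mt; i += mtpf)
        local_pivot_search(pnl[i], &max_val[rank], &max_idx[rank]);

    plasma_barrier_wait(barrier, mtpf);   // all local candidates are now published

    if (rank == 0) {
        // Rank 0 selects the global pivot among the mtpf local candidates
        for (int r = 1; r < mtpf; r++) {
            if (cabs(max_val[r]) > cabs(max_val[0])) {
                max_val[0] = max_val[r];
                max_idx[0] = max_idx[r];
            }
        }
    }
}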

Please consider the C code below. The code runs, but the threads wait indefinitely at the first barrier. My questions are:

1) Am I passing the barrier structure correctly, so that it is “shared” amongst all the threads and the threads “know” about the status of the other threads? To achieve this I pass the barrier structure by reference.
2) Maybe it is the tile descriptors that “block” the execution of the threads inside the panel? Maybe the threads with ranks 1 and 2 cannot proceed, since all the tiles are blocked by rank 0? Can I therefore conclude that “blocking” the tiles the way I do is incorrect?
3) Is there a way to pass a variable to the codelet to set the “max_parallelism” value instead of hard-coding it?

4) If I may, I would like to make a general comment. I like StarPU very much and I think you have invested a great deal of time and effort into it. Thank you. But to my mind the weakest point (from my user experience) is passing values to StarPU when inserting a task: there is no type checking of the variables there. The same applies to the routine “starpu_codelet_unpack_args()”, when you want to obtain the values “on the other side”. Sometimes it becomes a nightmare and a trial-and-error exercise. If type checks could be enforced, it would make a user’s life much easier.
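
For example, a small wrapper macro on the application side (just a sketch, not a StarPU feature) already removes one common source of mismatches by tying the address and its sizeof together:

// Application-side helper (not a StarPU API): keep the pointer and the
// matching sizeof together so they cannot drift apart.
#define PACK_VALUE(v)  STARPU_VALUE, &(v), sizeof(v)

// Usage, equivalent to writing STARPU_VALUE, &ib, sizeof(int), ...
static void example_insert(struct starpu_codelet *cl, int ib, int k)
{
    int retval = starpu_task_insert(cl,
                                    PACK_VALUE(ib),
                                    PACK_VALUE(k),
                                    0);
    STARPU_CHECK_RETURN_VALUE(retval, "starpu_task_insert() failed");
}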

/******************************************************************************/
// StarPU LU panel factorisation function
void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
                 volatile int *max_idx, volatile plasma_complex64_t *max_val,
                 int ib, int rank, int mtpf, volatile int *info,
                 plasma_barrier_t *barrier)
{
    // ... (panel factorisation implementation not shown in this email) ...
}

/******************************************************************************/
// StarPU ZGETRF SPMD CPU kernel
static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {

    plasma_desc_t A;
    int ib, mtpf, k, *piv;
    volatile int *max_idx, *info;
    volatile plasma_complex64_t *max_val;
    plasma_barrier_t *barrier;

    // Unpack scalar arguments
    starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
                               &k, &info, &barrier);

    int rank = starpu_combined_worker_get_rank();

    // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
    plasma_complex64_t **pnlK =
        (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
    assert(pnlK != NULL);

    printf("Panel: %d\n", k);

    // Unpack tile data
    for (int i = 0; i < A.mt; i++) {
        pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
    }

    // Unpack pivots vector
    piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);

    // Call computation kernel
    core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
                ib, rank, mtpf, info, barrier);

    // Deallocate container panel
    free(pnlK);
}

/******************************************************************************/
// StarPU SPMD codelet
static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
{
    .type            = STARPU_SPMD,
    .max_parallelism = 2,
    .cpu_funcs       = { core_starpu_cpu_zgetrf_spmd },
    .cpu_funcs_name  = { "zgetrf_spmd" },
    .nbuffers        = STARPU_VARIABLE_NBUFFERS,
};

/******************************************************************************/
// StarPU task inserter
void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
                             volatile int *max_idx, volatile plasma_complex64_t *max_val,
                             int ib, int mtpf, int k,
                             volatile int *info, plasma_barrier_t *barrier) {

    // Pointer to first (top) tile in panel k
    struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);

    // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
    for (int i = 0; i < A.mt; i++) {
        (pk+i)->mode = STARPU_RW;
    }

    int retval = starpu_task_insert(
        &core_starpu_codelet_zgetrf_spmd,
        STARPU_VALUE,            &A,        sizeof(plasma_desc_t),
        STARPU_DATA_MODE_ARRAY,   pk,       A.mt,
        STARPU_RW,                hPiv,
        STARPU_VALUE,            &max_idx,  sizeof(volatile int*),
        STARPU_VALUE,            &max_val,  sizeof(volatile plasma_complex64_t*),
        STARPU_VALUE,            &ib,       sizeof(int),
        STARPU_VALUE,            &mtpf,     sizeof(int),
        STARPU_VALUE,            &k,        sizeof(int),
        STARPU_VALUE,            &info,     sizeof(volatile int*),
        STARPU_VALUE,            &barrier,  sizeof(plasma_barrier_t*),
        STARPU_NAME,             "zgetrf",
        0);

    STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
}


Best wishes,
Maxim

<lu_panel_fact.jpg>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 24 Jan 2018, at 17:52, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Samuel,

Thank you very much! Yes, in this particular use-case “STARPU_NONE” would come in handy and make the source code much more “elegant”.


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 24 Jan 2018, at 17:47, Samuel Thibault <samuel.thibault@inria.fr> wrote:

Hello,

Maxim Abalenkov, on Mon 15 Jan 2018 18:04:48 +0000, wrote:
I have a very simple question. What is the overhead of using the STARPU_NONE
access mode for some handles in the STARPU_DATA_MODE_ARRAY?

It is not implemented; we hadn't thought it could be useful. I have now
added it to the TODO list (but that list is very long and doesn't tend
to progress quickly).

The overhead would be quite small: StarPU would just write it down in
the array of data to fetch, and simply not process that element. Of course,
the theoretical complexity would still be O(number of data).
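
In the meantime, a possible workaround (a sketch only, with illustrative names) is to compact the descriptor array before insertion, so that only the tiles the kernel will actually touch are declared:

// Sketch: declare only the tiles that are actually accessed, instead of the
// full column padded with "unused" entries (needs <stdlib.h>).
void insert_compacted(struct starpu_codelet *cl,
                      starpu_data_handle_t *used_tiles, int n_used)
{
    struct starpu_data_descr *descr = malloc((size_t)n_used * sizeof(*descr));
    for (int i = 0; i < n_used; i++) {
        descr[i].handle = used_tiles[i];
        descr[i].mode   = STARPU_RW;
    }
    starpu_task_insert(cl,
                       STARPU_DATA_MODE_ARRAY, descr, n_used,
                       0);
    free(descr);   // the handles/modes are copied into the task at insertion
}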

In order to avoid using complicated offsets in my computation routines
I would like to pass them a column of matrix tiles, while setting the
“unused” tiles to “STARPU_NONE”.

I see.

Samuel

_______________________________________________
Starpu-devel mailing list
Starpu-devel@lists.gforge.inria.fr
https://lists.gforge.inria.fr/mailman/listinfo/starpu-devel