
starpu-devel - Re: [Starpu-devel] [LU factorisation: gdb debug output]

Subject: Developers list for StarPU

List archives

Re: [Starpu-devel] [LU factorisation: gdb debug output]


  • From: Maxim Abalenkov <maxim.abalenkov@gmail.com>
  • To: Olivier Aumage <olivier.aumage@inria.fr>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
  • Date: Tue, 13 Feb 2018 21:44:33 +0000
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Dear Olivier,

Thank you very much for your help and explanation. Please find attached an updated performance plot. I have included the data from applying the “lws” scheduling policy with 1 thread per panel factorisation. It is interesting to note that the performance of the algorithm under “lws” almost reaches that of “peager”, which suggests that “peager” is potentially suboptimal.

I have also tried to experiment with the Fork-Join approach and “peager”, but unfortunately without any success. The code hangs and does not proceed. From my own debugging it seems that at the first iteration, for the first panel factorisation, only the master thread is launched instead of 4 threads. For the subsequent panels all 4 threads enter the panel factorisation kernel, the “core” routine, but then they hang. The same happens when I use the other parallel scheduling policy, “pheft”.
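
(For reference, the hang itself can be inspected with a generic gdb session; this is standard gdb usage, nothing StarPU-specific:)

$ gdb -p <pid-of-the-hung-run>   # attach to the hung process
(gdb) info threads               # list the worker threads and where they stopped
(gdb) thread apply all bt        # backtrace of every thread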

The StarPU functions that I use to test the Fork-Join approach are given below:

/******************************************************************************/
// StarPU ZGETRF CPU kernel (Fork-Join)
static void core_starpu_cpu_zgetrf_fj(void *desc[], void *cl_arg) {

    plasma_desc_t A;
    int ib, k, *piv;
    volatile int *max_idx, *info;
    volatile plasma_complex64_t *max_val;
    plasma_barrier_t *barrier;

    // Unpack scalar arguments
    starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &k,
                               &info, &barrier);

    int mtpf = starpu_combined_worker_get_size();

    // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
    plasma_complex64_t **pnlK =
        (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
    assert(pnlK != NULL);

    // Unpack tile data
    for (int i = 0; i < A.mt; i++) {
        pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
    }

    // Unpack pivots vector
    piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);

    // Call computation kernel
    #pragma omp parallel
    #pragma omp master
    {
        #pragma omp taskloop untied shared(barrier) num_tasks(mtpf) priority(2)
        for (int rank = 0; rank < mtpf; rank++) {
            core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
                        ib, rank, mtpf, info, barrier);
        }
    }

    // Deallocate container panel
    free(pnlK);
}

/******************************************************************************/
// StarPU codelet (Fork-Join)
static struct starpu_codelet core_starpu_codelet_zgetrf_fj =
{
    .where           = STARPU_CPU,
    .type            = STARPU_FORKJOIN,
    .cpu_funcs       = { core_starpu_cpu_zgetrf_fj },
    .cpu_funcs_name  = { "zgetrf_fj" },
    .nbuffers        = STARPU_VARIABLE_NBUFFERS,
    .name            = "zgetrf_cl_fj"
};

/******************************************************************************/
// StarPU task inserter (Fork-Join)
void core_starpu_zgetrf_fj(plasma_desc_t A, starpu_data_handle_t hPiv,
                           volatile int *max_idx, volatile plasma_complex64_t *max_val,
                           int ib, int mtpf, int k, int prio,
                           volatile int *info, plasma_barrier_t *barrier) {

    // Set maximum no. of threads per panel factorisation
    core_starpu_codelet_zgetrf_fj.max_parallelism = mtpf;

    // Pointer to first (top) tile in panel k
    struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);

    // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
    for (int i = 0; i < A.mt; i++) {
        (pk+i)->mode = STARPU_RW;
    }

    int retval = starpu_task_insert(
        &core_starpu_codelet_zgetrf_fj,
        STARPU_VALUE,               &A,         sizeof(plasma_desc_t),
        STARPU_DATA_MODE_ARRAY,      pk,        A.mt,
        STARPU_RW,                   hPiv,
        STARPU_VALUE,               &max_idx,   sizeof(volatile int*),
        STARPU_VALUE,               &max_val,   sizeof(volatile plasma_complex64_t*),
        STARPU_VALUE,               &ib,        sizeof(int),
        STARPU_VALUE,               &k,         sizeof(int),
        STARPU_VALUE,               &info,      sizeof(volatile int*),
        STARPU_VALUE,               &barrier,   sizeof(plasma_barrier_t*),
        STARPU_NAME,                "zgetrf_fj",
        0);

    STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf_fj: starpu_task_insert() failed");
}

Best wishes,
Maxim

Attachment: haswell_dgetrf_starpu_spmd_lws.pdf
Description: Adobe PDF document


Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 13 Feb 2018, at 18:09, Olivier Aumage <olivier.aumage@inria.fr> wrote:

[missing forward to the list]

Begin forwarded message:

From: Olivier Aumage <olivier.aumage@inria.fr>
Subject: Re: [Starpu-devel] [LU factorisation: gdb debug output]
Date: 13 February 2018 at 19:06:01 UTC+1
To: Maxim Abalenkov <maxim.abalenkov@gmail.com>

Hi Maxim,

It is actually expected that the benefit from the patch is low.

The main issue with 'peager' is that the initialization phase builds a table indicating, for each worker, which worker is the master of its parallel team. However, this table is filled iteratively, for teams of increasing sizes up to the team containing all workers. Thus, every worker ends up being assigned worker 0 as its master. The result is that only worker 0 fetches parallel tasks in the unpatched version, and the tasks are therefore serialized with respect to each other. This is why you obtained a flat scalability plot with that version.

The small patch I sent simply limits the size of the worker teams, to avoid having every worker under the control of worker 0. I put an arbitrary limit of 4 workers per group in the patch.
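
As a purely illustrative sketch (not the actual StarPU source; the names master_of, NWORKERS and TEAM_CAP are invented here), the effect described above and the capping done by the patch can be pictured as follows:

#include <stdio.h>

#define NWORKERS 8

int main(void)
{
    int master_of[NWORKERS];

    // Unpatched behaviour (sketch): teams of increasing size are laid out
    // one after another, so the last, largest team assigns worker 0 as
    // everyone's master.
    for (int size = 1; size <= NWORKERS; size *= 2)
        for (int w = 0; w < NWORKERS; w++)
            master_of[w] = (w / size) * size;   // master = first worker of the team

    for (int w = 0; w < NWORKERS; w++)
        printf("worker %d -> master %d\n", w, master_of[w]);   // all masters are 0

    // Patched behaviour (sketch): cap the team size, e.g. at 4 workers, so
    // several masters remain and parallel tasks are no longer serialized.
    const int TEAM_CAP = 4;
    for (int size = 1; size <= TEAM_CAP; size *= 2)
        for (int w = 0; w < NWORKERS; w++)
            master_of[w] = (w / size) * size;

    for (int w = 0; w < NWORKERS; w++)
        printf("worker %d -> master %d (capped)\n", w, master_of[w]);

    return 0;
}

With the full loop every worker's master collapses to worker 0; with the cap several masters survive, which is what restores some parallelism.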

Of course, this is only temporary. I am away from the office this week, and I need to check with some colleagues about why the code was organized this way, and whether, perhaps, the peager implementation made some assumptions about other parts of the StarPU core that are no longer true.

Best regards,
--
Olivier

On 13 Feb 2018, at 17:36, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear Olivier,

Please find attached a plot of my experiments with various numbers of SPMD threads working on the LU panel factorisation. Using more threads is beneficial, but unfortunately the benefit is minuscule. I have also implemented the Fork-Join approach, wrapping the panel factorisation in OpenMP. I will show you the results soon. Thank you very much for your help!


Best wishes,
Maxim

<haswell_dgetrf_starpu_spmd.pdf>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 12 Feb 2018, at 22:01, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

Regarding the issue with 'peager' scalability, the unpatched master branch should be similar to the 1.2.3 version. However, since 'peager' is still considered experimental, it is probably better to switch to the master branch, as fixes will likely arrive there first.

Best regards,
--
Olivier

On 12 Feb 2018, at 22:50, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

I’m using version 1.2.3, downloaded from the INRIA website. Would it be better to use the “rolling” edition? I will install it tomorrow morning!


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 12 Feb 2018, at 21:46, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

My patch was against StarPU's master branch as of Saturday morning. Which version of StarPU are you currently using?

Best regards,
--
Olivier

On 12 Feb 2018, at 16:20, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Olivier,

Thank you very much for your reply and the patch. I have applied the patch to the code and will re-run the experiments. I will get back to you with the results. I think one of the hunks in the patch was not applied successfully. Please find below the output of the patch command and the file with the rejects. Thank you and have a good day ahead!


Best wishes,
Maksims

<log>
<parallel_eager.c.rej>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 10 Feb 2018, at 11:17, Olivier Aumage <olivier.aumage@inria.fr> wrote:

Hi Maxim,

I am not familiar with the peager implementation of StarPU (nor, I believe, is Samuel). I have had a quick look at the peager policy code, and there seems to be an issue with the initialization phase of the policy. Or perhaps I do not understand the rationale behind it...

Can you check whether the attached quick patch improves the scalability of your code? You can apply it with the following command:
$ patch -p1 <../peager.patch

This is only meant to be a temporary fix, however. I need to check with the people who wrote the code about what the initial intent was.

Hope this helps.
Best regards,
--
Olivier

<peager.patch>

On 8 Feb 2018, at 16:19, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear all,

I have implemented the parallel panel factorisation in LU with StarPU's SPMD capability. Here are a few answers to my own questions:

1) Am I passing the barrier structure correctly, so that it is “shared” amongst all the threads and the threads “know” about the status of the other threads? To achieve this I pass the barrier structure by reference.

Yes, it is passed correctly. All the threads “know about” and share the values inside the barrier structure.

2) Maybe it is the tile descriptors that “block” the execution of the threads inside the panel? Maybe the threads with ranks 1 and 2 cannot proceed, since all the tiles are blocked by rank 0? Should I therefore conclude that “blocking” the tiles like I do is incorrect?

Tile “blocking” is correct. The problem did not lie in the tile “blocking”, but rather in the use of a non-parallel StarPU scheduler. According to the StarPU handbook, only two schedulers, “pheft” and “peager”, support the SPMD mode of execution.

3) Is there a way to pass a variable to the codelet to set the “max_parallelism” value instead of hard-coding it?

Since the codelet is a static structure, I am setting the maximum number of threads by accessing the “max_parallelism" value as follows. It is set right before inserting the SPMD task:

// Set maximum no. of threads per panel factorisation
core_starpu_codelet_zgetrf_spmd.max_parallelism = mtpf;

Please find attached a performance plot of the LU factorisation (with and without SPMD functionality) executed on a 20-core Haswell machine. I believe something goes terribly wrong, since the SPMD performance numbers are so low. I have used the following commands to execute the tests:

export MKL_NUM_THREADS=20
export OMP_NUM_THREADS=20
export OMP_PROC_BIND=true
export STARPU_NCPU=20
export STARPU_SCHED=peager
export PLASMA_TUNING_FILENAME=...

numactl --interleave=all ./test dgetrf --dim=... --nb=... --ib=... --mtpf=... --iter=...

Any insight and help in recovering the performance numbers would be greatly appreciated. Thank you and have a good day!


Best wishes,
Maxim

<haswell_dgetrf2_starpu.pdf>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 5 Feb 2018, at 12:16, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Dear all,

I’m on a mission to apply the SPMD capability of StarPU (http://starpu.gforge.inria.fr/doc/html/TasksInStarPU.html#ParallelTasks) to the panel factorisation stage of the LU algorithm. Please see the attached figure for an example of my scenario.

The matrix is viewed as a set of tiles (rectangular or square matrix blocks). A column of tiles is called a panel.

In the first stage of the LU algorithm I would like to take a panel, find the pivots, swap the necessary rows, and scale and update the underlying matrix elements. To track the dependencies I created tile descriptors that keep the information about the access mode and the tile handle. Essentially, the tile descriptors are used to “lock” the entire panel; all the operations inside are parallelised manually using a custom barrier and auxiliary arrays that store the maximum values and their indices. To assign a particular task to a thread (processing the panel factorisation) I use ranks. Depending on its rank, each thread gets its own portion of the data to work on. Inside the panel the threads are synchronised manually and wait for each other at the custom barrier.

Please pay attention to the attached figure. A panel consisting of five tiles is passed to the StarPU task. Imagine we have three threads processing the panel. To find the first pivot we assign the first column of each tile to a thread in a round-robin manner (0, 1, 2, 0, 1). Once each thread has found the maximum of its tiles, the master thread (with rank 0) selects the global maximum. I would like to apply the SPMD capability of StarPU to process the panel and use a custom barrier inside.
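
As a purely illustrative sketch of this pivot-search step (not the actual core_zgetrf; the OpenMP threads and the custom barrier are replaced by plain loops, and all sizes are made up), the round-robin assignment and the rank-0 reduction look like this:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MT   5      // number of tiles in the panel
#define MB   4      // rows per tile
#define MTPF 3      // "threads" working on the panel

int main(void)
{
    double panel[MT][MB];
    for (int t = 0; t < MT; t++)
        for (int r = 0; r < MB; r++)
            panel[t][r] = (double)rand() / RAND_MAX - 0.5;

    double max_val[MTPF];
    int    max_idx[MTPF];

    // Phase 1 (done in parallel in the real kernel): tile t is handled by
    // rank t % MTPF, i.e. the round-robin assignment 0, 1, 2, 0, 1.
    for (int rank = 0; rank < MTPF; rank++) {
        max_val[rank] = 0.0;
        max_idx[rank] = -1;
        for (int t = rank; t < MT; t += MTPF)
            for (int r = 0; r < MB; r++)
                if (fabs(panel[t][r]) > fabs(max_val[rank])) {
                    max_val[rank] = panel[t][r];
                    max_idx[rank] = t*MB + r;   // global row index
                }
    }
    // ...in the real kernel all ranks would wait here at the custom barrier...

    // Phase 2: rank 0 selects the global pivot from the per-rank maxima.
    double pv    = max_val[0];
    int    pivot = max_idx[0];
    for (int rank = 1; rank < MTPF; rank++)
        if (fabs(max_val[rank]) > fabs(pv)) {
            pv    = max_val[rank];
            pivot = max_idx[rank];
        }

    printf("pivot row = %d, value = %f\n", pivot, pv);
    return 0;
}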

Please consider the C code below. The code runs, but the threads wait indefinitely at the first barrier. My questions are:

1) Am I passing the barrier structure correctly, so that it is “shared” amongst all the threads and the threads “know” about the status of the other threads? To achieve this I pass the barrier structure by reference.
2) Maybe it is the tile descriptors that “block” the execution of the threads inside the panel? Maybe the threads with ranks 1 and 2 cannot proceed, since all the tiles are blocked by rank 0? Should I therefore conclude that “blocking” the tiles like I do is incorrect?
3) Is there a way to pass a variable to the codelet to set the “max_parallelism” value instead of hard-coding it?

4) If I may, I would like to make a general comment. I like StarPU very much, and I think you have invested a great deal of time and effort into it. Thank you. But to my mind the weakest point (from my user experience) is passing values to StarPU while inserting a task: there is no type checking of the variables there. The same applies to the routine “starpu_codelet_unpack_args()”, when you want to obtain the values “on the other side”. Sometimes it becomes a nightmare and a trial-and-error exercise. If type checks could be enforced there, it would make a user's life much easier.
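
To make this concern concrete, here is a small, hedged example of the kind of silent mismatch that can occur (the codelet and kernel names are invented; only the StarPU calls themselves are real): the packed arguments are matched purely by position and raw byte count, so unpacking them in the wrong order compiles cleanly and just reinterprets the bytes.

#include <stdio.h>
#include <starpu.h>

static void demo_kernel(void *buffers[], void *cl_arg)
{
    (void)buffers;
    float scale;
    int   n;
    // WRONG on purpose: the inserter packed (n, scale), but we unpack in the
    // opposite order. This compiles cleanly and silently yields garbage values.
    starpu_codelet_unpack_args(cl_arg, &scale, &n);
    printf("scale = %g, n = %d  (garbage, no error reported)\n", scale, n);
}

static struct starpu_codelet demo_cl =
{
    .cpu_funcs = { demo_kernel },
    .nbuffers  = 0,
};

int main(void)
{
    if (starpu_init(NULL) != 0) return 1;

    int   n     = 42;
    float scale = 1.5f;

    // Arguments are packed as raw bytes, in this order: int, then float.
    starpu_task_insert(&demo_cl,
                       STARPU_VALUE, &n,     sizeof(n),
                       STARPU_VALUE, &scale, sizeof(scale),
                       0);

    starpu_task_wait_for_all();
    starpu_shutdown();
    return 0;
}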

/******************************************************************************/
// StarPU LU panel factorisation function
void core_zgetrf(plasma_desc_t A, plasma_complex64_t **pnl, int *piv,
                 volatile int *max_idx, volatile plasma_complex64_t *max_val,
                 int ib, int rank, int mtpf, volatile int *info,
                 plasma_barrier_t *barrier)
{
    // (body of the panel factorisation omitted for brevity)
}

/******************************************************************************/
// StarPU ZGETRF SPMD CPU kernel
static void core_starpu_cpu_zgetrf_spmd(void *desc[], void *cl_arg) {

    plasma_desc_t A;
    int ib, mtpf, k, *piv;
    volatile int *max_idx, *info;
    volatile plasma_complex64_t *max_val;
    plasma_barrier_t *barrier;

    // Unpack scalar arguments
    starpu_codelet_unpack_args(cl_arg, &A, &max_idx, &max_val, &ib, &mtpf,
                               &k, &info, &barrier);

    int rank = starpu_combined_worker_get_rank();

    // Array of pointers to subdiagonal tiles in panel k (incl. diagonal tile k)
    plasma_complex64_t **pnlK =
        (plasma_complex64_t**) malloc((size_t)A.mt * sizeof(plasma_complex64_t*));
    assert(pnlK != NULL);

    printf("Panel: %d\n", k);

    // Unpack tile data
    for (int i = 0; i < A.mt; i++) {
        pnlK[i] = (plasma_complex64_t *) STARPU_MATRIX_GET_PTR(desc[i]);
    }

    // Unpack pivots vector
    piv = (int *) STARPU_VECTOR_GET_PTR(desc[A.mt]);

    // Call computation kernel
    core_zgetrf(A, pnlK, &piv[k*A.mb], max_idx, max_val,
                ib, rank, mtpf, info, barrier);

    // Deallocate container panel
    free(pnlK);
}

/******************************************************************************/
// StarPU SPMD codelet
static struct starpu_codelet core_starpu_codelet_zgetrf_spmd =
{
    .type            = STARPU_SPMD,
    .max_parallelism = 2,
    .cpu_funcs       = { core_starpu_cpu_zgetrf_spmd },
    .cpu_funcs_name  = { "zgetrf_spmd" },
    .nbuffers        = STARPU_VARIABLE_NBUFFERS,
};

/******************************************************************************/
// StarPU task inserter
void core_starpu_zgetrf_spmd(plasma_desc_t A, starpu_data_handle_t hPiv,
                             volatile int *max_idx, volatile plasma_complex64_t *max_val,
                             int ib, int mtpf, int k,
                             volatile int *info, plasma_barrier_t *barrier) {

    // Pointer to first (top) tile in panel k
    struct starpu_data_descr *pk = &(A.tile_desc[k*(A.mt+k+1)]);

    // Set access modes for subdiagonal tiles in panel k (incl. diagonal tile k)
    for (int i = 0; i < A.mt; i++) {
        (pk+i)->mode = STARPU_RW;
    }

    int retval = starpu_task_insert(
        &core_starpu_codelet_zgetrf_spmd,
        STARPU_VALUE,               &A,         sizeof(plasma_desc_t),
        STARPU_DATA_MODE_ARRAY,      pk,        A.mt,
        STARPU_RW,                   hPiv,
        STARPU_VALUE,               &max_idx,   sizeof(volatile int*),
        STARPU_VALUE,               &max_val,   sizeof(volatile plasma_complex64_t*),
        STARPU_VALUE,               &ib,        sizeof(int),
        STARPU_VALUE,               &mtpf,      sizeof(int),
        STARPU_VALUE,               &k,         sizeof(int),
        STARPU_VALUE,               &info,      sizeof(volatile int*),
        STARPU_VALUE,               &barrier,   sizeof(plasma_barrier_t*),
        STARPU_NAME,                "zgetrf",
        0);

    STARPU_CHECK_RETURN_VALUE(retval, "core_starpu_zgetrf: starpu_task_insert() failed");
}


Best wishes,
Maxim

<lu_panel_fact.jpg>

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 24 Jan 2018, at 17:52, Maxim Abalenkov <maxim.abalenkov@gmail.com> wrote:

Hello Samuel,

Thank you very much! Yes, in this particular use case “STARPU_NONE” would come in handy and make the source code much more “elegant”.


Best wishes,
Maxim

Maxim Abalenkov \\ maxim.abalenkov@gmail.com
+44 7 486 486 505 \\ http://mabalenk.gitlab.io

On 24 Jan 2018, at 17:47, Samuel Thibault <samuel.thibault@inria.fr> wrote:

Hello,

Maxim Abalenkov, on Mon 15 Jan 2018 18:04:48 +0000, wrote:
I have a very simple question. What is the overhead of using the STARPU_NONE
access mode for some handles in the STARPU_DATA_MODE_ARRAY?

It is not implemented; we hadn't thought it could be useful. I have now added it to the TODO list (but that list is very long and doesn't tend to progress quickly).

The overhead would be quite small: StarPU would just write it down in the array of data to fetch, and simply not process that element. Of course, the theoretical complexity would still be O(number of data items).
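
For context, the usage being requested would look roughly like the sketch below. As noted above it is not implemented, so the code is purely hypothetical; the tile_handle field is invented for this example, and only struct starpu_data_descr, STARPU_NONE, STARPU_RW and STARPU_DATA_MODE_ARRAY are real StarPU names.

// Hypothetical usage only: per this thread, STARPU_NONE entries in a
// STARPU_DATA_MODE_ARRAY are not supported by StarPU.
struct starpu_data_descr descrs[A.mt];
for (int i = 0; i < A.mt; i++) {
    descrs[i].handle = A.tile_handle[i];                  // hypothetical field
    descrs[i].mode   = (i < k) ? STARPU_NONE : STARPU_RW; // skip tiles above the panel
}

starpu_task_insert(&core_starpu_codelet_zgetrf_spmd,
                   STARPU_DATA_MODE_ARRAY, descrs, A.mt,
                   0);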

In order to avoid using complicated offsets in my computation routines
I would like to pass them a column of matrix tiles, while setting the
“unused” tiles to “STARPU_NONE”.

I see.

Samuel











