
starpu-devel - Re: [Starpu-devel] Worker Binding Problem


Re: [Starpu-devel] Worker Binding Problem


  • From: Berenger Bramas <berenger.bramas@inria.fr>
  • To: Samuel Thibault <samuel.thibault@inria.fr>
  • Cc: starpu-devel@lists.gforge.inria.fr
  • Subject: Re: [Starpu-devel] Worker Binding Problem
  • Date: Wed, 9 Sep 2015 11:42:01 +0200 (CEST)
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Thanks for your answer.

Yes, my CPU has two cores (and four threads with hyperthreading).
The thing is that I am trying to get the same behavior in my application, from
the user's point of view, whether it uses OpenMP or StarPU,
and I was expecting the first threads to be bound to the different cores
and then to the different SMT threads (if enabled),
considering that using SMT may not be better, but may not be worse either.

In my case, with a highly optimized AVX-based kernel (again, with the same
amount of work per thread):
in sequential, the thread takes 0.757221s;
using 4 threads with SMT, 1.3s;
but using 4 threads without SMT, 1.5s.

So that is OK: in my case, if the user asks for more threads than the number of
cores because they want to use SMT, I will do a manual binding.

But I still have some trouble on a dual dodeca-core Haswell Intel® Xeon®
E5-2680 (with hyperthreading disabled) (note that I updated my hwloc to 1.11).
I attach to this email a very simple test case which simply performs a spin
loop (with a volatile to generate some memory traffic).
The OpenMP code is as follows:
===========================================================
// Execute computation with openmp
#pragma omp parallel num_threads(numThreads)
{
    // Wait all
    #pragma omp barrier
    // Compute
    const double startTime = omp_get_wtime();
    work();
    const double endTime = omp_get_wtime();

    // Print time
    #pragma omp critical(PRINT)
    {
        std::cout << "[" << omp_get_thread_num() << "]Done " << " = " << (endTime-startTime) << "s" << std::endl;
    }
}
===========================================================


Compilation is done with:
$ g++ -Wall -fopenmp -std=c++11 -mavx testStarPUOpenMPv2.cpp -o
testStarPUOpenMPv2.exe -lgomp
-I/projets/scalfmm/Starpu/StarPU/installwithfxtverbose/include/starpu/1.3/
-L/projets/scalfmm/Starpu/StarPU/installwithfxtverbose/lib/ -lstarpu-1.3
The number of threads can be chosen via the OpenMP environment variable
OMP_NUM_THREADS:
$ OMP_NUM_THREADS=24 OMP_PROC_BIND=TRUE ./testStarPUOpenMPv2.exe
From the results below we can see that some workers failed to do their job in
the normal time.

With two threads it is OK:
Test with OpenMP:
[0]Done = 0.826635s
[1]Done = 0.828059s
StarPU:
[0]Done = 0.813087s
[1]Done = 0.826654s

With three it starts to be strange:
Test with OpenMP:
[0]Done = 0.825707s
[2]Done = 0.826624s
[1]Done = 0.827749s
StarPU:
[2]Done = 0.826262s
[0]Done = 0.826653s
[1]Done = 1.64255s

With 8:
Test with OpenMP:
[0]Done = 0.819424s
[6]Done = 0.826602s
[4]Done = 0.826648s
[5]Done = 0.826813s
[3]Done = 0.826964s
[1]Done = 0.827056s
[2]Done = 0.827338s
[7]Done = 0.827339s
StarPU:
[5]Done = 0.825804s
[4]Done = 0.82598s
[7]Done = 0.826542s
[6]Done = 0.826605s
[0]Done = 0.826869s
[2]Done = 1.6363s
[1]Done = 1.64211s
[3]Done = 1.64718s

Finally with 24:
Test with OpenMP:
[0]Done = 0.814564s
[23]Done = 0.826475s
[9]Done = 0.826668s
[22]Done = 0.826724s
[16]Done = 0.826892s
[18]Done = 0.82697s
[7]Done = 0.826996s
[17]Done = 0.827005s
[12]Done = 0.827061s
[20]Done = 0.827099s
[13]Done = 0.827134s
[1]Done = 0.827237s
[15]Done = 0.82724s
[11]Done = 0.827262s
[21]Done = 0.82729s
[19]Done = 0.827358s
[6]Done = 0.827403s
[4]Done = 0.827468s
[10]Done = 0.827473s
[8]Done = 0.827505s
[14]Done = 0.827493s
[3]Done = 0.827588s
[5]Done = 0.827671s
[2]Done = 0.829008s
StarPU:
[0]Done = 0.82736s
[17]Done = 1.61875s
[19]Done = 1.6194s
[4]Done = 1.63148s
[23]Done = 1.63316s
[22]Done = 1.63799s
[18]Done = 1.63861s
[6]Done = 1.63878s
[3]Done = 1.63924s
[11]Done = 1.63968s
[15]Done = 1.64206s
[7]Done = 1.64275s
[12]Done = 1.64313s
[5]Done = 1.64386s
[13]Done = 1.64457s
[21]Done = 1.64462s
[20]Done = 1.64464s
[1]Done = 1.64517s
[8]Done = 1.64626s
[10]Done = 1.64649s
[14]Done = 1.64715s
[2]Done = 1.64786s
[9]Done = 1.65221s
[16]Done = 1.64785s

The binding is correct (from logical CPU 0 to 23),
so I do not know what I am doing wrong, but OpenMP does the job as expected and
StarPU does not.
I must have forgotten something in my code, but I have the same behavior with
my real application.

Thanks.

Bérenger Bramas

HiePACS Project

Tel (05 24 57) 40 76
INRIA BORDEAUX Sud Ouest


----- Original Message -----
| From: "Samuel Thibault" <samuel.thibault@inria.fr>
| To: "Berenger Bramas" <berenger.bramas@inria.fr>
| Cc: starpu-devel@lists.gforge.inria.fr
| Sent: Wednesday, September 9, 2015 10:13:04
| Subject: Re: [Starpu-devel] Worker Binding Problem
|
| Berenger Bramas, on Wed 09 Sep 2015 10:08:06 +0200, wrote:
| > My processor is a: Intel® Core™ i7-4610M CPU @ 3.00GHz × 4
|
| This CPU has only two cores, doesn't it? (I don't think anybody has
| worked on distributing workers on the threads of the same cores) Or
| do you mean you have 4 sockets and thus 8 cores? Do you pass any
| environment variable to the program in the failing case?
|
| Samuel
|
//
// Compilation:
// g++ -Wall -fopenmp -std=c++11 -mavx testStarPUOpenMPv2.cpp -o testStarPUOpenMPv2.exe -lgomp -I/projets/scalfmm/Starpu/StarPU/installwithfxtverbose/include/starpu/1.3/ -L/projets/scalfmm/Starpu/StarPU/installwithfxtverbose/lib/ -lstarpu-1.3
//
// Examples:
// OMP_NUM_THREADS=4 ./testStarPUOpenMPv2.exe
// OMP_NUM_THREADS=3 OMP_PROC_BIND=TRUE numactl -l ./testStarPUOpenMPv2.exe


#include <cassert>
#include <memory>
#include <functional> // needed for std::function
#include <omp.h>
#include <cstdlib>
#include <unistd.h>
#include <iostream>
#include <cstring>

#include <starpu.h>



static void BindToFunc(void */*buffers*/[], void *cl_arg){
    void* ptr;
    starpu_codelet_unpack_args(cl_arg, &ptr);
    std::function<void(void)>* func = (std::function<void(void)>*) ptr;
    (*func)();
}

void work(){
    size_t Nb = 100000000;
    volatile size_t idx = 0;
    while(idx != Nb){
        idx += 1;
        idx -= 1;
        idx += 1;
    }
}

void runs(){
    const int numThreads = omp_get_max_threads();


    std::cout << "Test with OpenMP:" << std::endl;

    // Execute computation with openmp
    #pragma omp parallel num_threads(numThreads)
    {
        // Wait all
        #pragma omp barrier
        // Compute
        const double startTime = omp_get_wtime();
        work();
        const double endTime = omp_get_wtime();

        // Print time
        #pragma omp critical(PRINT)
        {
            std::cout << "[" << omp_get_thread_num() << "]Done  " << " = " << (endTime-startTime) << "s" << std::endl;
        }
    }

    std::cout << "Test with StarPU:" << std::endl;

    {
        // Init starpu
        struct starpu_conf conf;
        assert(starpu_conf_init(&conf) == 0);
        // Use the same number of threads
        conf.ncpus = numThreads;
        assert(starpu_init(&conf) == 0);

        // We need a barrier and a mutex
        starpu_pthread_barrier_t barr;
        assert(starpu_pthread_barrier_init(&barr, NULL, numThreads) == 0);
        starpu_pthread_mutex_t printMutex;
        starpu_pthread_mutex_init(&printMutex, NULL);

        // Core part
        std::function<void(void)> func = [&](){
            // Barrier between workers
            int ret = starpu_pthread_barrier_wait(&barr);
            assert(ret == 0 || ret == PTHREAD_BARRIER_SERIAL_THREAD);
            // Compute
            const double startTime = omp_get_wtime();
            work();
            const double endTime = omp_get_wtime();
            // Print res
            starpu_pthread_mutex_lock(&printMutex);
            {
                std::cout << "[" << starpu_worker_get_id() << "]Done  " << " = " << (endTime-startTime) << "s" << std::endl;
            }
            starpu_pthread_mutex_unlock(&printMutex);
        };

        // Create a codelete to call the functional
        starpu_codelet perWorkerCodelet;
        memset(&perWorkerCodelet, 0, sizeof(perWorkerCodelet));
        perWorkerCodelet.cpu_funcs[0] = BindToFunc;
        perWorkerCodelet.where |= STARPU_CPU;
        perWorkerCodelet.nbuffers = 0;
        perWorkerCodelet.name = "perWorkerCodelet";

        // Insert one task per worker
        for(int idxThread = 0 ; idxThread < numThreads ; ++idxThread){
            struct starpu_task* const task = starpu_task_create();
            task->cl = &perWorkerCodelet;
            // Store args values
            void* funcptr = (void*)&func;
            starpu_codelet_pack_args((void**)&task->cl_arg, &task->cl_arg_size,
                                     STARPU_VALUE, &funcptr, sizeof(void*),
                                     0);
            // This task is only for one worker
            task->execute_on_a_specific_worker = 1;
            task->workerid = idxThread;
            assert(starpu_task_submit(task) == 0);
        }

        // Wait all
        starpu_task_wait_for_all();
        starpu_shutdown();
        // Dealloc
        starpu_pthread_mutex_destroy(&printMutex);
        starpu_pthread_barrier_destroy(&barr);
    }
}

/////////////////////////////////////////////////////////////////////////
/// Main
/////////////////////////////////////////////////////////////////////////


int main(int argc, char* argv[]){
    runs();
    return 0;
}








