
starpu-devel - Re: [Starpu-devel] StarPU overhead for MPI



Re: [Starpu-devel] StarPU overhead for MPI


  • From: Philippe SWARTVAGHER <philippe.swartvagher@inria.fr>
  • To: starpu-devel@lists.gforge.inria.fr
  • Cc: Alexandre Denis <alexandre.denis@inria.fr>
  • Subject: Re: [Starpu-devel] StarPU overhead for MPI
  • Date: Wed, 5 Feb 2020 14:55:27 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

As Samuel requested, here is the backtrace between nm_sr_send_isend and starpu_mpi_send:

#0  _starpu_mpi_isend_data_func (req=0x555555702db0)
    at ../../../mpi/src/nmad/starpu_mpi_nmad.c:123
#1  0x00007ffff7e0b360 in _starpu_mpi_submit_ready_request (
    arg=<optimized out>) at ../../../mpi/src/nmad/starpu_mpi_nmad.c:412
#2  0x00007ffff7ed5b96 in _starpu_data_acquire_fetch_data_callback (
    arg=0x5555556e29d0) at ../../src/datawizard/user_interactions.c:161
#3  0x00007ffff7ebf270 in _starpu_create_request_to_fetch_data (
    handle=handle@entry=0x555555753600,
    dst_replicate=dst_replicate@entry=0x555555753700,
    mode=mode@entry=STARPU_R, is_prefetch=is_prefetch@entry=0,
    async=async@entry=1,
    callback_func=callback_func@entry=0x7ffff7ed5b50 <_starpu_data_acquire_fetch_data_callback>, callback_arg=0x5555556e29d0, prio=0,
    origin=0x7ffff7f51bd0 "_starpu_data_acquire_launch_fetch")
    at ../../src/datawizard/coherency.c:532
#4  0x00007ffff7ebf77d in _starpu_fetch_data_on_node (handle=0x555555753600,
    node=0, dst_replicate=0x555555753700, mode=STARPU_R, detached=0,
    is_prefetch=0, async=1,
    callback_func=0x7ffff7ed5b50 <_starpu_data_acquire_fetch_data_callback>,
    callback_arg=0x5555556e29d0, prio=0,
    origin=0x7ffff7f51bd0 "_starpu_data_acquire_launch_fetch")
    at ../../src/datawizard/coherency.c:768
#5  0x00007ffff7ed61d6 in _starpu_data_acquire_launch_fetch (
    callback_arg=0x5555556e29d0,
    callback=0x7ffff7ed5b50 <_starpu_data_acquire_fetch_data_callback>,
    async=1, wrapper=0x5555556e29d0)
    at ../../src/datawizard/user_interactions.c:135
#6  _starpu_data_acquire_continuation_non_blocking (arg=0x5555556e29d0)
    at ../../src/datawizard/user_interactions.c:170
#7  starpu_data_acquire_cb_pre_sync_callback (arg=0x5555556e29d0)
    at ../../src/datawizard/user_interactions.c:185
#8  starpu_data_acquire_cb_pre_sync_callback (arg=0x5555556e29d0)
    at ../../src/datawizard/user_interactions.c:174
#9  0x00007ffff7e5c919 in _starpu_handle_job_termination (
    j=j@entry=0x55555573f7b0) at ../../src/core/jobs.c:446
#10 0x00007ffff7e9487c in _starpu_repush_task ()
    at ../../src/core/sched_policy.c:523
#11 0x00007ffff7e94e18 in _starpu_push_task (j=j@entry=0x55555573f7b0)
    at ../../src/core/sched_policy.c:428
#12 0x00007ffff7e5d6b0 in _starpu_enforce_deps_and_schedule (
    j=j@entry=0x55555573f7b0) at ../../src/core/jobs.c:653
#13 0x00007ffff7e602a6 in _starpu_submit_job (j=j@entry=0x55555573f7b0,
    nodeps=nodeps@entry=0) at ../../src/core/task.c:575
#14 0x00007ffff7e62c4c in _starpu_task_submit () at ../../src/core/task.c:957
#15 0x00007ffff7e63357 in starpu_task_submit (task=<optimized out>)
    at ../../src/core/task.c:978
#16 0x00007ffff7e63385 in _starpu_task_submit_internally (task=<optimized out>)
    at ../../src/core/task.c:985
#17 0x00007ffff7ed7005 in starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids (handle=0x555555753600, node=node@entry=0,
    mode=mode@entry=STARPU_R, callback=<optimized out>,
    arg=arg@entry=0x555555702db0,
    sequential_consistency=sequential_consistency@entry=1,
    quick=<optimized out>, pre_sync_jobid=<optimized out>,
    post_sync_jobid=<optimized out>)
    at ../../src/datawizard/user_interactions.c:250
#18 0x00007ffff7dfdffd in _starpu_mpi_isend_irecv_common (
    sequential_consistency=1, mode=STARPU_R, req=0x555555702db0)
    at ../../../mpi/src/starpu_mpi.c:74
#19 _starpu_mpi_isend_common (data_handle=data_handle@entry=0x555555753600,
    dest=dest@entry=0, data_tag=data_tag@entry=1, comm=<optimized out>,
    detached=detached@entry=0, sync=sync@entry=0, prio=0, callback=0x0,
    arg=0x0, sequential_consistency=1) at ../../../mpi/src/starpu_mpi.c:74
#20 0x00007ffff7dfe094 in _starpu_mpi_isend_common (sequential_consistency=1,
    arg=0x0, callback=0x0, prio=<optimized out>, sync=0, detached=0,
    comm=<optimized out>, data_tag=1, dest=0, data_handle=0x555555753600)
    at ../../../mpi/src/starpu_mpi.c:85
#21 starpu_mpi_isend_prio (data_handle=0x555555753600,
    public_req=public_req@entry=0x7fffffffe158, dest=0, data_tag=1,
    prio=<optimized out>, comm=<optimized out>)
    at ../../../mpi/src/starpu_mpi.c:85
#22 0x00007ffff7dfe745 in starpu_mpi_send_prio (data_handle=<optimized out>,
    dest=<optimized out>, data_tag=<optimized out>, prio=<optimized out>,
    comm=<optimized out>) at ../../../mpi/src/starpu_mpi.c:121
#23 0x0000555555555414 in main (argc=<optimized out>, argv=<optimized out>)
    at ../../../mpi/tests/sendrecv_bench.c:140


As a reminder: traversing this call stack can take 14 µs, according to FxT traces.


I also benchmarked a pure MPI version of this benchmark, with OpenMPI 3 and OpenMPI 4: both curves are slightly below the pure NewMadeleine curve.
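
For reference, the pure MPI version boils down to a plain ping-pong around the same measured region. Below is a minimal sketch (message size, tag and output formatting are mine, not the exact code I ran):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI ping-pong between ranks 0 and 1; the actual benchmark
 * repeats the measurement for a whole range of message sizes. */
int main(int argc, char **argv)
{
    int rank;
    char byte = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int other = (rank == 0) ? 1 : 0;
    double start = MPI_Wtime();
    if (rank == 0)
    {
        MPI_Send(&byte, 1, MPI_CHAR, other, 42, MPI_COMM_WORLD);
        MPI_Recv(&byte, 1, MPI_CHAR, other, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    else
    {
        MPI_Recv(&byte, 1, MPI_CHAR, other, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&byte, 1, MPI_CHAR, other, 42, MPI_COMM_WORLD);
    }
    double end = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency estimate: %.2f us\n", (end - start) * 1e6 / 2.0);

    MPI_Finalize();
    return 0;
}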



On 29/01/2020 at 17:19, Philippe SWARTVAGHER wrote:
Hello,

Attached to this mail is a plot from the execution of mpi/tests/sendrecv_bench. It shows the latency of data transfers as a function of data size. The reference curve is raw NewMadeleine, nmad/examples/benchmarks/nm_bench_sendrecv. The two programs are the same, except that in the StarPU benchmark the measured period runs from just before starpu_mpi_send to just after starpu_mpi_recv. The plot should therefore show the StarPU-MPI overhead.
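
To make the measured region explicit, here is a minimal sketch of what the StarPU side does (tag, payload and the use of starpu_timing_now() are my own choices, not the exact code of sendrecv_bench):

#include <stdio.h>
#include <stdint.h>
#include <starpu.h>
#include <starpu_mpi.h>

/* Minimal StarPU-MPI ping-pong between ranks 0 and 1, timing the span
 * from just before starpu_mpi_send to just after starpu_mpi_recv. */
int main(int argc, char **argv)
{
    int rank;
    float payload = 0.0f;
    starpu_data_handle_t handle;

    if (starpu_init(NULL) != 0) return 1;
    if (starpu_mpi_init(&argc, &argv, 1) != 0) return 1;
    starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);

    starpu_variable_data_register(&handle, STARPU_MAIN_RAM,
                                  (uintptr_t)&payload, sizeof(payload));

    int other = (rank == 0) ? 1 : 0;
    double start = starpu_timing_now();   /* microseconds */
    if (rank == 0)
    {
        starpu_mpi_send(handle, other, 42, MPI_COMM_WORLD);
        starpu_mpi_recv(handle, other, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    else
    {
        starpu_mpi_recv(handle, other, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        starpu_mpi_send(handle, other, 42, MPI_COMM_WORLD);
    }
    double end = starpu_timing_now();

    if (rank == 0)
        printf("one-way latency estimate: %.2f us\n", (end - start) / 2.0);

    starpu_data_unregister(handle);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}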

The overhead seems rather large (a small communication takes 10 times longer). Do you know what could be the origin of this overhead?

StarPU was compiled without FXT and without --enable-debug.

I tried adding events to the trace, to see which function could be consuming a lot of time. I did this on dalton/joe01. A transfer of 0 bytes takes 109 µs, both in median and in average. Nothing really relevant appeared, but 14 µs elapse between just before the call to starpu_mpi_recv and just after the call to nm_sr_recv_irecv.
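
One crude way to take such point-to-point measurements without relying on the tracing infrastructure is to bracket the span with raw timestamps instead of trace events; a sketch (the helper is mine, not StarPU API; one call would go just before starpu_mpi_recv in the benchmark, the other just after nm_sr_recv_irecv in the nmad backend):

#include <stdio.h>
#include <time.h>

/* Hypothetical timestamp helper (microseconds), not StarPU API. */
static inline double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    double t0 = now_us();
    /* ... code path under investigation ... */
    double t1 = now_us();
    printf("elapsed: %.2f us\n", t1 - t0);
    return 0;
}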


Any idea?

--
Philippe SWARTVAGHER

PhD student
TADaaM team, Inria Bordeaux Sud-Ouest





