- From: Cyril Roelandt <cyril.roelandt@inria.fr>
- To: "starpu-devel@lists.gforge.inria.fr" <starpu-devel@lists.gforge.inria.fr>
- Subject: [Starpu-devel] Drivers, schedulers: where should sched_mutex be locked ?
- Date: Wed, 27 Jun 2012 16:48:38 +0200
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi!
I'm currently trying to write a simple work-stealing module that could
be plugged into our schedulers (obviously into the work-stealing policy,
but also into the heft policy, or any other policy that may need such a
module).
The code would look like this:
/* In the scheduler */
static struct starpu_task *pop(void)
{
	struct starpu_task *task = NULL;

	/* Try to pop a task in a way that is specific to this scheduler. */
	...

	/* If no task has been popped, try the generic way. */
	if (!task)
		task = mod_ws_pop_task(worker);
	return task;
}
/* In the work-stealing module: just try to pick a task from any
   worker; this is a dummy implementation. */
static struct starpu_task *mod_ws_pop_task(unsigned worker)
{
	struct starpu_task *task = NULL;
	unsigned workerid;

	for (workerid = 0; workerid < nworkers; workerid++)
	{
		struct _starpu_worker *w = _starpu_get_worker_struct(workerid);
		pthread_mutex_lock(w->sched_mutex);
		task = _starpu_pop_local_task(w);
		pthread_mutex_unlock(w->sched_mutex);
		if (task)
			break;
	}
	return task;
}
This leads to a deadlock. Let's look at the code from
src/drivers/cpu/driver_cpu.c:
int _starpu_cpu_driver_run_once(struct starpu_driver *d)
{
	...
	_STARPU_PTHREAD_MUTEX_LOCK(cpu_worker->sched_mutex);

	struct _starpu_job *j;
	struct starpu_task *task;
	int res;

	task = _starpu_pop_task(cpu_worker);

	if (!task)
	{
		if (_starpu_worker_can_block(memnode))
			_starpu_block_worker(workerid,
					     cpu_worker->sched_cond,
					     cpu_worker->sched_mutex);

		_STARPU_PTHREAD_MUTEX_UNLOCK(cpu_worker->sched_mutex);

		return 0;
	};

	_STARPU_PTHREAD_MUTEX_UNLOCK(cpu_worker->sched_mutex);
	...
}
_starpu_pop_task() calls both _starpu_pop_local_task() and
policy.pop_task().
worker->sched_mutex needs to be taken to protect worker->local_tasks,
and so that the driver can wait on worker->sched_cond. But I do not
think we need the lock to be held when calling policy.pop_task().
So, since each worker already holds its own sched_mutex (taken by the
driver) when policy.pop_task() runs: if worker1 enters
mod_ws_pop_task() and tries to lock worker2->sched_mutex while worker2
enters mod_ws_pop_task() and tries to lock worker1->sched_mutex, well,
things obviously go wrong :)
The attached patch locks sched_mutex at two different places, rather
than once in the driver:
- in _starpu_pop_task(), before calling _starpu_pop_local_task();
- in the driver, before calling _starpu_block_worker().
I compiled with "--disable-cuda --disable-opencl", since I have not
modified the CUDA and OpenCL drivers yet. This breaks some
scheduler-related tests, but I think this should not be hard to fix (I
fixed eager). The test suite hangs with --enable-blocking-drivers, but I
think nobody has used this configure option in a gazillion years, and
the test suite also hangs with an unmodified version of the trunk :/
Applying this patch would make us lock/unlock a bit more, but we would
also spend less time in the critical sections, so I'm not sure it's a
performance issue.
What do you think?
Cyril.
diff --git a/trunk/src/core/sched_policy.c b/trunk/src/core/sched_policy.c
index 4217ce2..dfd731f 100644
--- a/trunk/src/core/sched_policy.c
+++ b/trunk/src/core/sched_policy.c
@@ -388,7 +388,9 @@ struct starpu_task *_starpu_pop_task(struct _starpu_worker *worker)
 pick:
 	/* perhaps there is some local task to be executed first */
+	_STARPU_PTHREAD_MUTEX_LOCK(worker->sched_mutex);
 	task = _starpu_pop_local_task(worker);
+	_STARPU_PTHREAD_MUTEX_UNLOCK(worker->sched_mutex);
 
 	if (!task && policy.pop_task)
 		task = policy.pop_task();
diff --git a/trunk/src/drivers/cpu/driver_cpu.c b/trunk/src/drivers/cpu/driver_cpu.c
index 84ba8b2..9513ef6 100644
--- a/trunk/src/drivers/cpu/driver_cpu.c
+++ b/trunk/src/drivers/cpu/driver_cpu.c
@@ -157,7 +157,6 @@ int _starpu_cpu_driver_run_once(struct starpu_driver *d)
 	_starpu_datawizard_progress(memnode, 1);
 	_STARPU_TRACE_END_PROGRESS(memnode);
-	_STARPU_PTHREAD_MUTEX_LOCK(cpu_worker->sched_mutex);
 
 	struct _starpu_job *j;
 	struct starpu_task *task;
@@ -168,14 +167,14 @@ int _starpu_cpu_driver_run_once(struct starpu_driver *d)
 	if (!task)
 	{
 		if (_starpu_worker_can_block(memnode))
+		{
+			_STARPU_PTHREAD_MUTEX_LOCK(cpu_worker->sched_mutex);
 			_starpu_block_worker(workerid,
 					     cpu_worker->sched_cond,
 					     cpu_worker->sched_mutex);
-
-		_STARPU_PTHREAD_MUTEX_UNLOCK(cpu_worker->sched_mutex);
-
+			_STARPU_PTHREAD_MUTEX_UNLOCK(cpu_worker->sched_mutex);
+		}
 		return 0;
 	};
-	_STARPU_PTHREAD_MUTEX_UNLOCK(cpu_worker->sched_mutex);
 
 	STARPU_ASSERT(task);
 	j = _starpu_get_job_associated_to_task(task);
diff --git a/trunk/src/sched_policies/eager_central_policy.c b/trunk/src/sched_policies/eager_central_policy.c
index 4b9da4d..eb6a092 100644
--- a/trunk/src/sched_policies/eager_central_policy.c
+++ b/trunk/src/sched_policies/eager_central_policy.c
@@ -64,7 +64,12 @@ static struct starpu_task *pop_every_task_eager_policy(void)
 static struct starpu_task *pop_task_eager_policy(void)
 {
-	return _starpu_fifo_pop_task(fifo, starpu_worker_get_id());
+	struct _starpu_worker *worker =
+		_starpu_get_worker_struct(starpu_worker_get_id());
+	_STARPU_PTHREAD_MUTEX_LOCK(worker->sched_mutex);
+	struct starpu_task *task = _starpu_fifo_pop_task(fifo, starpu_worker_get_id());
+	_STARPU_PTHREAD_MUTEX_UNLOCK(worker->sched_mutex);
+	return task;
 }
 
 struct starpu_sched_policy _starpu_sched_eager_policy =
diff --git a/trunk/tests/Makefile.am b/trunk/tests/Makefile.am
index aad98ef..d40cd7d 100644
--- a/trunk/tests/Makefile.am
+++ b/trunk/tests/Makefile.am
@@ -218,11 +218,7 @@ noinst_PROGRAMS = \
 	parallel_tasks/parallel_kernels \
 	parallel_tasks/parallel_kernels_spmd \
 	perfmodels/regression_based \
-	perfmodels/non_linear_regression_based \
-	sched_policies/data_locality \
-	sched_policies/execute_all_tasks \
-	sched_policies/simple_deps \
-	sched_policies/simple_cpu_gpu_sched
+	perfmodels/non_linear_regression_based
 
 if STARPU_HAVE_WINDOWS
 check_PROGRAMS = $(noinst_PROGRAMS)
@@ -550,7 +546,5 @@ perfmodels_non_linear_regression_based_SOURCES+=\
 	perfmodels/opencl_memset.c
 endif
 
-sched_policies_execute_all_tasks_LDFLAGS = -lm
-
 showcheck:
 	-cat $(TEST_LOGS) /dev/null