starpu-devel - [Starpu-devel] Overlapping communications in modular-heft

Subject: Developers list for StarPU


From: Lionel Eyraud-Dubois <lionel.eyraud-dubois@inria.fr>
To: starpu-devel@lists.gforge.inria.fr
Subject: [Starpu-devel] Overlapping communications in modular-heft
Date: Tue, 16 May 2017 13:57:24 +0200
List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hello,

I have been looking at the behavior of dmdas and modular-heft, and noticed that idle times often appear on GPUs, especially at the beginning of the DAG. This happens when a higher-priority task becomes ready and is assigned to a GPU, but its input data has to wait until the earlier prefetch requests for lower-priority tasks have finished. In the attached patches, I propose the following solution: in prio_queue, do not return the highest-priority task, but the highest-priority task whose data is available. If none exists, return nothing (leave the GPU idle). Additionally, I have added a callback to the prefetch calls so that the worker can be woken up when a prefetch terminates.
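
A minimal sketch of this selection rule, for illustration only. The `toy_task` type and `pick_ready_task` function are hypothetical stand-ins, not StarPU structures or API; the actual change is in the patches below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued task (not a StarPU type). */
struct toy_task {
    int priority;    /* larger value = more urgent */
    int data_ready;  /* 1 if every input is already valid on the worker's memory node */
};

/* Return the index of the highest-priority task whose data is ready,
 * or -1 if no task qualifies (the worker then stays idle). */
int pick_ready_task(const struct toy_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].data_ready)
            continue;  /* skip tasks still waiting for a transfer */
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    }
    return best;
}
```

With a queue holding priorities 10 (data not ready), 5, 7 and 1 (all ready), the sketch picks the task of priority 7 instead of blocking on the task of priority 10.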

In my simulation tests, this significantly improves the performance of Chameleon Cholesky compared to modular-heft or dmdas. The solution is not perfect, though: the wake-up is performed each time any prefetch finishes, whereas if it were possible to record *which* task is concerned, we could wake up the worker only when all the data for that task is available. I am not sure how much overhead this implies.
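
One possible way to record which task a prefetch belongs to is a per-task counter of outstanding transfers, sketched below. Everything here (`toy_prefetch_tracker`, `toy_tracker_init`, `toy_transfer_done`, `toy_count_wakeup`) is hypothetical and not part of StarPU or of the patches. The sketch is also not thread-safe; a real implementation would presumably need an atomic counter or a lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task bookkeeping (not a StarPU type). */
struct toy_prefetch_tracker {
    unsigned pending;      /* transfers still in flight for this task */
    void (*wake)(void *);  /* called once, when the last transfer finishes */
    void *wake_arg;
};

void toy_tracker_init(struct toy_prefetch_tracker *t, unsigned nrequests,
                      void (*wake)(void *), void *wake_arg)
{
    t->pending = nrequests;
    t->wake = wake;
    t->wake_arg = wake_arg;
    /* Nothing to transfer: the task's data is available right away. */
    if (nrequests == 0 && wake != NULL)
        wake(wake_arg);
}

/* To be invoked once per finished transfer, with the tracker as argument. */
void toy_transfer_done(void *arg)
{
    struct toy_prefetch_tracker *t = arg;
    assert(t->pending > 0);
    t->pending--;
    if (t->pending == 0 && t->wake != NULL)
        t->wake(t->wake_arg);
}

/* Example wake-up action: count how many times a worker would be woken. */
void toy_count_wakeup(void *arg)
{
    int *counter = arg;
    (*counter)++;
}
```

`toy_transfer_done` has the same `void (*)(void *)` shape as the `cb_fun` parameter added in the second patch, so a tracker of this kind could in principle be passed as `cb_arg`. The cost is one extra allocation and counter update per task, which is the overhead in question.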

Performance tests on sirocco01, for dpotrf_tile --nb=960, N=19200 --niter=6:

#sched                 seconds   Gflop/s   Deviation
dmdas                  0.958     2463.35   +- 41.40
modular-heft           0.895     2642.02   +- 112.81
OVERLAP modular-heft   0.879     2685.70   +- 65.11

What do you think of this solution? In my opinion, the alternatives would be either to know the order in which prefetches will finish and return the task whose data will be available first, or to know which tasks will become ready soon and issue prefetches for their data before they become ready.
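
The first alternative can be sketched as follows, assuming the scheduler had an estimate of when each task's last input will arrive. That estimate, the `toy_candidate` type and `pick_earliest_ready` are all hypothetical; nothing in the patches below provides them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate task (not a StarPU type). */
struct toy_candidate {
    int priority;          /* larger value = more urgent */
    double data_ready_at;  /* assumed estimate of when the last input arrives */
};

/* Return the index of the task whose data will be available first,
 * breaking ties in favor of the higher priority; -1 if the queue is empty. */
int pick_earliest_ready(const struct toy_candidate *c, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0
            || c[i].data_ready_at < c[best].data_ready_at
            || (c[i].data_ready_at == c[best].data_ready_at
                && c[i].priority > c[best].priority))
            best = (int)i;
    }
    return best;
}
```

Unlike the rule implemented in the patches, this variant always returns a task when the queue is non-empty, so the worker would wait on a known transfer rather than go to sleep.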

Best,

Lionel.

PS: Patch 4 is not really related to this feature; it is just a consistency fix.

From cb8d6e4a820618bd7518c68206d1ceeb0af18b23 Mon Sep 17 00:00:00 2001
From: Lionel ED <lionel.eyraud-dubois@inria.fr>
Date: Mon, 15 May 2017 12:52:57 +0200
Subject: [PATCH] prio_deque: function to pick a task whose data is already
 available

---
 src/sched_policies/prio_deque.c | 25 +++++++++++++++++++++++++
 src/sched_policies/prio_deque.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/src/sched_policies/prio_deque.c b/src/sched_policies/prio_deque.c
index 2d149a9..74c4199 100644
--- a/src/sched_policies/prio_deque.c
+++ b/src/sched_policies/prio_deque.c
@@ -38,6 +38,23 @@ static inline int pred_can_execute(struct starpu_task * t, void * pworkerid)
 	return 0;
 }
 
+static inline int pred_data_available(struct starpu_task *task, void* pmemory_node) 
+{
+	unsigned nbuffers = STARPU_TASK_GET_NBUFFERS(task);
+	unsigned buffer;
+	unsigned node = *(unsigned*) pmemory_node ;
+	
+	for (buffer = 0; buffer < nbuffers; buffer++)
+	{
+		starpu_data_handle_t handle = STARPU_TASK_GET_HANDLE(task, buffer);
+		enum starpu_data_access_mode mode = STARPU_TASK_GET_MODE(task, buffer);
+		if(mode & STARPU_R) /* Only consider data which we have to read */
+			if(handle->per_node[node].state == STARPU_INVALID)
+				return 0;
+	}
+	return 1; 
+}
+
 #define REMOVE_TASK(pdeque, first_task_field, next_task_field, predicate, parg)	\
 	{									\
 		struct starpu_task * t;						\
@@ -75,3 +92,11 @@ struct starpu_task * _starpu_prio_deque_deque_task_for_worker(struct _starpu_pri
 	STARPU_ASSERT(workerid >= 0 && (unsigned) workerid < starpu_worker_get_count());
 	REMOVE_TASK(pdeque, _tail, next, pred_can_execute, &workerid);
 }
+
+struct starpu_task * _starpu_prio_deque_pop_task_data_available(struct _starpu_prio_deque * pdeque, unsigned node, int* skipped) 
+{
+	STARPU_ASSERT(pdeque);
+//	STARPU_ASSERT(node >= 0 && node < starpu_worker_get_count());
+	REMOVE_TASK(pdeque, _head, prev, pred_data_available, &node);
+
+}
diff --git a/src/sched_policies/prio_deque.h b/src/sched_policies/prio_deque.h
index c2ab343..0fe7dcb 100644
--- a/src/sched_policies/prio_deque.h
+++ b/src/sched_policies/prio_deque.h
@@ -104,4 +104,6 @@ static inline struct starpu_task * _starpu_prio_deque_deque_task(struct _starpu_
  */
 struct starpu_task * _starpu_prio_deque_deque_task_for_worker(struct _starpu_prio_deque *, int workerid, int *skipped);
 
+struct starpu_task * _starpu_prio_deque_pop_task_data_available(struct _starpu_prio_deque * pdeque, unsigned node, int* skipped);
+
 #endif /* __PRIO_DEQUE_H__ */
-- 
1.9.1

From 769adb069a0531dd3e4121ddba9ee8a0be1c05c9 Mon Sep 17 00:00:00 2001
From: Lionel ED <lionel.eyraud-dubois@inria.fr>
Date: Tue, 16 May 2017 09:59:48 +0200
Subject: [PATCH] Callback to wake up workers when a prefetch is finished

---
 include/starpu_scheduler.h           |  2 ++
 src/datawizard/coherency.c           | 41 ++++++++++++++++++++++++++++++++++++
 src/sched_policies/component_sched.c | 10 ++++++++-
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/starpu_scheduler.h b/include/starpu_scheduler.h
index a2772ba..15b76c8 100644
--- a/include/starpu_scheduler.h
+++ b/include/starpu_scheduler.h
@@ -81,6 +81,8 @@ int starpu_combined_worker_can_execute_task(unsigned workerid, struct starpu_tas
 int starpu_get_prefetch_flag(void);
 int starpu_prefetch_task_input_on_node_prio(struct starpu_task *task, unsigned node, int prio);
 int starpu_prefetch_task_input_on_node(struct starpu_task *task, unsigned node);
+int starpu_prefetch_task_input_on_node_prio_callback(struct starpu_task *task, unsigned node, int prio, void (*cb_fun) (void*), void* cb_arg);
+int starpu_prefetch_task_input_on_node_callback(struct starpu_task *task, unsigned node, void (*cb_fun) (void*), void* cb_arg);
 int starpu_idle_prefetch_task_input_on_node_prio(struct starpu_task *task, unsigned node, int prio);
 int starpu_idle_prefetch_task_input_on_node(struct starpu_task *task, unsigned node);
 
diff --git a/src/datawizard/coherency.c b/src/datawizard/coherency.c
index 8c5e44c..fbd3034 100644
--- a/src/datawizard/coherency.c
+++ b/src/datawizard/coherency.c
@@ -780,6 +780,11 @@ static int prefetch_data_on_node(starpu_data_handle_t handle, int node, struct _
 	return _starpu_fetch_data_on_node(handle, node, replicate, mode, 1, 1, 1, NULL, NULL, prio, "prefetch_data_on_node");
 }
 
+static int prefetch_data_on_node_callback(starpu_data_handle_t handle, int node, struct _starpu_data_replicate *replicate, enum starpu_data_access_mode mode, int prio, void (*callback_func)(void*), void* callback_arg)
+{
+	return _starpu_fetch_data_on_node(handle, node, replicate, mode, 1, 1, 1, callback_func, callback_arg, prio, "prefetch_data_on_node");
+}
+
 static int fetch_data(starpu_data_handle_t handle, int node, struct _starpu_data_replicate *replicate, enum starpu_data_access_mode mode, int prio)
 {
 	return _starpu_fetch_data_on_node(handle, node, replicate, mode, 0, 0, 0, NULL, NULL, prio, "fetch_data");
@@ -889,6 +894,42 @@ int starpu_prefetch_task_input_on_node(struct starpu_task *task, unsigned node)
 	return starpu_prefetch_task_input_on_node_prio(task, node, prio);
 }
 
+int starpu_prefetch_task_input_on_node_prio_callback(struct starpu_task *task, unsigned node, int prio, 
+						     void (*cb_fun) (void*), void* cb_arg)
+// TODO: find a way to call the cb_fun only once per task, not once per handle. 
+{
+	STARPU_ASSERT(!task->prefetched);
+	unsigned nbuffers = STARPU_TASK_GET_NBUFFERS(task);
+	unsigned index;
+
+	for (index = 0; index < nbuffers; index++)
+	{
+		starpu_data_handle_t handle = STARPU_TASK_GET_HANDLE(task, index);
+		enum starpu_data_access_mode mode = STARPU_TASK_GET_MODE(task, index);
+
+		if (mode & (STARPU_SCRATCH|STARPU_REDUX))
+			continue;
+
+		struct _starpu_data_replicate *replicate = &handle->per_node[node];
+		prefetch_data_on_node_callback(handle, node, replicate, mode, prio, cb_fun, cb_arg);
+		
+		_starpu_set_data_requested_flag_if_needed(handle, replicate);
+	}
+	
+	return 0;
+}
+
+int starpu_prefetch_task_input_on_node_callback(struct starpu_task *task, unsigned node, 
+						void (*cb_fun) (void*), void* cb_arg)
+{
+	int prio = task->priority;
+	if (task->workerorder)
+		prio = INT_MAX - task->workerorder;
+	return starpu_prefetch_task_input_on_node_prio_callback(task, node, prio, cb_fun, cb_arg);
+}
+
+
+
 int starpu_idle_prefetch_task_input_on_node_prio(struct starpu_task *task, unsigned node, int prio)
 {
 	unsigned nbuffers = STARPU_TASK_GET_NBUFFERS(task);
diff --git a/src/sched_policies/component_sched.c b/src/sched_policies/component_sched.c
index 37a1b9a..db42a2f 100644
--- a/src/sched_policies/component_sched.c
+++ b/src/sched_policies/component_sched.c
@@ -146,6 +146,14 @@ double starpu_sched_component_transfer_length(struct starpu_sched_component * co
 	return sum / nworkers;
 }
 
+
+// For now, we forget which task it was about. This means that idle GPU workers will
+// keep on searching for available tasks. But anyway, they are idle...
+void _prefetch_callback(void* parg) {
+	struct starpu_sched_component * component = (struct starpu_sched_component *) parg; 
+	component->can_pull(component); 
+}
+
 /* This function can be called by components when they think that a prefetching request can be submitted.
  * For example, it is currently used by the MCT component to begin the prefetching on accelerators 
  * on which it pushed tasks as soon as possible.
@@ -157,7 +165,7 @@ void starpu_sched_component_prefetch_on_node(struct starpu_sched_component * com
 	{
 		int worker = starpu_bitmap_first(component->workers_in_ctx);
 		unsigned memory_node = starpu_worker_get_memory_node(worker);
-		starpu_prefetch_task_input_on_node(task, memory_node);
+		starpu_prefetch_task_input_on_node_callback(task, memory_node, _prefetch_callback, (void*) component);
 		task->prefetched = 1;
 	}
 }
-- 
1.9.1

From 3eeefbf7615e44fa08c95abf62a47d92a4ee6339 Mon Sep 17 00:00:00 2001
From: Lionel ED <lionel.eyraud-dubois@inria.fr>
Date: Tue, 16 May 2017 10:26:06 +0200
Subject: [PATCH] component_prio: new option 'STARPU_SCHED_OVERLAP_COMMS' to
 allow tasks with available data to bypass higher priority tasks that need to
 wait for data

---
 include/starpu_sched_component.h    |  1 +
 src/sched_policies/component_prio.c | 16 ++++++++++++++--
 src/sched_policies/modular_heft.c   |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/starpu_sched_component.h b/include/starpu_sched_component.h
index 6be6a5b..2e9368c 100644
--- a/include/starpu_sched_component.h
+++ b/include/starpu_sched_component.h
@@ -136,6 +136,7 @@ struct starpu_sched_component_prio_data
 {
 	unsigned ntasks_threshold;
 	double exp_len_threshold;
+	int overlap_comms; 
 };
 struct starpu_sched_component *starpu_sched_component_prio_create(struct starpu_sched_tree *tree, struct starpu_sched_component_prio_data *prio_data) STARPU_ATTRIBUTE_MALLOC;
 int starpu_sched_component_is_prio(struct starpu_sched_component *component);
diff --git a/src/sched_policies/component_prio.c b/src/sched_policies/component_prio.c
index e4a3434..4b5bcfd 100644
--- a/src/sched_policies/component_prio.c
+++ b/src/sched_policies/component_prio.c
@@ -47,6 +47,7 @@ struct _starpu_prio_data
 	starpu_pthread_mutex_t mutex;
 	unsigned ntasks_threshold;
 	double exp_len_threshold;
+	int overlap_comms; 
 };
 
 static void prio_component_deinit_data(struct starpu_sched_component * component)
@@ -185,7 +186,15 @@ static struct starpu_task * prio_pull_task(struct starpu_sched_component * compo
 	starpu_pthread_mutex_t * mutex = &data->mutex;
 	const double now = starpu_timing_now();
 	STARPU_COMPONENT_MUTEX_LOCK(mutex);
-	struct starpu_task * task = _starpu_prio_deque_pop_task(prio);
+	struct starpu_task * task; 
+	if(data->overlap_comms && STARPU_SCHED_COMPONENT_IS_SINGLE_MEMORY_NODE(component)) {
+		unsigned memory_node  = starpu_worker_get_memory_node(starpu_bitmap_first(component->workers_in_ctx));
+		task = _starpu_prio_deque_pop_task_data_available(prio, memory_node, NULL); 
+	        /* If no task has available data, return nothing. This will take the
+		 * worker to sleep, and it will be awoken with the prefetch callback */
+	} else {
+		task = _starpu_prio_deque_pop_task(prio);
+	}
 	if(task)
 	{
 		if(!isnan(task->predicted))
@@ -230,7 +239,8 @@ static struct starpu_task * prio_pull_task(struct starpu_sched_component * compo
 	// When a pop is called, a can_push is called for pushing tasks onto
 	// the empty place of the queue left by the popped task.
 
-	starpu_sched_component_send_can_push_to_parents(component); 
+	if ((prio->ntasks < data->ntasks_threshold) && (prio->exp_len < data->exp_len_threshold))
+		starpu_sched_component_send_can_push_to_parents(component); 
 	
 	if(task)
 		return task;
@@ -284,11 +294,13 @@ struct starpu_sched_component * starpu_sched_component_prio_create(struct starpu
 	{
 		data->ntasks_threshold=params->ntasks_threshold;
 		data->exp_len_threshold=params->exp_len_threshold;
+		data->overlap_comms = params->overlap_comms; 
 	}
 	else
 	{
 		data->ntasks_threshold=0;
 		data->exp_len_threshold=0.0;
+		data->overlap_comms = 0; 
 	}
 
 	return component;
diff --git a/src/sched_policies/modular_heft.c b/src/sched_policies/modular_heft.c
index f41fd6c..2fcea80 100644
--- a/src/sched_policies/modular_heft.c
+++ b/src/sched_policies/modular_heft.c
@@ -93,6 +93,7 @@ static void initialize_heft_center_policy(unsigned sched_ctx_id)
 		{
 			.ntasks_threshold = starpu_get_env_number_default("STARPU_NTASKS_THRESHOLD", _STARPU_SCHED_NTASKS_THRESHOLD_DEFAULT),
 			.exp_len_threshold = starpu_get_env_float_default("STARPU_EXP_LEN_THRESHOLD", _STARPU_SCHED_EXP_LEN_THRESHOLD_DEFAULT),
+			.overlap_comms = starpu_get_env_number_default("STARPU_SCHED_OVERLAP_COMMS", 0), 
 		};
 
 	unsigned i;
-- 
1.9.1

From 8f61702d89ff7cc002dc9008920931b9a644fec4 Mon Sep 17 00:00:00 2001
From: Lionel ED <lionel.eyraud-dubois@inria.fr>
Date: Tue, 16 May 2017 10:00:13 +0200
Subject: [PATCH] correct way to check if a component is a single memory node

---
 src/sched_policies/component_sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/sched_policies/component_sched.c b/src/sched_policies/component_sched.c
index db42a2f..efe09a6 100644
--- a/src/sched_policies/component_sched.c
+++ b/src/sched_policies/component_sched.c
@@ -161,7 +161,7 @@ void _prefetch_callback(void* parg) {
 void starpu_sched_component_prefetch_on_node(struct starpu_sched_component * component, struct starpu_task * task)
 {
 	if (starpu_get_prefetch_flag() && (!task->prefetched)
-		&& (component->properties >= STARPU_SCHED_COMPONENT_SINGLE_MEMORY_NODE))
+		&& STARPU_SCHED_COMPONENT_IS_SINGLE_MEMORY_NODE(component))
 	{
 		int worker = starpu_bitmap_first(component->workers_in_ctx);
 		unsigned memory_node = starpu_worker_get_memory_node(worker);
-- 
1.9.1
