- From: Sylvain HENRY <sylvain.henry@inria.fr>
- To: Denis Barthou <denis.barthou@inria.fr>
- Cc: Emmanuel Jeannot <emmanuel.jeannot@labri.fr>, raymond.namyst@inria.fr, starpu-devel@lists.gforge.inria.fr, alexandre.denis@inria.fr
- Subject: Re: [Starpu-devel] SOCL
- Date: Thu, 18 Nov 2010 18:45:57 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
On 18/11/2010 16:54, Denis Barthou wrote:

Hi,

On Wed, Nov 17, 2010 at 1:59 PM, Sylvain HENRY <sylvain.henry@inria.fr> wrote:
We can use StarPU "filters" to get a new handle on some sub-buffer.
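A minimal sketch of such a filter, assuming the partitioning API as documented for recent StarPU releases (function and filter names may differ in the 2010 codebase, so treat this as illustrative only):

```c
/* Partition a vector handle into two sub-buffers and get a handle on
 * the second half.  Illustrative sketch of the StarPU filter API. */
struct starpu_data_filter f = {
    .filter_func = starpu_vector_filter_block,  /* split vector into equal blocks */
    .nchildren = 2,
};
starpu_data_partition(buffer_handle, &f);

/* One level of partitioning, child index 1: the second half. */
starpu_data_handle_t sub = starpu_data_get_sub_data(buffer_handle, 1, 1);
```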
I should use this here to avoid some superfluous data transfers.

On 17/11/2010 18:53, Denis Barthou wrote:
From the OpenCL spec:

    cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                               cl_mem           buffer,
                               cl_bool          blocking_read,
                               size_t           offset,
                               size_t           cb,
                               void            *ptr,
                               cl_uint          num_events_in_wait_list,
                               const cl_event  *event_wait_list,
                               cl_event        *event)

We implement this as follows:
1) We retrieve the StarPU data associated to "buffer" (source data): B = buffer->handle

Can you have multiple handles to the same buffer in StarPU (in particular, with different offsets and sizes)?

2) We store the other parameters (offset, cb and ptr) in a struct P
3) B is used as input data of the copy task; P is passed to the task too as non-managed data
4) The copy task is submitted with appropriate dependencies and performs memcpy or clEnqueueReadBuffer.
Implementing clEnqueueWriteBuffer (copy N bytes from Ptr (host memory) into Buffer at Offset) with this solution would be even harder.

Ok, right. Offsets and sizes are probably a pain. Just to be sure we are talking about the same OpenCL input: in the OpenCL code, you consider you have only one (StarPU) device, right?

In any case, this is the goal, since StarPU will be in charge of using the multiple devices at hand and we don't want the user to map its tasks to individual devices.
A buffer in StarPU's
virtual memory is in fact a set of buffer instances (using
my own terminology). There is at most one buffer instance
per device per buffer. Each buffer instance can be in a
different state (kind of MESI protocol).
We don't want to transfer data to a particular physical device. However, we need one instance (on any device) of the source/destination buffer to read from/write to. What we need is a way to select this buffer instance. Currently it is left to the *task* scheduler to select it (i.e. our copy task takes the source or destination buffer as input or output).

Yes, I understand the issue better now. So, the copy code, assuming we write one explicitly, requires that you choose a priori the target device? In such a case, there is no guarantee that this target will be the same as the device chosen for the tasks depending on this copy, right? You cannot write a copy code that depends on where StarPU will map it?
StarPU chooses where to schedule a task depending on the available buffer instances on each device. To select where to copy, maybe we could give a mark to each device depending on the number of tasks already requiring this buffer (if any) and depending on the buffers already available for this task on the device. The following could be an algorithm to select the target device. Suppose:
- B is the buffer that we want to write data into;
- WaitingTasks is a list of tasks waiting for some buffers;
- T.Buffers is the list of buffers required by task T;
- Devices is the list of devices (obviously);
- D.Buffers is the list of buffers present on device D.

    val best_dev_per_T = for (T <- WaitingTasks if T.Buffers.contains(B)) yield {
      // For each device D, count the buffers required by T but not present in its memory.
      // Result is a list of (device, count) pairs sorted by increasing count.
      val dev_bufcount = (for (D <- Devices if D.canExecute(T)) yield {
        val s = for (TB <- T.Buffers) yield { if (D.Buffers contains TB) 0 else 1 }
        (D, s.sum)
      }).sortBy(_._2)

      // We take the devices that have the best (lowest) score
      val best_score = dev_bufcount.head._2
      val best_devices = dev_bufcount.takeWhile(_._2 == best_score)

      // We return the best devices for T
      best_devices.map(_._1)
    }

    // Now we want the overall best devices, sorted by score.
    // Score = number of tasks for which D is one of the best devices.
    val best_devices_sorted = best_dev_per_T.flatten
      .foldLeft(HashMap[Device, Int]())((hash, dev) =>
        hash.updated(dev, 1 + hash.getOrElse(dev, 0)))
      .toList.sortBy(-_._2)
    val best_devices = best_devices_sorted
      .takeWhile(_._2 == best_devices_sorted.head._2).map(_._1)

    // We can select one device among the best devices
    val best_device = best_devices.head

I just don't want to be the one that will translate this Scala code into C code. :-) Moreover, I'm not sure it will be efficient if we have too many tasks waiting. And we may want to take into account the available space left on devices...
If it were possible, you could just trust StarPU to take into account the strong affinity between the copy and the task that depends on it, so that both tasks would use the same buffer, no?
--Denis

Using pinning, we could choose this buffer instance, but we don't have as much information to make a good choice as StarPU has (topology, current buffer instance states, etc.).

Cheers
Sylvain

Report this issue then... do you have a specific example? I mention this just to explain why the current SOCL implementation is bad.

2) Buffer map/unmap

OpenCL can schedule buffer mapping/unmapping anytime in the command graph. StarPU can only map/unmap synchronously (i.e. the mapping command cannot have dependencies on other commands).

Current implementation: SOCL uses a fake StarPU CPU computation task to schedule the map/unmap commands. The fake task has the buffer to map as an input, to force the data transfer into host memory and to make starpu_data_acquire (map command) non-blocking.

Isn't that what the starpu_data_acquire function does? (you have an asynchronous function that does similar things, starpu_data_acquire_cb). If not, how does it differ from your needs?

The fake task is:
- restricted to a CPU worker
- with only implicit data dependencies
- detached

What we need is:
- no callback. Host code will test/wait on the returned event to execute something
- explicit dependencies (depends on some events)
- "not detached" (returns an event)

We need the same thing for "unmap".

Kernel compilation & scheduling

OpenCL kernels may not be portable. StarPU assumes that every OpenCL kernel can be executed on any OpenCL device. Current implementation: we suppose that every OpenCL kernel can be executed on every OpenCL device, even if it's wrong.

We are looking at how to change the codelet interface so that we can express more constraints than with the current "where" field. For instance we would certainly add constraints on the available memory, or the availability of double precision, or specify that the task may only execute on a subset of workers.

[..]

What is left to do:

1) Manage data transfers from/to host memory and between buffers

Scheduling of these commands is easily done with events/triggers.
However, performing data transfers correctly is still bogus. This implies modifications in the way StarPU manages data requests and data transfers (cf. the "DataWizard" code).

We could implement some starpu_data_cpy_{to,from}_interface functions which take a handle and copy its content into or from an interface provided by the application (that is not attached to a handle). Would something like that be useful to you?

    int foo[1024];
    struct vector_interface vector = {.nx = 1024, .ptr = foo, .elemsize = sizeof(int)};
    starpu_data_cpy_to_interface(handle, &vector);

Currently we:
- Create a new data: pouet = starpu_data_variable_register(foo, 1024...)
- Schedule one StarPU task with available kernels for CPU and OpenCL.
-- This task has "pouet" and the original buffer as input/output
-- The CPU kernel performs memcpy
-- The OpenCL kernel performs clEnqueueRead/WriteBuffer

With the new API we could:
- Create a dummy task T that does nothing
- Use starpu_data_cpy_to_interface in its callback to transfer data (blocking problem... that would lead you to starpu_data_cpy_to_interface_cb)
- Execute another dummy task (associated to the returned event of this data transfer) to signal to dependencies that this one completed

Ideally, I'd like: starpu_data_cpy_to_interface(handle, interface, int num_events, event*, event*)

2) Kernel compilation and execution

2.1) OpenCL compilers can be slow (NVIDIA...). We may choose not to compile every kernel for every device (which is currently done).

2.2) Some kernels may not be executed on some devices. Short-term solution: we may try to compile kernels on devices. If compilation fails, we need a way to exclude the failing devices from the list of devices on which the kernel may be scheduled.

That would certainly be part of the per-codelet constraints that I mentioned earlier.

3.4) Thread-safety: check StarPU thread-safety.
You mean: check OpenCL thread-safety :) We should detect what is not thread-safe in OpenCL and add locks around these methods in the OpenCL driver (we did that for the init phase already). Either:
1) execute StarPU OpenCL codelets one at a time, even on different devices (coarse-grained), or
2) alias the OpenCL API that is allowed in StarPU's OpenCL codelets and force users to use it (fine-grained), or
3) force StarPU codelet declaration in a totally declarative way.

Cheers,
Sylvain

Best,
Cédric

--Denis