
starpu-devel - Re: [Starpu-devel] SOCL

Subject: Developers list for StarPU


Re: [Starpu-devel] SOCL


  • From: Denis Barthou <denis.barthou@inria.fr>
  • To: Sylvain HENRY <sylvain.henry@inria.fr>
  • Cc: Emmanuel Jeannot <emmanuel.jeannot@labri.fr>, raymond.namyst@inria.fr, starpu-devel@lists.gforge.inria.fr, alexandre.denis@inria.fr
  • Subject: Re: [Starpu-devel] SOCL
  • Date: Wed, 17 Nov 2010 11:53:08 -0600
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

Hi,

I just have questions on the first part so far.

2010/11/14 Sylvain HENRY <sylvain.henry@inria.fr>
Hi Cedric,

On 14/11/2010 10:53, Cédric Augonnet wrote:

Hi Sylvain,

This is not a full answer, but just a few questions to help us understand your issues a little better...


Current implementation: SOCL uses a fake StarPU computation task to schedule data transfers. This fake task uses memcpy or blocking clEnqueueRead/WriteBuffer to copy data. This has several drawbacks:
 - computing devices are considered busy while they are in fact waiting for DMA transfer to complete
 - data transfer may not be optimal (DMA where memcpy could have been used, etc.)
We are currently looking at how to add an efficient starpu_data_cpy function. This function copies the content of one handle into another, and a callback is optionally executed when the transfer is done. We still use a task internally, but we will look at how to do that better in the future.


Sylvain, what are the input/output data (from StarPU's point of view) of your copy task?
 
To me, it would be more natural to perform a transfer by defining a no-op task with inputs and outputs. Its input is the same as its output and corresponds to the data to copy, and the task is pinned to a particular architecture (the task is just a pass-through that forces the data to be transferred/copied to the required architecture). So this issue boils down to knowing how to pin a task to a particular architecture, bypassing StarPU's mapping heuristics. What am I missing? :-)

From a higher perspective, in OpenCL all tasks are statically mapped to a particular device. When the user programs a data transfer to a device, it is because a task mapped to that device is going to use the data later on. In StarPU, tasks are not mapped to a device by the user. So there is no point in executing a data transfer to a particular device if you cannot ensure that the task using this data is executed on that same device, right? To push this a bit further, explicit copies and transfers only exist because OpenCL needs everything to be mapped to a device. One stupid conversion for StarPU would be to ... ignore all explicit copies and transfers, and it should still work... Can you give me a scenario where it doesn't? Or, put another way, how do you make sure in the OpenCL->StarPU conversion that the data you have just transferred will be used by a task executed on the same device?
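
To make the pass-through idea concrete, here is a minimal sketch in plain C. It is a toy model, not StarPU code: toy_handle, toy_run_pinned and the node numbering are all hypothetical names made up for illustration. It assumes a runtime that copies data to a node only when that node has no valid copy, which is the behaviour the argument above relies on.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_NODES 3 /* hypothetical: node 0 is host memory, 1 and 2 are devices */

/* Toy data handle: records which memory nodes hold a valid copy. */
struct toy_handle {
    bool valid[TOY_NUM_NODES];
    int value;     /* stand-in for the buffer contents */
    int transfers; /* number of copies the toy runtime performed */
};

typedef void (*toy_kernel)(struct toy_handle *h);

/* Run a task pinned to `node`. The runtime first makes the data valid there,
 * copying only if no valid copy is present. A NULL kernel is the pass-through
 * task: it does nothing except force that transfer. A non-NULL kernel is
 * assumed to write the data, which invalidates every other copy. */
void toy_run_pinned(struct toy_handle *h, int node, toy_kernel kernel)
{
    if (!h->valid[node]) {
        h->valid[node] = true;
        h->transfers++;
    }
    if (kernel != NULL) {
        kernel(h);
        for (int n = 0; n < TOY_NUM_NODES; n++)
            h->valid[n] = (n == node);
    }
}

/* Example kernel: increments the value in place. */
void toy_increment(struct toy_handle *h)
{
    h->value++;
}
```

In this model, calling toy_run_pinned(&h, 1, NULL) before toy_run_pinned(&h, 1, toy_increment) costs one transfer, exactly as if the explicit copy had been skipped. If the compute task lands on node 2 instead, the explicit copy to node 1 is a wasted transfer, which is the mismatch the question above is about.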
 
 - StarPU's execution/transfer traces are wrong
Report this issue then ... do you have a specific example?
I meant that when we use computational tasks to perform data transfers, StarPU has no way to detect this (obviously), thus traces are quite wrong (missing data transfers).
I mention this just to explain why the current SOCL implementation is bad.


2) Buffer map/unmap
OpenCL can schedule buffer mapping/unmapping anytime in the command graph.
StarPU can only map/unmap synchronously (i.e. the mapping command cannot have dependencies on other commands).

Current implementation: SOCL uses a fake StarPU CPU computation task to schedule the map/unmap commands. The fake task has the buffer to map as an input, which forces the data transfer to host memory and makes starpu_data_acquire (the map command) non-blocking.
Isn't that what the starpu_data_acquire function does? (There is an asynchronous function that does similar things, starpu_data_acquire_cb.) If not, how does it differ from your needs?
No. starpu_data_acquire_cb is a kind of predefined StarPU task:
 - restricted to CPU worker
 - with only implicit data dependencies
 - detached

What we need is:
 - no callback. Host code will test or wait on the returned event before executing something
 - explicit dependencies (depends on some events)
 - "not detached" (returns an event)

We need the same thing for "unmap".
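
The three requirements above can be sketched as a small event model in plain C. This is only an illustration of the semantics being asked for, not an existing StarPU or SOCL interface: toy_event, toy_enqueue_map and the other names are hypothetical, and the mapping work itself is assumed to be instantaneous once its dependencies are met.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEPS 4

/* Toy event. A plain event completes only when signalled explicitly.
 * A derived event completes once all of its dependencies are complete. */
struct toy_event {
    bool complete;
    bool derived;
    size_t num_deps;
    struct toy_event *deps[TOY_MAX_DEPS];
};

void toy_event_init(struct toy_event *ev)
{
    ev->complete = false;
    ev->derived = false;
    ev->num_deps = 0;
}

void toy_event_signal(struct toy_event *ev)
{
    ev->complete = true;
}

/* Hypothetical non-blocking map command: it records explicit dependencies,
 * takes no callback, and hands back an event through `out`.
 * Returns 0 on success, -1 if there are too many dependencies. */
int toy_enqueue_map(size_t num_deps, struct toy_event *const *deps,
                    struct toy_event *out)
{
    if (num_deps > TOY_MAX_DEPS)
        return -1;
    out->complete = false;
    out->derived = true;
    out->num_deps = num_deps;
    for (size_t i = 0; i < num_deps; i++)
        out->deps[i] = deps[i];
    return 0;
}

/* Non-blocking test that host code can poll. */
bool toy_event_test(struct toy_event *ev)
{
    if (ev->complete)
        return true;
    if (!ev->derived)
        return false;
    for (size_t i = 0; i < ev->num_deps; i++)
        if (!toy_event_test(ev->deps[i]))
            return false;
    ev->complete = true;
    return true;
}
```

The host enqueues the map with the events it depends on, keeps the returned event, and polls toy_event_test on it. An unmap command would have the same shape.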

Kernel compilation & scheduling
OpenCL kernels may not be portable.
StarPU assumes that every OpenCL kernel can be executed on any OpenCL device.

Current implementation: we suppose that every OpenCL kernel can be executed on every OpenCL device, even if it's wrong.
We are looking at how to change the codelet interface so that we can express more constraints than with the current "where" field. For instance we would certainly add constraints on the available memory, or the availability of double precision, or specify that the task may only execute on a subset of workers.
Good news :)
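
As a rough illustration of what such per-codelet constraints could look like, here is a sketch in plain C covering the three examples mentioned (available memory, double precision, subset of workers). The struct layouts and the toy_worker_eligible function are assumptions invented for this example. They are not the StarPU codelet interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a worker's capabilities. */
struct toy_worker {
    size_t free_mem; /* bytes available on the worker's memory node */
    bool has_fp64;   /* double precision supported */
};

/* Hypothetical per-codelet constraints. */
struct toy_constraints {
    size_t min_mem;       /* minimum free memory required, in bytes */
    bool needs_fp64;      /* kernel uses double precision */
    uint32_t worker_mask; /* bit i set: worker i is allowed */
};

/* Returns true if worker `id` satisfies every constraint of the codelet. */
bool toy_worker_eligible(const struct toy_constraints *c,
                         const struct toy_worker *w, unsigned id)
{
    if (id >= 32 || !(c->worker_mask & (UINT32_C(1) << id)))
        return false;
    if (w->free_mem < c->min_mem)
        return false;
    if (c->needs_fp64 && !w->has_fp64)
        return false;
    return true;
}
```

A scheduler would call such a predicate for each candidate worker before applying its usual mapping heuristics.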

[..]

What is left to do:

1) Manage data transfers from/to host memory and between buffers
Scheduling of these commands is easily done with events/triggers. However performing data transfers correctly is still bogus.
This implies modifications in the way StarPU manages data requests and data transfers. (cf "DataWizard" code)
We could implement some starpu_data_cpy_{to,from}_interface functions which take a handle and copy its content into or from an interface provided by the application (one that is not attached to a handle). Would something like that be useful to you?

int foo[1024];
struct vector_interface vector = {.nx = 1024, .ptr = foo, .elemsize = sizeof(int)};
starpu_data_cpy_to_interface(handle, &vector);
That would improve the current situation a little bit.
Currently we:
 - Create a new data: pouet = starpu_data_variable_register(foo, 1024...)
 - Schedule one StarPU task with available kernels for CPU and OpenCL.
 -- This task has "pouet" and the original buffer as input/output
 -- CPU kernel performs memcpy
 -- OpenCL kernel performs clEnqueueRead/WriteBuffer

With the new API we could:
 - Create a dummy task T that does nothing
 - Use starpu_data_cpy_interface in its callback to transfer data (blocking problem... that would lead you to starpu_data_cpy_to_interface_cb)
 - Execute another dummy task (associated to the returned event of this data transfer) to signal dependencies that this one completed

Ideally, I'd like starpu_data_cpy_to_interface(handle, interface, int num_events, event*, event*)



2) Kernel compilation and execution

2.1) OpenCL compilers can be slow (NVIDIA...). We may choose not to compile every kernel for every device (which is what is currently done).

2.2) Some kernels may not be executed on some devices.
Short term solution: we may try to compile kernels on devices. If compilation fails, we need a way to exclude failing devices from the list of devices on which the kernel may be scheduled.

That would certainly be part of the per-codelet constraints that I mentioned earlier.
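
The short-term solution in 2.2 can be sketched as follows, in plain C. Everything here is a hypothetical stand-in: the toy_compile_fn callback plays the role of a real per-device build (with OpenCL, that would be a clBuildProgram call for one device, which returns an error code on failure), and toy_fake_compile is a fake compiler used only to make the example self-contained.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-device compile step: returns true if the build succeeds. */
typedef bool (*toy_compile_fn)(unsigned device_id, const char *source);

/* Try to compile `source` on each device and return a bitmask of the
 * devices where it succeeded (bit d set: device d may run the kernel). */
uint32_t toy_build_device_mask(unsigned num_devices, const char *source,
                               toy_compile_fn compile)
{
    uint32_t mask = 0;
    for (unsigned d = 0; d < num_devices && d < 32; d++)
        if (compile(d, source))
            mask |= UINT32_C(1) << d;
    return mask;
}

/* Fake compiler for illustration: device 1 is assumed to lack double
 * precision, so any source mentioning "double" fails to build there. */
bool toy_fake_compile(unsigned device_id, const char *source)
{
    if (device_id == 1 && strstr(source, "double") != NULL)
        return false;
    return true;
}
```

The resulting mask could then feed a per-codelet worker restriction, so that the scheduler never picks a device on which the build failed.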
3.4) Thread-safety: check StarPU thread-safety.
You mean: check OpenCL thread safety :) We should detect what is not thread-safe in OpenCL and add locks around these methods in the OpenCL driver (we did that for the init phase already).
As I mentioned on IRC, it's not that easy. StarPU codelets with OpenCL support contain OpenCL code that StarPU is not aware of. For OpenCL 1.0 thread safety you would have to:
1) execute StarPU OpenCL codelets one at a time, even on different devices (coarse-grained)
or
2) alias OpenCL API that is allowed in StarPU's OpenCL codelets and force users to use it (fine-grained)
or
3) force StarPU codelet declaration in a totally declarative way
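
For reference, option 1 (coarse-grained) amounts to something like the sketch below, in C with POSIX threads. The names toy_run_opencl_codelet and toy_opencl_lock are made up for this example. It only shows the locking pattern and says nothing about how StarPU's OpenCL driver is actually structured.

```c
#include <pthread.h>

/* One global lock shared by all OpenCL workers, whatever their device. */
static pthread_mutex_t toy_opencl_lock = PTHREAD_MUTEX_INITIALIZER;

typedef void (*toy_codelet_fn)(void *arg);

/* Run a user-supplied OpenCL codelet while holding the global lock, so that
 * at most one codelet issues OpenCL calls at any time. */
void toy_run_opencl_codelet(toy_codelet_fn fn, void *arg)
{
    pthread_mutex_lock(&toy_opencl_lock);
    fn(arg);
    pthread_mutex_unlock(&toy_opencl_lock);
}

/* Stand-in codelet body: increments the int pointed to by `arg`. */
void toy_count_call(void *arg)
{
    ++*(int *)arg;
}
```

The cost is obvious: codelets on different devices are serialized for their whole duration, which is why the finer-grained options 2 and 3 are listed at all.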

Cheers,
Sylvain

Best,
Cédric



--Denis


