
starpu-devel - Re: [Starpu-devel] SOCL

  • From: Sylvain HENRY <sylvain.henry@inria.fr>
  • To: Denis Barthou <denis.barthou@inria.fr>
  • Cc: Emmanuel Jeannot <emmanuel.jeannot@labri.fr>, raymond.namyst@inria.fr, starpu-devel@lists.gforge.inria.fr, alexandre.denis@inria.fr
  • Subject: Re: [Starpu-devel] SOCL
  • Date: Thu, 18 Nov 2010 18:45:57 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
  • List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>

On 18/11/2010 16:54, Denis Barthou wrote:
Hi,

On Wed, Nov 17, 2010 at 1:59 PM, Sylvain HENRY <sylvain.henry@inria.fr> wrote:
On 17/11/2010 18:53, Denis Barthou wrote:
Sylvain, what are the input/output data (from StarPU point of view) of your copy task ?
From the OpenCL spec:

clEnqueueReadBuffer(cl_command_queue    /* command_queue */,
                    cl_mem              /* buffer */,
                    cl_bool             /* blocking_read */,
                    size_t              /* offset */,
                    size_t              /* cb */,
                    void *              /* ptr */,
                    cl_uint             /* num_events_in_wait_list */,
                    const cl_event *    /* event_wait_list */,
                    cl_event *          /* event */)

We implement this as follows:
1) We retrieve the StarPU data associated with "buffer" (source data): B = buffer->handle


Can you have multiple handles to the same buffer in StarPU (in particular, with different offsets and sizes) ?
We can use StarPU "filters" to get a new handle on some sub-buffer. I should use this here to avoid some superfluous data transfers.
 
2) We store the other parameters (offset, cb and ptr) in a struct P
3) B is used as input data of the copy task; P is passed to the task as well, as non-managed data
4) The copy task is submitted with appropriate dependencies and performs a memcpy or clEnqueueReadBuffer.


 
To me, it would be more natural to perform a transfer by defining a no-op task with inputs and outputs. Its input is the same as its output and corresponds to the data to copy, and the task is pinned to a particular architecture (the task is just a pass-through, forcing data to be transferred/copied to the necessary architecture). So this issue boils down to knowing how to pin a particular task to a particular architecture, bypassing StarPU's mapping heuristics. What am I missing ? :-)
It's true that if we pin a task to a device, we can ensure that its input data is present in device memory when the task is scheduled. However, suppose we pin the task to the CPU to implement clEnqueueReadBuffer: the problem is that the data won't be stored by StarPU at the specified "ptr" address. Moreover, we also need to take "offset" into account.
Implementing clEnqueueWriteBuffer (copy N bytes from Ptr (host memory) into Buffer at Offset) with this solution would be even harder.



Ok, right. Offset and sizes are probably a pain.
 

From a higher perspective, in OpenCL all tasks are statically mapped to a particular device. When the user programs a data transfer to a device, it's because later on a task mapped to that device is going to use the data. In StarPU, tasks are not mapped by the user to a device. So there's no point in executing data transfers to a particular device if you cannot ensure that the task using this data is executed on this same device, right ?
To push a bit further, explicit copies and transfers are only due to OpenCL's need to map everything to a device. One stupid conversion for StarPU would be to ... ignore all explicit copies and transfers, and it should work... Can you give me a scenario where it doesn't ?? Or, formulated another way, how do you make sure in the OpenCL->StarPU conversion that the data you've just transferred will be used by a task executed on the same device ?

You need to remember that StarPU is seen as a single device. clCreateBuffer allocates memory in StarPU's virtual device memory, and clEnqueueRead/WriteBuffer are used to copy between host memory and StarPU's virtual device memory.


Just to be sure we are talking about the same OpenCL input: in the OpenCL code, you consider that you have only one device (StarPU), right ?
Yes
In any case, this is the goal since StarPU will be in charge of using the multiple devices at hand and we don't want the user to map its tasks to individual devices.
 
A buffer in StarPU's virtual memory is in fact a set of buffer instances (using my own terminology). There is at most one buffer instance per device per buffer. Each buffer instance can be in a different state (kind of MESI protocol).

We don't want to transfer data to a particular physical device. However we need one instance (on any device) of the source/destination buffer to read from/write to. What we need is a way to select this buffer instance. Currently it is left to the *task* scheduler to select it (i.e. our copy task takes source or destination buffer as input or output).


Yes, I understand better the issue now. So, the copy code, assuming we write one explicitly, requires that you choose a priori the target device ?
Yes.
In such a case, there is no guarantee that this target will be the same as the device chosen for the tasks depending on this copy, right ?
It would be up to StarPU to detect tasks (if any) that already depend on this copy task. This might help StarPU choose the best target buffer instance and improve the scheduling of dependent tasks.
You cannot write a copy code that depends on where StarPU will map it ?
We don't even know if there will be a task depending on this data copy.
StarPU chooses where to schedule a task depending on the buffer instances available on each device. To select where to copy, maybe we could give each device a mark depending on the number of tasks already requiring this buffer (if any) and on the buffers already available on the device for those tasks.

The following could be an algorithm to select the target device:

Suppose:
 * B is the buffer that we want to write data into;
 * WaitingTasks is a list of tasks waiting for some buffers;
 * T.Buffers is the list of buffers required by task T;
 * Devices is the list of devices (obviously);
 * D.Buffers is the list of buffers present on device D.

val best_dev_per_T = for (T <- WaitingTasks if T.Buffers.contains(B)) yield {

    //For each device D, count the buffers required by T but not present in its memory
    //Result is a list of (device, count) pairs sorted by increasing count
    val dev_bufcount = (for (D <- Devices if D.canExecute(T)) yield {
        val missing = T.Buffers.count(TB => !(D.Buffers contains TB))
        (D, missing)
    }).sortBy(_._2)

    //We take the devices that have the best (lowest) score
    val best_score = dev_bufcount.head._2
    val best_devices = dev_bufcount.takeWhile(_._2 == best_score)

    //We return the best devices for T
    best_devices.map(_._1)
}

//Now we want the overall best devices sorted by decreasing score
//Score = number of tasks for which D is one of the best devices
val best_devices_sorted = best_dev_per_T.flatten
    .foldLeft(Map.empty[Device, Int])((hash, dev) => hash.updated(dev, 1 + hash.getOrElse(dev, 0)))
    .toList.sortBy(-_._2)
val best_devices = best_devices_sorted.takeWhile(_._2 == best_devices_sorted.head._2).map(_._1)

//We can select one device among the best devices
val best_device = best_devices.head

I just don't want to be the one that will translate this Scala code into C code. :-) Moreover, I'm not sure it will be efficient if we have too many waiting tasks. And we may want to take into account the space left available on each device...

If it were possible, you could just trust StarPU to take into account the strong affinity between the copy and the tasks that depend on it, so that both would use the same buffer, no ?
Data affinity may be sufficient for this: the only device containing a buffer instance with a valid state is the one in which the transfer has been performed.


--Denis


Using pinning, we could choose this buffer instance, but we don't have as much information to make a good choice as StarPU has (topology, current buffer instance states, etc.).

Cheers
Sylvain

 
 - StarPU's execution/transfer traces are wrong
Report this issue then ... do you have a specific example?
I meant that when we use computational tasks to perform data transfers, StarPU has no way to detect this (obviously), thus traces are quite wrong (missing data transfers).
I mention this just to explain why the current SOCL implementation is bad.


2) Buffer map/unmap
OpenCL can schedule buffer mapping/unmapping anytime in the command graph.
StarPU can only map/unmap synchronously (i.e. the mapping command cannot have dependencies on other commands).

Current implementation: SOCL uses a fake StarPU CPU computation task to schedule the map/unmap commands. The fake task has the buffer to map as an input, to force the data transfer to host memory and to make starpu_data_acquire (map command) non-blocking.
Isn't that what the starpu_data_acquire function does ? (you have an asynchronous function that does similar things, starpu_data_acquire_cb). If not, how does it differ from your needs ?
No. starpu_data_acquire_cb is a kind of predefined StarPU task:
 - restricted to CPU worker
 - with only implicit data dependencies
 - detached

What we need is:
 - no callback. Host code will test/wait returned event to execute something
 - explicit dependencies (depends on some events)
 - "not detached" (returns an event)

We need the same thing for "unmap".

Kernel compilation & scheduling
OpenCL kernels may not be portable.
StarPU assumes that every OpenCL kernel can be executed on any OpenCL device.

Current implementation: we suppose that every OpenCL kernel can be executed on every OpenCL device, even if it's wrong.
We are looking at how to change the codelet interface so that we can express more constraints than with the current "where" field. For instance, we would certainly add constraints on the available memory or the availability of double precision, or specify that the task may only execute on a subset of workers.
Good news :)

[..]

What is left to do:

1) Manage data transfers from/to host memory and between buffers
Scheduling these commands is easily done with events/triggers. However, performing the data transfers correctly is still problematic.
This implies modifications in the way StarPU manages data requests and data transfers. (cf "DataWizard" code)
We could implement some starpu_data_cpy_{to,from}_interface functions which take a handle and copy its content into or from an interface provided by the application (that is not attached to a handle). Would something like that be useful to you ?

int foo[1024];
struct vector_interface vector = {.nx = 1024, .ptr = foo, .elemsize = sizeof(int)};
starpu_data_cpy_to_interface(handle, &vector);
That would improve the current situation a little bit.
Currently we:
 - Create a new data: pouet = starpu_data_variable_register(foo, 1024...)
 - Schedule one StarPU task with available kernels for CPU and OpenCL.
 -- This task has "pouet" and the original buffer as input/output
 -- CPU kernel performs memcpy
 -- OpenCL kernel performs clEnqueueRead/WriteBuffer

With the new API we could:
 - Create a dummy task T that does nothing
 - Use starpu_data_cpy_to_interface in its callback to transfer data (blocking problem... that would lead you to starpu_data_cpy_to_interface_cb)
 - Execute another dummy task (associated to the returned event of this data transfer) to signal dependencies that this one completed

Ideally, I'd like starpu_data_cpy_to_interface(handle, interface, int num_events, event*, event*)



2) Kernel compilation and execution

2.1) OpenCL compilers can be slow (NVIDIA...). We may choose not to compile every kernel for every device (what is currently done).

2.2) Some kernels may not be executed on some devices.
Short term solution: we may try to compile kernels on devices. If compilation fails, we need a way to exclude failing devices from the list of devices on which the kernel may be scheduled.

That would certainly be part of the per-codelet constraints that I mentioned earlier.
3.4) Thread-safety: check StarPU thread-safety.
You mean: check OpenCL thread-safety :) We should detect what is not thread-safe in OpenCL and add locks around these methods in the OpenCL driver (we did that for the init phase already).
As I mentioned on IRC, it's not that easy. StarPU codelets with OpenCL support contain OpenCL code that StarPU is not aware of. For OpenCL 1.0 thread safety you would have to:
1) execute StarPU OpenCL codelets one at a time, even on different devices (coarse-grained)
or
2) alias OpenCL API that is allowed in StarPU's OpenCL codelets and force users to use it (fine-grained)
or
3) force StarPU codelet declaration in a totally declarative way

Cheers,
Sylvain

Best,
Cédric



--Denis





