Subject: Developers list for StarPU
- From: Sylvain HENRY <sylvain.henry@inria.fr>
- To: starpu-devel@lists.gforge.inria.fr, alexandre.denis@inria.fr, Denis Barthou <denis.barthou@inria.fr>, raymond.namyst@inria.fr, Emmanuel Jeannot <emmanuel.jeannot@labri.fr>
- Subject: [Starpu-devel] SOCL
- Date: Tue, 09 Nov 2010 14:39:43 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi all,
Necessary background: SOCL ( https://gforge.inria.fr/projects/socl ) is a kind of front-end for StarPU. It implements the OpenCL API on top of StarPU. The advantages are:
- Applications that use OpenCL can be ported to StarPU without any development effort
- It gives StarPU a recognized standard interface
- SOCL applications can easily fall back to other OpenCL implementations
- We can easily compare the performance of both implementations
=========
Currently there is a working implementation, but it performs poorly. Here are the reasons why and how it could be fixed.
1) Data transfers
OpenCL not only schedules computation tasks (kernels) but also data transfers (from/to host memory, between buffers...).
StarPU can only schedule computation tasks.
1.1) Host memory <--> device memory
OpenCL allows buffer data to be transferred to any location in host memory at any point in the command graph.
StarPU automatically manages data transfers. Each buffer has only one assigned location in host memory.
Current implementation: SOCL uses a fake StarPU computation task to schedule data transfers. This fake task uses memcpy or blocking clEnqueueRead/WriteBuffer to copy data. This has several drawbacks:
- computing devices are considered busy while they are in fact waiting for DMA transfer to complete
- data transfer may not be optimal (DMA where memcpy could have been used, etc.)
- StarPU's execution/transfer traces are wrong
1.2) Buffer to buffer transfers
OpenCL allows data copy from one buffer to another
StarPU doesn't provide this functionality.
Current implementation: SOCL uses a fake StarPU computation task to schedule and perform this data transfer with the same drawbacks as in 1.1.
2) Buffer map/unmap
OpenCL can schedule buffer mapping/unmapping at any point in the command graph.
StarPU can only map/unmap synchronously (i.e. the mapping command cannot have dependencies on other commands).
Current implementation: SOCL uses a fake StarPU CPU computation task to schedule the map/unmap commands. The fake task takes the buffer to map as an input, which forces the data to be transferred to host memory and makes starpu_data_acquire (the map command) non-blocking.
3) Kernel compilation & scheduling
OpenCL kernels may not be portable.
StarPU assumes that every OpenCL kernel can be executed on any OpenCL device.
Current implementation: we assume that every OpenCL kernel can be executed on every OpenCL device, even though this is not always true.
============
Alongside the SOCL development branch, a development version of StarPU has been modified as follows:
1) The task graph has been replaced by a command graph. Each scheduled command returns an "event" object that allows other commands to depend on its completion.
2) Executing a StarPU task is now one of these commands:
int starpu_task_submit(struct starpu_task *task, starpu_event *event);
int starpu_task_submit_ex(struct starpu_task *task, int num_events, starpu_event *events, starpu_event *event);
If "event" is not NULL, other commands can depend on it (i.e. put it in the "events" parameter).
3) "Events" are used in conjunction with "triggers".
Event:
- Two states: not complete, complete
- Allows synchronous wait for completion
- Allows test for completion (non blocking)
- Managed with reference counting (automatically freed)
- Allows multiple "trigger" registrations
- Signal to triggers on completion
- Implemented with one mutex and one condition
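As an illustration of the event properties listed above, here is a minimal sketch in C using POSIX threads. It is not the code from the StarPU branch: the type and function names (sketch_event, sketch_event_signal, etc.) are hypothetical, and reference counting and trigger registration are deliberately left out to keep it short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event type: names are illustrative, not StarPU's. */
typedef struct {
    pthread_mutex_t lock;     /* the one mutex */
    pthread_cond_t  done;     /* the one condition */
    bool            complete; /* two states: not complete / complete */
} sketch_event;

static void sketch_event_init(sketch_event *ev) {
    assert(ev != NULL);
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->done, NULL);
    ev->complete = false;
}

/* Mark the event complete and wake up every blocked waiter. */
static void sketch_event_signal(sketch_event *ev) {
    pthread_mutex_lock(&ev->lock);
    ev->complete = true;
    pthread_cond_broadcast(&ev->done);
    pthread_mutex_unlock(&ev->lock);
}

/* Non-blocking test for completion. */
static bool sketch_event_test(sketch_event *ev) {
    pthread_mutex_lock(&ev->lock);
    bool complete = ev->complete;
    pthread_mutex_unlock(&ev->lock);
    return complete;
}

/* Synchronous wait: blocks until the event is complete. */
static void sketch_event_wait(sketch_event *ev) {
    pthread_mutex_lock(&ev->lock);
    while (!ev->complete)  /* loop guards against spurious wakeups */
        pthread_cond_wait(&ev->done, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}
```

A command that finishes would call sketch_event_signal once; any thread may then call sketch_event_wait or poll with sketch_event_test.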
Trigger:
- Callback that depends on multiple events
- Registered to each event it depends on
- Once signaled by every event, callback is executed
- Implemented with a counter updated atomically with "__sync_fetch_and_add" and "__sync_sub_and_fetch"
- Initially the counter is set to 1. Registering the trigger with an event increments the counter, and each signal decrements it. Enabling the trigger also decrements the counter and forbids any further event registration. When the counter reaches 0, the callback is executed.
These two simple constructs allow easy scheduling of the various commands (barrier, task execution, data transfer, user event commands, marker, etc.).
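The trigger counter protocol can be sketched as follows. This is only an illustration under assumptions, not the code from the branch: all names (sketch_trigger, sketch_trigger_register, etc.) are hypothetical, and the sketch assumes that registering the trigger with an event increments the counter. The "__sync_*" builtins are the GCC atomics named above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical trigger type: names are illustrative, not StarPU's. */
typedef struct {
    int   counter;             /* pending signals + 1 until enabled */
    bool  enabled;             /* true once registration is closed */
    void (*callback)(void *);
    void *arg;
} sketch_trigger;

static void sketch_trigger_init(sketch_trigger *t,
                                void (*callback)(void *), void *arg) {
    t->counter  = 1;           /* the initial 1 is released by "enable" */
    t->enabled  = false;
    t->callback = callback;
    t->arg      = arg;
}

/* Called once per event the trigger depends on (assumed to increment). */
static void sketch_trigger_register(sketch_trigger *t) {
    assert(!t->enabled);       /* no registration after enabling */
    __sync_fetch_and_add(&t->counter, 1);
}

/* Decrement; whoever brings the counter to 0 runs the callback. */
static void sketch_trigger_release(sketch_trigger *t) {
    if (__sync_sub_and_fetch(&t->counter, 1) == 0)
        t->callback(t->arg);
}

/* Called by an event when it completes. */
static void sketch_trigger_signal(sketch_trigger *t) {
    sketch_trigger_release(t);
}

/* Close registration and drop the initial count. */
static void sketch_trigger_enable(sketch_trigger *t) {
    t->enabled = true;
    sketch_trigger_release(t);
}

/* Example callback for experiments: counts how many times it ran. */
static void sketch_count_call(void *arg) {
    ++*(int *)arg;
}
```

With this layout, the initial value of 1 acts as a guard: even if every registered event completes while the trigger is still being set up, the counter cannot reach 0 before sketch_trigger_enable is called.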
4) Using events/triggers, each input/output data of a computational task sends a signal to the computational task once it is ready to be used by the task (in the specified mode: Read, Write, Read-Write...). This simplifies the management of the list of tasks that are to be scheduled.
===========
What is left to do:
1) Manage data transfers from/to host memory and between buffers
Scheduling these commands is easily done with events/triggers. However, the code that actually performs the data transfers is still buggy.
This implies modifications in the way StarPU manages data requests and data transfers. (cf "DataWizard" code)
2) Kernel compilation and execution
2.1) OpenCL compilers can be slow (NVIDIA...). We may choose not to compile every kernel for every device (which is what is currently done).
2.2) Some kernels may not be executed on some devices.
Short-term solution: we may try to compile each kernel on each device. If compilation fails on a device, we need a way to exclude that device from the list of devices on which the kernel may be scheduled.
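One possible way to record which devices a kernel may run on is a per-kernel bitmask filled in at build time. The sketch below is only an illustration, not part of SOCL or StarPU: every name is hypothetical, and the build step is abstracted behind a function pointer (in a real implementation it would presumably wrap a call such as clBuildProgram for a single device).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build hook: returns 0 when the kernel builds for `dev`. */
typedef int (*sketch_build_fn)(int dev, void *ctx);

/* Try to build on every device and return a bitmask in which bit `dev`
 * is set when the build succeeded on that device. */
static uint32_t sketch_eligible_devices(int num_devices,
                                        sketch_build_fn build, void *ctx) {
    assert(num_devices >= 0 && num_devices <= 32);
    uint32_t mask = 0;
    for (int dev = 0; dev < num_devices; dev++) {
        if (build(dev, ctx) == 0)
            mask |= (uint32_t)1 << dev;
    }
    return mask;
}

/* The scheduler would consult this before placing the kernel on `dev`. */
static int sketch_can_run_on(uint32_t mask, int dev) {
    return (int)((mask >> dev) & 1u);
}

/* Stand-in build function for experiments: fails only on device *ctx. */
static int sketch_fail_on(int dev, void *ctx) {
    return dev == *(int *)ctx ? -1 : 0;
}
```

For example, with three devices and a build that fails only on device 1, the resulting mask has bits 0 and 2 set, so the kernel would never be scheduled on device 1.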
3) OpenCL 1.1 support
SOCL supports OpenCL 1.0. It will eventually need to support OpenCL 1.1.
3.1) Sub-buffers: easy implementation with StarPU filters
3.2) User events: easy implementation with StarPU events/triggers
3.3) Event callbacks: easy implementation with StarPU events/triggers
3.4) Thread-safety: check StarPU thread-safety.
3.5) Support of OpenCL 1.1 in StarPU: StarPU's events and triggers can subsume OpenCL's user events and event callbacks. Thus, StarPU's OpenCL backend doesn't need to use polling threads for OpenCL 1.1 devices.
=========
You can find this modified version of StarPU here: https://github.com/hsyl20/StarPU
SOCL's branch socl-events-0.4.2 works with StarPU's branch starpu-events-0.4.2. This is the working implementation that performs poorly.
SOCL's branch socl-events-next "works" with StarPU's branch starpu-events-next. This is the experimental version, which doesn't work properly yet.
Any comment is appreciated :-)
Regards,
Sylvain
- [Starpu-devel] SOCL, Sylvain HENRY, 09/11/2010
- <Possible follow-up(s)>
- Re: [Starpu-devel] SOCL, Cédric Augonnet, 14/11/2010
- Re: [Starpu-devel] SOCL, Sylvain HENRY, 15/11/2010
- Re: [Starpu-devel] SOCL, Denis Barthou, 17/11/2010
- Re: [Starpu-devel] SOCL, Sylvain HENRY, 17/11/2010
- Re: [Starpu-devel] SOCL, Denis Barthou, 18/11/2010
- Re: [Starpu-devel] SOCL, Sylvain HENRY, 18/11/2010
- Re: [Starpu-devel] SOCL, Denis Barthou, 18/11/2010
- Re: [Starpu-devel] SOCL, Sylvain HENRY, 18/11/2010