Subject: Developers list for StarPU
List archives
- From: Luka Stanisic <luka.stanisic@inria.fr>
- To: Mirko Myllykoski <mirkom@cs.umu.se>
- Cc: starpu-devel@lists.gforge.inria.fr
- Subject: Re: [Starpu-devel] StarPU+SimGrid: FetchingInput computation
- Date: Tue, 20 Dec 2016 16:54:53 +0100
- List-archive: <http://lists.gforge.inria.fr/pipermail/starpu-devel/>
- List-id: "Developers list. For discussion of new features, code changes, etc." <starpu-devel.lists.gforge.inria.fr>
Hi,
Indeed, an appropriate design of experiments is essential when using such models. There are many lectures on this topic online; the one I know well is:
https://github.com/alegrand/SMPE/blob/master/lectures/5_design_of_experiments.pdf
If you have time and want to understand this area more deeply, some good references are:
R. Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling"
http://www.cs.wustl.edu/~jain/books/perfbook.htm
Montgomery, Douglas (2013). Design and analysis of experiments
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002024.html
Best,
Luka
On 17/12/2016 10:57, Mirko Myllykoski wrote:
Hi Luka,
Thank you. I will look into the matter further after the New Year. The insufficiently accurate process_window model is probably one culprit, but my calibration procedure might also require some improvement. I should probably implement a separate calibration program that would scan through the parameter space (M, N, F, W) more evenly.
- Mirko
On 2016-12-16 18:41, Luka Stanisic wrote:
Hi Mirko,
There was an error in StarPU code (a matrix was filled row-by-row
instead of column-by-column at some point), but it is corrected now.
If you update to the most recent version of the trunk, it should now
work properly, with both StarPU and R generating the same models.
Please report if you encounter any new problems.
Nevertheless, you were right to manually change the coefficient values
(copying them from R). In the future, if you want to test different
models (e.g., if you have multiple modes of your process_window_pm),
this is the right way to go.
So now with the R coefficients (and soon automatically with the same
StarPU ones), you are getting close to the real execution, but not
completely there yet. If I understand correctly, this is now mostly due
to the insufficiently accurate process_window model?
The best I can suggest is to try to add some additional parameters to
this codelet, or even to generate separate models for different groups
of use cases. Unfortunately, there is no magic to help you here, and
the obtained models indeed depend a lot on the design of experiments.
We do have some experience on this topic, so if you are not sure how to
proceed, do not hesitate to ask for more help.
Thank you very much for reporting this (and the previous) problem!
Have a nice weekend :)
Best regards,
Luka
On 16/12/2016 17:28, Mirko Myllykoski wrote:
Hi Luka,
I just copy-pasted the R coefficient to StarPU's configuration files (sampling/codelets/44/*) and the simulation became a bit more accurate:
Actual execution time: 2077.96ms
SimGrid + StarPU coefficients: 248.836ms
SimGrid + R coefficients: 1649.81ms
- Mirko
On 2016-12-16 16:18, Mirko Myllykoski wrote:
Hi Luka,
I provide the flop estimate myself (starpu_task_insert +
STARPU_FLOPS). It is a crude estimate based on the number of
operations performed inside a computational window. An important thing
to know is that it is always of the form a * W, i.e., F = a * W / W =
a.
If the window is small enough, then the flop estimate can directly
predict the execution time (that is what I did previously with the
linear regression model). However, the codelet behaves very
differently when the window is bigger (cache problem). My plan is to
implement a better estimate (I have to extract some information from
the data handles and process it) but right now I want to be sure that
everything else is ok.
I ran the program once with STARPU_CALIBRATE=2 after clearing the
sampling directory. Here are the results:
============================================================
process_window:
# n intercept F*W^3 F*W^2 F*W F
5 4.611986e-10 3.379008e-10 4.547075e-10 3.516407e-10 8.197837e-10
Call:
lm(formula = Duration ~ (I(W^3) + I(W^2) + W + 1):F, data = df)
Residuals:
Min 1Q Median 3Q Max
-4155.0 -1015.6 -46.5 1215.0 6236.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.348e+03 4.004e+02 10.858 < 2e-16 ***
I(W^3):F -4.707e-09 1.926e-09 -2.445 0.01577 *
I(W^2):F 2.709e-06 9.546e-07 2.838 0.00524 **
W:F -3.685e-04 1.186e-04 -3.107 0.00230 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1976 on 137 degrees of freedom
Multiple R-squared: 0.6508, Adjusted R-squared: 0.6432
F-statistic: 85.11 on 3 and 137 DF, p-value: < 2.2e-16
============================================================
left_update:
# n intercept M^2*N
2 3.374755e-05 4.657683e-05
Call:
lm(formula = Duration ~ I(M^2):N, data = df)
Residuals:
Min 1Q Median 3Q Max
-2688.9 -500.4 11.9 680.5 7521.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.994e+02 3.516e+01 25.58 <2e-16 ***
I(M^2):N 2.568e-05 2.354e-07 109.12 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1003 on 1563 degrees of freedom
Multiple R-squared: 0.884, Adjusted R-squared: 0.8839
F-statistic: 1.191e+04 on 1 and 1563 DF, p-value: < 2.2e-16
============================================================
right_update:
# n intercept M*N^2
2 5.729632e-05 1.279552e-04
Call:
lm(formula = Duration ~ I(N^2):M, data = df)
Residuals:
Min 1Q Median 3Q Max
-2596.3 -919.8 -8.4 746.0 6086.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.101e+03 8.146e+01 13.52 <2e-16 ***
I(N^2):M 2.430e-05 5.401e-07 44.99 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1196 on 470 degrees of freedom
Multiple R-squared: 0.8116, Adjusted R-squared: 0.8112
F-statistic: 2024 on 1 and 470 DF, p-value: < 2.2e-16
============================================================
- Mirko
On 2016-12-16 15:34, Luka Stanisic wrote:
Hi,
On 16/12/2016 14:39, Mirko Myllykoski wrote:
> Hi Luka,
> The link that was included in my previous email should include the whole sampling directory. However, here is a direct link to the left_update_pm.out file you requested:
> https://dl.dropboxusercontent.com/u/1521774/problem/left_update_pm.out

Sorry, my bad: I missed the fact that the sampling was already attached and that it contains the tmp folder.

> I calibrated the models by running the program multiple times with different parameter values. The STARPU_CALIBRATE environment variable was set to 1 during these calibration runs.

Yes, this is the way it should be done, so it is very strange that the values computed by StarPU and R do not match.

Just for the sake of testing, could you please run one single experiment with STARPU_CALIBRATE=2 to generate a trace, and then compare the StarPU and R output for left_update_pm again? Before that, save your current sampling folder somewhere and delete it, or simply make sure that the new sampling folder is generated using STARPU_HOME=~/new/folder/place/

> I agree that my current model for the process_window codelet is inaccurate. However, to me, the paje traces (see the previous link) indicate that this is not the only problem.

The first thing you should do is probably not to divide flops by the window size to get the F parameter. To do multiple linear regression properly, all parameters should be independent, and if I understand the code correctly, your F already contains W.

Second, how is task->flops computed? Is it something provided by StarPU, or do you compute it somewhere?

Finally, I see that you have the unused parameters &block_size and &normal. Could these two help with the model?

> Best,
> - Mirko
On 2016-12-16 14:22, Luka Stanisic wrote:
Hi Mirko,
Models produced by StarPU should match the ones produced by R. Did you
use exactly the same experimental data to generate both of them?
Could you please send me your
.starpu/sampling/codelets/tmp/left_update_pm.out so I can check the
models as well?
BTW, the parameters and their combinations for the left_update and
right_update codelets seem to be good, so the generated models should
be quite accurate. However, this is not the case for the
process_window codelet, which is not so well described by the provided
parameters. Its R^2 is low (0.66) and the coefficients are extremely
small (~1e-11), so in order to use this type of model for
process_window you will need to find a better formula or add
additional parameters. The fact that all the parameter combinations
you are using (F:W + F:W^2 + F:W^3) appear significant is actually
probably due to overfitting.
Best,
Luka
On 16/12/2016 13:59, Mirko Myllykoski wrote:
Hi,
I am having some problems with the multiple regression models. The models produced by StarPU appear to be different from the ones produced by R, and the predicted execution times are also way off.
Here is the data:
https://www.dropbox.com/sh/16kbqq238su7fac/AAApwEJk-4NwmaKoJpxwSXy3a?dl=0
===================================================================================
void process_window_parameters(struct starpu_task *task, double *parameters)
{
int begin, end, block_size;
window_normal_t normal;
starpu_codelet_unpack_args(
task->cl_arg, &begin, &end, &block_size, &normal);
// swap count (F), should be linear
parameters[0] = task->flops/(end-begin);
// window size (W), exhibits non-linear behavior
parameters[1] = end-begin;
}
static struct starpu_perfmodel process_window_pm = {
.type = STARPU_MULTIPLE_REGRESSION_BASED,
.symbol = "process_window_pm",
.parameters = &process_window_parameters,
.nparameters = 2,
.parameters_names = (const char*[]) { "F", "W" },
.combinations = (unsigned*[]) {
(unsigned[]) { 1, 3 },
(unsigned[]) { 1, 2 },
(unsigned[]) { 1, 1 },
(unsigned[]) { 1, 0 }},
.ncombinations = 4
};
StarPU model:
# n intercept F*W^3 F*W^2 F*W F
5 1.208684e-10 2.724579e-11 5.466467e-11 6.864963e-11 7.812351e-11
R analysis:
Call:
lm(formula = Duration ~ (I(W^3) + I(W^2) + I(W)):F + F, data = df)
Residuals:
Min 1Q Median 3Q Max
-61638 -4689 -1517 2868 97895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.218e+03 2.676e+02 30.71 <2e-16 ***
F -8.920e-02 5.437e-03 -16.41 <2e-16 ***
I(W^3):F 2.581e-09 1.637e-10 15.77 <2e-16 ***
I(W^2):F -2.373e-06 1.667e-07 -14.24 <2e-16 ***
I(W):F 7.851e-04 5.354e-05 14.66 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10460 on 4590 degrees of freedom
Multiple R-squared: 0.6702, Adjusted R-squared: 0.6699
F-statistic: 2332 on 4 and 4590 DF, p-value: < 2.2e-16
===================================================================================
void left_update_parameters(struct starpu_task *task, double *parameters)
{
int cbegin, cend, rbegin, rend, block_size;
starpu_codelet_unpack_args(
task->cl_arg, &rbegin, &rend, &cbegin, &cend, &block_size);
// Row count (M), exhibits non-linear behavior (probably quadratic)
parameters[0] = rend-rbegin;
// Column count (N), should be linear
parameters[1] = cend-cbegin;
}
static struct starpu_perfmodel left_update_pm = {
.type = STARPU_MULTIPLE_REGRESSION_BASED,
.symbol = "left_update_pm",
.parameters = &left_update_parameters,
.nparameters = 2,
.parameters_names = (const char*[]) { "M", "N" },
.combinations = (unsigned*[]) { (unsigned[]) { 2, 1 } },
.ncombinations = 1
};
StarPU model:
# n intercept M^2*N
2 7.926921e-06 1.863154e-05
R analysis:
Call:
lm(formula = Duration ~ I(M^2):N, data = df)
Residuals:
Min 1Q Median 3Q Max
-18152.9 -1030.8 -647.1 699.4 21390.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.190e+03 2.140e+01 55.58 <2e-16 ***
I(M^2):N 6.272e-05 1.130e-07 555.12 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2049 on 12022 degrees of freedom
Multiple R-squared: 0.9625, Adjusted R-squared: 0.9624
F-statistic: 3.082e+05 on 1 and 12022 DF, p-value: < 2.2e-16
===================================================================================
void right_update_parameters(struct starpu_task *task, double *parameters)
{
int cbegin, cend, rbegin, rend, block_size;
starpu_codelet_unpack_args(
task->cl_arg, &rbegin, &rend, &cbegin, &cend, &block_size);
// Row count (M), should be linear
parameters[0] = rend-rbegin;
// Column count (N), exhibits non-linear behavior (probably quadratic)
parameters[1] = cend-cbegin;
}
static struct starpu_perfmodel right_update_pm = {
.type = STARPU_MULTIPLE_REGRESSION_BASED,
.symbol = "right_update_pm",
.parameters = &right_update_parameters,
.nparameters = 2,
.parameters_names = (const char*[]) { "M", "N" },
.combinations = (unsigned*[]) { (unsigned[]) { 1, 2 } },
.ncombinations = 1
};
StarPU model:
# n intercept M*N^2
2 7.679284e-06 1.281794e-05
R analysis:
Call:
lm(formula = Duration ~ I(N^2):M, data = df)
Residuals:
Min 1Q Median 3Q Max
-24144 -466 -392 -92 35622
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.118e+02 5.824e+00 87.86 <2e-16 ***
I(N^2):M 6.506e-05 4.722e-08 1377.73 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1482 on 72861 degrees of freedom
Multiple R-squared: 0.963, Adjusted R-squared: 0.963
F-statistic: 1.898e+06 on 1 and 72861 DF, p-value: < 2.2e-16
===================================================================================
Best Regards,
Mirko
On 2016-12-14 11:38, Luka Stanisic wrote:
Hi Mirko,
Strangely, the difference between Native and SimGrid does seem to come
from FetchingInput on the workers and/or Allocating on the memory node
(see the attached table with some stats). Other discrepancies in state
durations are less relevant or a product of the previous two. In fact,
I am not even sure whether the longer Allocating is actually
responsible for the longer corresponding FetchingInput states. I will
inspect the code in more detail and get back to you ASAP.
As the durations of your tasks (process_window, left_update,
right_update) are predicted quite accurately, I believe that once we
solve this Allocating/FetchingInput issue, you will be able to get
faithful simulation predictions for more cores and more complicated
inputs. In the future, if your tasks are too complex and simple linear
regression models aren't sufficient, you might want to switch to the
more advanced multiple linear regression models, currently available
only in the StarPU trunk (see examples/mlr/mlr.c and the trunk
documentation for more details); these will be part of the next StarPU
release.
Best,
Luka
On 14/12/2016 10:22, Mirko Myllykoski wrote:
Hi Luka,
Just to clarify my previous email. I do not doubt SimGrid's ability to accurately predict execution times in general. However, the numerical code I am developing has some features which I believe will make the simulation harder (large number of small tasks, complicated data dependencies, varying sensitivity to parameter value changes, etc).
Best Regards,
Mirko
On 2016-12-14 09:52, Mirko Myllykoski wrote:
Hi Luka,
Here are the two paje traces you requested:
https://dl.dropboxusercontent.com/u/1521774/paje.trace.tar.gz
https://dl.dropboxusercontent.com/u/1521774/paje-simgrid.trace.tar.gz
I must say that the way I am using the starpu_perfmodel::size_base
field is a bit unorthodox. That's why I ran the simulation twice, once
with the size_base field and once without it. My ultimate intention is
to autotune my code using an external black box software. However, the
execution time may vary from a few seconds to hours depending on the
input data and various parameters. I hope that SimGrid would help with
this problem by saving countless CPU hours.
Right now, I am trying to figure out whether this idea is feasible. I
realized that a linear regression model can predict the codelet
execution times quite accurately provided that the input data does not
change too much and parameters are kept constants. These regression
models cannot be used to autotune anything but if SimGrid fails the
predict the total execution time in this simple case, then I can be
quite sure that this overall idea does not work and I should try
something else instead.
Best Regards,
Mirko
On 2016-12-13 16:40, Luka Stanisic wrote:
Hi Mirko,
Indeed, I was wondering if your platform has any GPUs, but as you said
it is a simple 4-core machine. Adding more CPUs or GPUs in the future
shouldn't be a problem.
You are right, SimGrid shouldn't include any significant fetching time
in the simulation, since everything is running in shared memory.
However, the appearance of the FetchingInput state in the traces is
possible, since StarPU passes through many parts of the code. Still,
the duration of FetchingInput should be negligible.
Could you please share two paje.trace traces (one for real execution
and one for SimGrid), so I can try to understand better what is
happening? If the traces are big (>100MB), it might be better to run
your application with smaller problem size (if possible).
Also from what I have seen, you are using STARPU_REGRESSION_BASED or
STARPU_NL_REGRESSION_BASED performance models for your codelets,
right? Is this something that you need for your application?
Personally, I have never tried to simulate applications using these
models, although I don't see any reason why it shouldn't work. The
starpu_perfmodel::size_base field is actually used by these models;
more information is available here:
http://starpu.gforge.inria.fr/doc/html/OnlinePerformanceTools.html
So my first guess is that you are somehow using the codelet perfmodels
and their size_base incorrectly (or there is an unknown bug in the
StarPU or StarPU+SimGrid code), which makes the simulation longer than
expected. Then, in the traces this is manifested as long FetchingInput
states, although fetching inputs has nothing to do with the actual
problem.
Best regards,
Luka
On 13/12/2016 14:34, Mirko Myllykoski wrote:
Hi Luka,
and thank you for your reply.
I performed the same experiment twice, once with the size_base field included and once without it. I erased the samples directory before each experiment and gave it a few rounds to calibrate properly (STARPU_CALIBRATE=1). Here are the corresponding sample folders:
https://dl.dropboxusercontent.com/u/1521774/sampling_with_size_base.tar.gz
https://dl.dropboxusercontent.com/u/1521774/sampling_without_size_base.tar.gz
In this case, the error seems to be about 35%.
As I mentioned in my previous email, the code is shared memory only (at the moment). I performed the experiment on my local machine (quad-core i5) but my plan is to move on to a bigger machine (28 or 42 cores per node) and distributed memory once everything works.
I don't quite understand why SimGrid would include any fetching time to the simulation since everything is running in shared memory.
Best Regards,
Mirko
On 2016-12-12 18:21, Luka Stanisic wrote:
Hello Mirko,
Indeed, 50% prediction error is quite big and it suggests that
something is probably not correctly configured. Could you please send
us a compressed version of your ".starpu/sampling" folder, the one
from which the simulation will read the performance models? This can
help us get a first idea of the machine and application you are trying
to simulate.
To answer your question, the fetching time is computed based on the
size of the data being transferred, the latency and bandwidth of the
link (in the machine.platform.xml file), and the possible contention
due to other transfers occurring in parallel.
Best regards,
Luka
On 07/12/2016 12:50, Mirko Myllykoski wrote:
Hi,
my name is Mirko Myllykoski and I work as a PostDoc researcher for the NLAFET project at Umeå University.
I am currently implementing a (shared memory) numerical software using StarPU and I am trying to simulate my code using SimGrid. However, I noticed that the simulated execution time is way off (about 50%). I checked the generated FxT traces using vite and it seems that SimGrid introduces too much fetching time (State: FetchingInput) to the simulation.
How is this fetching time being computed? My performance models include the starpu_perfmodel::size_base data field and I guess that information is somehow used to compute the fetch time.
Best Regards,
Mirko Myllykoski
_______________________________________________
Starpu-devel mailing list
Starpu-devel@lists.gforge.inria.fr
http://lists.gforge.inria.fr/mailman/listinfo/starpu-devel
Luka