Subject: Discussion related to cado-nfs
List archive
- From: Zachary Harris <zacharyharris@hotmail.com>
- To: Emmanuel Thomé <Emmanuel.Thome@gmail.com>
- Cc: cado-nfs-discuss@lists.gforge.inria.fr
- Subject: Re: [Cado-nfs-discuss] Duplicate las sieve processes
- Date: Mon, 19 Dec 2011 11:35:42 -0500
- List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss>
- List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>
The actual names of the project, directories, etc., contain
confidential info for the parties involved. I could clean up the
.cmd and .log files with some find-and-replace work if you really,
really needed them, but I don't think that will be necessary. I
could be wrong here, but it seems to me the issue is in lines
1177-1254 of cadofct.pm (of course I'm in CADO-NFS 1.1 here):

# Start new job(s) (parallel mode)

I'm far from a professional Perl programmer, but it appears that new jobs are only written to the jobs file after all jobs on all hosts have started. So if you interrupt during the "Start new jobs" process, you'll likely get duplicates.

Now, why did I go and do something fragile like interrupting during "Start new jobs" anyway? A couple of reasons:

1) A lot of jobs weren't actually starting anyway, because I was getting rsync timeout errors. I started watching the rsync buffer files (.myfile.tmp or whatever it uses), and everything appeared to be going along just fine up to the end. It took several minutes to transfer the 27M myprob.roots file, and it looked like everything was going to succeed. But then, after transferring a good 27M, cadofactor would throw an error at me saying the transfer was unsuccessful because of a timeout. My guess is that rsync's final checksum verification of the transferred file was taking more than 30 seconds, and since there is no "I/O" during this verification stage, that counts as a timeout violation! Can cadofactor's "timeout" be changed as a command-line input parameter? If so, I didn't see that documented anywhere; it would be good to document. (In my case I'd set the timeout back to the default of 0.)

2) One reason for my rsync slowness is that I probably set things up in the naive manner of a first-time CADO user (and pretty much first-time distributed-computing user). My "master" machine is my physical laptop on wireless; all my other hosts are on Amazon's EC2. Network transfer between the EC2 machines is of course much faster than from my laptop to each of them individually. So when I realized this, I wanted to quit cadofactor's "Start new jobs" work, rsync myprob.roots up to one host machine on EC2, then rsync from that machine to all the others over EC2's internal network.
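The relay copy described here can be sketched in a few lines of shell. This is only an illustration of the idea, not anything cadofactor provides: the EC2 host names are hypothetical, and a `run` wrapper echoes each command instead of executing it, so the plan can be inspected before spending bandwidth.

```shell
#!/bin/sh
# Dry-run sketch of the relay copy: one slow upload from the laptop to a
# relay EC2 host, then host-to-host fan-out over EC2's internal network.
# Host names are placeholders; "run" echoes commands instead of running them.
RELAY=ec2-host-1
FILE=/tmp/cado-data/myprob.roots
run() { echo "$@"; }     # drop the echo to actually execute

run rsync -av "$FILE" "$RELAY:$FILE"               # once, over slow wireless
for h in ec2-host-2 ec2-host-3; do
    run ssh "$RELAY" rsync -av "$FILE" "$h:$FILE"  # fast internal copies
done
```

Note that rsync's own `--timeout` option defaults to 0 (never time out), which is separate from whatever timeout cadofactor itself enforces on the transfer.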
It worked beautifully: it was super fast to transfer the roots file between the various EC2 hosts, and when I started up cadofactor again on my physical laptop, it quickly recognized the various myprob.roots files as valid on each host (why did this verification step take much less than 30 seconds, whereas the verification after transfer took over 30 seconds? I don't know; some of this is just conjecture), and it started parceling out jobs quite efficiently. Everything worked great, except that a couple of interrupts into "Start new jobs" ended up giving me duplicate sieving processes on a couple of my machines.

So, that's my story. Do what you want with it. Although there is always room for improvement, overall I'm very impressed with cadofactor's robustness to someone like me trying to "figure things out" one step at a time as I go along.

Many thanks,
Zach

On 12/19/2011 10:09 AM, Emmanuel Thomé wrote:

On Mon, Dec 19, 2011 at 09:50:51AM -0500, Zachary Harris wrote:
> Hello,

Hi,

Thank you for your interest in cado-nfs.

> Because of having stopped and restarted cadofactor.pl at a point while
> it was passing out sieve jobs, I have a couple of machines which have
> duplicate copies of "las" with the same command line running
> concurrently on a single box. So, for example, we see something like
> this:

The situation you describe appears to be a plain bug. It's something which ought to be handled, and apparently isn't (in your precise situation).

You may safely kill the process which does not actually own the output file. My guess is that if the output file being created looks correct, it's because the process which was created last (presumably the one with the highest PID) somehow unlinked the output file, so that the earlier process actually writes nowhere. You may check this by doing:

fuser /tmp/cado-data/myprob.rels.44000000-45000000.gz

which should, if my guess is correct, return only one PID, which should actually be that of a gzip process.
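The ownership check suggested here can be approximated with /proc alone if fuser is not installed. A minimal Linux-only sketch, in which a background `sleep` writing to a scratch file stands in for the gzip process holding the real .rels.gz file open:

```shell
#!/bin/sh
# Sketch: find which PID holds a file open (what "fuser FILE" reports) by
# scanning /proc/*/fd, then read that process's parent from /proc/PID/status.
# A background "sleep" stands in for the gzip writer; the file is a scratch
# file, not the real relations output.
OUT=$(mktemp)
sleep 5 > "$OUT" &          # stand-in process keeping $OUT open as stdout
WRITER=$!
sleep 1                     # let it start
for fd in /proc/[0-9]*/fd/*; do
    [ "$(readlink "$fd" 2>/dev/null)" = "$OUT" ] || continue
    pid=${fd#/proc/}; pid=${pid%%/*}
    echo "held open by PID $pid"
done
grep '^PPid' "/proc/$WRITER/status"   # parent of the process holding the file
kill "$WRITER"; rm -f "$OUT"
```

Here the PPid line reports the shell that launched the stand-in writer, just as the parent of the real gzip process would be the las instance actually producing data.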
The parent of this gzip process (say fuser above returned 6286) is readable by:

grep ^PPid /proc/6286/status

That should give you the PID of the process which is currently creating data. The other one may safely be killed.

Now we would like to know more about the sequence of events which led to this. If you could provide us with the .cmd and .log files, for instance, or a more precise description of what you did, that would be welcome.

Best,

E.

> $ ps aux | grep las
> zach 3594 193 4.7 356744 187868 ? SNl Dec18 1181:21 /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0 44000000 -q1 45000000 -mt 2 -out /tmp/cado-data/myprob.rels.44000000-45000000.gz
> zach 6285 193 2.5 327896 102612 ? SNl Dec18 1159:20 /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0 44000000 -q1 45000000 -mt 2 -out /tmp/cado-data/myprob.rels.44000000-45000000.gz
>
> If I peek into the output files (which seem to be about 1/4 of the way done), they seem "OK" as far as I can tell. Namely, it seems things are being done in proper order without duplicate entries. For example:
>
> $ zcat /tmp/cado-data/myprob.rels.44000000-45000000.gz | grep Siev
> # Sieving parameters: rlim=16000000 alim=32000000 lpbr=30 lpba=30
> # Sieving q=44000009; rho=21111199; a0=1777611; b0=-2; a1=-220133; b1=25
> # Sieving q=44000009; rho=16166352; a0=-839375; b0=19; a1=-1829836; b1=-11
> # Sieving q=44000009; rho=19239778; a0=163615; b0=-16; a1=2678419; b1=7
> # Sieving q=44000023; rho=34914830; a0=-1425942; b0=5; a1=529541; b1=29
> # Sieving q=44000083; rho=28079786; a0=-877065; b0=-11; a1=2006678; b1=-25
> # Sieving q=44000083; rho=10381365; a0=482873; b0=17; a1=2474623; b1=-4
> # Sieving q=44000083; rho=10381632; a0=487412; b0=17; a1=2473555; b1=-4
> # Sieving q=44000101; rho=9273155; a0=-189541; b0=-19; a1=-2365674; b1=-5
> # Sieving q=44000111; rho=37922759; a0=1458647; b0=7; a1=242764; b1=-29
> ...
> # Sieving q=44256959; rho=22450463; a0=-643967; b0=-2; a1=1199552; b1=-65
> # Sieving q=44256959; rho=39638757; a0=768080; b0=19; a1=-1925061; b1=10
> # Sieving q=44256997; rho=14965671; a0=-640016; b0=-3; a1=2165351; b1=-59
> # Sieving q=44257007; rho=14413283; a0=1017158; b0=-3; a1=1190229; b1=40
> # Sieving q=44257013; rho=37630038; a0=231539; b0=20; a1=-2131812; b1=7
> # Sieving q=44257027; rho=8141904; a0=1046890; b0=11; a1=1453727; b1=-27
> # Sieving q=44257079; rho=41400590; a0=1409744; b0=15; a1=1446745; b1=-16
> # Sieving q=44257091; rho=29819662; a0=944804; b0=3; a1=1210173; b1=-43
> # Sieving q=44257091; rho=18016890; a0=1570268; b0=5; a1=371971; b1=-27
> # Sieving q=44257097; rho=35764170; a0=1323079; b0=-21; a1=1792462; b1=5
>
> However, the processes that are duplicated do seem to be making significantly slower progress than the processes that aren't. So, my questions are: Can I safely kill off one of the duplicate processes? If so, should I kill the newer one or the older one? Or should I kill off both processes and restart somehow; and if so, is there anything I need to do to ensure that I can make use of the progress made so far (about 20 hours' worth on a couple of different machines, and I'm paying for these cloud resources)?
>
> Many thanks!
>
> -Zach

_______________________________________________
Cado-nfs-discuss mailing list
Cado-nfs-discuss@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/cado-nfs-discuss
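As an aside, duplicated workers like the two las processes in the listing above can be flagged generically by grouping ps output by full command line. This is only a sketch, not a cadofactor feature; two identical `sleep` commands stand in for the duplicated las invocations:

```shell
#!/bin/sh
# Sketch: flag processes whose full command line occurs more than once,
# as the two byte-identical "las ... -out ..." command lines above did.
# Identical "sleep" commands stand in for the duplicated las workers.
sleep 737 & A=$!
sleep 737 & B=$!
sleep 739 & C=$!
sleep 1    # let the workers start
ps -o args= -p "$A,$B,$C" | sort | uniq -c |
    awk '$1 > 1 { sub(/^ *[0-9]+ /, ""); print "duplicated:", $0 }'
kill "$A" "$B" "$C"
```

On a Linux procps system this prints the repeated command line (`duplicated: sleep 737`), mirroring how the two identical las invocations would stand out.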
- [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- <Possible follow-up(s)>
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
Archive powered by MHonArc 2.6.19+.