
Re: [Cado-nfs-discuss] Duplicate las sieve processes


  • From: Emmanuel Thomé <Emmanuel.Thome@gmail.com>
  • To: Zachary Harris <zacharyharris@hotmail.com>
  • Cc: cado-nfs-discuss@lists.gforge.inria.fr
  • Subject: Re: [Cado-nfs-discuss] Duplicate las sieve processes
  • Date: Mon, 19 Dec 2011 16:09:10 +0100
  • List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss>
  • List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

On Mon, Dec 19, 2011 at 09:50:51AM -0500, Zachary Harris wrote:
> Hello,

Hi,

Thank you for your interest in cado-nfs.

> Because of having stopped and restarted cadofactor.pl at a point while
> it was passing out sieve jobs, I have a couple of machines which have
> duplicate copies of "las" with the same command line running
> concurrently on a single box. So, for example, we see something like this:

The situation you describe appears to be a plain bug. It's something
that ought to be handled, and apparently isn't in your precise
situation.

You may safely kill the process which does not actually own the output
file. My guess is that the output file being created looks correct
because the process created last (presumably the one with the highest
PID) somehow unlinked and recreated the output file, so that the earlier
process is now writing to an unlinked file, i.e. nowhere visible. You
may check this by running:

fuser /tmp/cado-data/myprob.rels.44000000-45000000.gz

which should, if my guess is correct, return only one PID, and that
process should be gzip. The parent of this gzip process (say fuser above
returned 6286) can be read with:

grep ^PPid /proc/6286/status

This should give you the PID of the las process which is currently
producing data. The other las process may safely be killed.
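
If it helps, the two steps can be chained in the shell; this is just a
sketch, and it assumes the psmisc fuser, which prints the matching PIDs
on stdout and the file name on stderr:

# take the single PID that fuser reports for the output file (should be gzip)
PID=$(fuser /tmp/cado-data/myprob.rels.44000000-45000000.gz 2>/dev/null | awk '{print $1}')
# its parent is the las process which currently owns the output
grep ^PPid /proc/$PID/status

As a further cross-check, if my guess about the unlinking is correct,
the stale writer should show the output file as "(deleted)" in
ls -l /proc/<pid>/fd, where <pid> is the gzip child of the older las
process.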


Now we would like to know more about the sequence of events which led to
this. If you could provide us with the .cmd and .log files, for
instance, or a more precise description of what you did, that would be
welcome.

Best,

E.

>
> > ps aux | grep las
> zach 3594 193 4.7 356744 187868 ? SNl Dec18 1181:21
> /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly
> /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0
> 44000000 -q1 45000000 -mt 2 -out
> /tmp/cado-data/myprob.rels.44000000-45000000.gz
> zach 6285 193 2.5 327896 102612 ? SNl Dec18 1159:20
> /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly
> /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0
> 44000000 -q1 45000000 -mt 2 -out
> /tmp/cado-data/myprob.rels.44000000-45000000.gz
>
> If I peek into the output files (which seem to be about 1/4 of the way
> done), they seem "OK" as far as I can tell (???). Namely, it seems things
> are being done in proper order, without duplicate entries. For example:
>
> $ zcat /tmp/cado-data/myprob.rels.44000000-45000000.gz | grep Siev
> # Sieving parameters: rlim=16000000 alim=32000000 lpbr=30 lpba=30
> # Sieving q=44000009; rho=21111199; a0=1777611; b0=-2; a1=-220133; b1=25
> # Sieving q=44000009; rho=16166352; a0=-839375; b0=19; a1=-1829836; b1=-11
> # Sieving q=44000009; rho=19239778; a0=163615; b0=-16; a1=2678419; b1=7
> # Sieving q=44000023; rho=34914830; a0=-1425942; b0=5; a1=529541; b1=29
> # Sieving q=44000083; rho=28079786; a0=-877065; b0=-11; a1=2006678; b1=-25
> # Sieving q=44000083; rho=10381365; a0=482873; b0=17; a1=2474623; b1=-4
> # Sieving q=44000083; rho=10381632; a0=487412; b0=17; a1=2473555; b1=-4
> # Sieving q=44000101; rho=9273155; a0=-189541; b0=-19; a1=-2365674; b1=-5
> # Sieving q=44000111; rho=37922759; a0=1458647; b0=7; a1=242764; b1=-29
> ...
> # Sieving q=44256959; rho=22450463; a0=-643967; b0=-2; a1=1199552; b1=-65
> # Sieving q=44256959; rho=39638757; a0=768080; b0=19; a1=-1925061; b1=10
> # Sieving q=44256997; rho=14965671; a0=-640016; b0=-3; a1=2165351; b1=-59
> # Sieving q=44257007; rho=14413283; a0=1017158; b0=-3; a1=1190229; b1=40
> # Sieving q=44257013; rho=37630038; a0=231539; b0=20; a1=-2131812; b1=7
> # Sieving q=44257027; rho=8141904; a0=1046890; b0=11; a1=1453727; b1=-27
> # Sieving q=44257079; rho=41400590; a0=1409744; b0=15; a1=1446745; b1=-16
> # Sieving q=44257091; rho=29819662; a0=944804; b0=3; a1=1210173; b1=-43
> # Sieving q=44257091; rho=18016890; a0=1570268; b0=5; a1=371971; b1=-27
> # Sieving q=44257097; rho=35764170; a0=1323079; b0=-21; a1=1792462; b1=5
>
> However, processes that are duplicated do seem to be making
> significantly slower progress than the processes that aren't duplicated.
>
> So, my question is: Can I safely kill off one of the duplicate
> processes? Should I kill the newer or the older one? Or should I kill
> off both processes and restart somehow; and if so, is there anything I
> need to do to ensure that I'll be able to make use of the progress made
> so far (about 20 hours' worth on a couple of different machines, and I'm
> paying for these cloud resources)?
>
> Many thanks!
>
> -Zach
