Subject: Discussion related to cado-nfs
List archive
- From: Zachary Harris <zacharyharris@hotmail.com>
- To: Emmanuel Thomé <Emmanuel.Thome@gmail.com>
- Cc: cado-nfs-discuss@lists.gforge.inria.fr
- Subject: Re: [Cado-nfs-discuss] Duplicate las sieve processes
- Date: Mon, 19 Dec 2011 11:35:42 -0500
- List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss>
- List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>
The actual names of the project, directories, etc., contain
confidential info for the parties involved. I could clean up the
.cmd and .log files with some find-and-replace work if you really,
really needed them, but I don't think that will be necessary. I
could be wrong here, but it seems to me the issue is in lines
1177-1254 of cadofct.pm (of course I'm in CADO-NFS 1.1 here):

# Start new job(s) (parallel mode)

I'm far from a professional Perl programmer, but it appears that new jobs are only written to the jobs file after all jobs on all hosts have started. So if you interrupt during the "Start new jobs" process, you'll likely get duplicates.

Now, why did I go and do something fragile like interrupting during "Start new jobs" anyway? A couple of reasons:

1) A lot of jobs weren't actually starting anyway, because I was getting rsync timeout errors. I started watching the rsync buffer files (.myfile.tmp or whatever it uses), and everything appeared to be going along just fine up to the end. It took several minutes to transfer the 27M myprob.roots file, and it looked like everything was going to succeed. But then, after transferring a good 27M, cadofactor would throw an error at me saying the transfer was unsuccessful because of a timeout. My guess is that rsync's final checksum verification of the transferred file was taking more than 30 seconds, and since there is no "I/O" during this verification stage, that counts as a timeout violation! Can cadofactor's "timeout" be changed as a command-line input parameter? If so, I didn't see that documented anywhere; it would be good to document. (In my case I'd set the timeout back to the default of 0.)

2) One reason for my rsync slowness is that I probably set things up in the naive manner of a first-time CADO user (and pretty much first-time distributed-computing user). My "master" machine is my physical laptop on wireless; all my other hosts are on Amazon's EC2. Network transfer between the EC2 machines is of course much faster than from my laptop to each of them individually. So when I realized this, I wanted to quit cadofactor's "Start new jobs" work, rsync myprob.roots up to one host machine on EC2, then rsync from that machine to all the others over EC2's internal network.
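The relay copy described here can be sketched in a few lines of shell. This is only an illustration of the idea, not anything cadofactor provides: the EC2 host names are hypothetical, and a `run` wrapper echoes each command instead of executing it, so the plan can be inspected before spending bandwidth.

```shell
#!/bin/sh
# Dry-run sketch of the relay copy: one slow upload from the laptop to a
# relay EC2 host, then host-to-host fan-out over EC2's internal network.
# Host names are placeholders; "run" echoes commands instead of running them.
RELAY=ec2-host-1
FILE=/tmp/cado-data/myprob.roots
run() { echo "$@"; }     # drop the echo to actually execute

run rsync -av "$FILE" "$RELAY:$FILE"               # once, over slow wireless
for h in ec2-host-2 ec2-host-3; do
    run ssh "$RELAY" rsync -av "$FILE" "$h:$FILE"  # fast internal copies
done
```

Note that rsync's own `--timeout` option defaults to 0 (never time out), which is separate from whatever timeout cadofactor itself enforces on the transfer.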
It worked beautifully: it was super fast to transfer the roots file between the various EC2 hosts, and when I started up cadofactor again on my physical laptop, it quickly recognized the various myprob.roots files as valid on each host (why did this verification step take much less than 30 seconds, whereas the verification after transfer took over 30 seconds? I don't know; some of this is just conjecture), and it started parceling out jobs quite efficiently. Everything worked great, except that a couple of interrupts into "Start new jobs" ended up giving me duplicate sieving processes on a couple of my machines.

So, that's my story. Do what you want with it. Although there is always room for improvement, overall I'm very impressed with cadofactor's robustness to someone like me trying to "figure things out" one step at a time as I go along.

Many thanks,
Zach

On 12/19/2011 10:09 AM, Emmanuel Thomé wrote:

On Mon, Dec 19, 2011 at 09:50:51AM -0500, Zachary Harris wrote:
> Hello,

Hi,

Thank you for your interest in cado-nfs.

> Because of having stopped and restarted cadofactor.pl at a point while
> it was passing out sieve jobs, I have a couple of machines which have
> duplicate copies of "las" with the same command line running
> concurrently on a single box. So, for example, we see something like
> this:

The situation you describe appears to be a plain bug. It's something which ought to be handled, and apparently isn't (in your precise situation).

You may safely kill the process which does not actually own the output file. My guess is that if the output file being created looks correct, it's because the process which was created last (presumably the one with the highest PID) somehow unlinked the output file, so that the earlier process actually writes nowhere. You may check this by doing:

fuser /tmp/cado-data/myprob.rels.44000000-45000000.gz

which should, if my guess is correct, return only one PID, which should actually be that of a gzip process.
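The ownership check suggested here can be approximated with /proc alone if fuser is not installed. A minimal Linux-only sketch, in which a background `sleep` writing to a scratch file stands in for the gzip process holding the real .rels.gz file open:

```shell
#!/bin/sh
# Sketch: find which PID holds a file open (what "fuser FILE" reports) by
# scanning /proc/*/fd, then read that process's parent from /proc/PID/status.
# A background "sleep" stands in for the gzip writer; the file is a scratch
# file, not the real relations output.
OUT=$(mktemp)
sleep 5 > "$OUT" &          # stand-in process keeping $OUT open as stdout
WRITER=$!
sleep 1                     # let it start
for fd in /proc/[0-9]*/fd/*; do
    [ "$(readlink "$fd" 2>/dev/null)" = "$OUT" ] || continue
    pid=${fd#/proc/}; pid=${pid%%/*}
    echo "held open by PID $pid"
done
grep '^PPid' "/proc/$WRITER/status"   # parent of the process holding the file
kill "$WRITER"; rm -f "$OUT"
```

Here the PPid line reports the shell that launched the stand-in writer, just as the parent of the real gzip process would be the las instance actually producing data.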
The parent of this gzip process (say fuser above returned 6286) is readable by:

grep ^PPid /proc/6286/status

That should give you the PID of the process which is currently creating data. The other one may safely be killed.

Now we would like to know more about the sequence of events which led to this. If you could provide us with the .cmd and .log files, for instance, or a more precise description of what you did, that would be welcome.

Best,

E.

> $ ps aux | grep las
> zach 3594 193 4.7 356744 187868 ? SNl Dec18 1181:21 /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0 44000000 -q1 45000000 -mt 2 -out /tmp/cado-data/myprob.rels.44000000-45000000.gz
> zach 6285 193 2.5 327896 102612 ? SNl Dec18 1159:20 /opt/cado-nfs-1.1/installed/bin/sieve/las -I 13 -poly /tmp/cado-data/myprob.poly -fb /tmp/cado-data/myprob.roots -q0 44000000 -q1 45000000 -mt 2 -out /tmp/cado-data/myprob.rels.44000000-45000000.gz
>
> If I peek into the output files (which seem to be about 1/4 of the way done), they seem "OK" as far as I can tell. Namely, it seems things are being done in proper order without duplicate entries. For example:
>
> $ zcat /tmp/cado-data/myprob.rels.44000000-45000000.gz | grep Siev
> # Sieving parameters: rlim=16000000 alim=32000000 lpbr=30 lpba=30
> # Sieving q=44000009; rho=21111199; a0=1777611; b0=-2; a1=-220133; b1=25
> # Sieving q=44000009; rho=16166352; a0=-839375; b0=19; a1=-1829836; b1=-11
> # Sieving q=44000009; rho=19239778; a0=163615; b0=-16; a1=2678419; b1=7
> # Sieving q=44000023; rho=34914830; a0=-1425942; b0=5; a1=529541; b1=29
> # Sieving q=44000083; rho=28079786; a0=-877065; b0=-11; a1=2006678; b1=-25
> # Sieving q=44000083; rho=10381365; a0=482873; b0=17; a1=2474623; b1=-4
> # Sieving q=44000083; rho=10381632; a0=487412; b0=17; a1=2473555; b1=-4
> # Sieving q=44000101; rho=9273155; a0=-189541; b0=-19; a1=-2365674; b1=-5
> # Sieving q=44000111; rho=37922759; a0=1458647; b0=7; a1=242764; b1=-29
> ...
> # Sieving q=44256959; rho=22450463; a0=-643967; b0=-2; a1=1199552; b1=-65
> # Sieving q=44256959; rho=39638757; a0=768080; b0=19; a1=-1925061; b1=10
> # Sieving q=44256997; rho=14965671; a0=-640016; b0=-3; a1=2165351; b1=-59
> # Sieving q=44257007; rho=14413283; a0=1017158; b0=-3; a1=1190229; b1=40
> # Sieving q=44257013; rho=37630038; a0=231539; b0=20; a1=-2131812; b1=7
> # Sieving q=44257027; rho=8141904; a0=1046890; b0=11; a1=1453727; b1=-27
> # Sieving q=44257079; rho=41400590; a0=1409744; b0=15; a1=1446745; b1=-16
> # Sieving q=44257091; rho=29819662; a0=944804; b0=3; a1=1210173; b1=-43
> # Sieving q=44257091; rho=18016890; a0=1570268; b0=5; a1=371971; b1=-27
> # Sieving q=44257097; rho=35764170; a0=1323079; b0=-21; a1=1792462; b1=5
>
> However, the processes that are duplicated do seem to be making significantly slower progress than the processes that aren't. So, my questions are: Can I safely kill off one of the duplicate processes? If so, should I kill the newer one or the older one? Or should I kill off both processes and restart somehow; and if so, is there anything I need to do to ensure that I can make use of the progress made so far (about 20 hours' worth on a couple of different machines, and I'm paying for these cloud resources)?
>
> Many thanks!
>
> -Zach

_______________________________________________
Cado-nfs-discuss mailing list
Cado-nfs-discuss@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/cado-nfs-discuss
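As an aside, duplicated workers like the two las processes in the listing above can be flagged generically by grouping ps output by full command line. This is only a sketch, not a cadofactor feature; two identical `sleep` commands stand in for the duplicated las invocations:

```shell
#!/bin/sh
# Sketch: flag processes whose full command line occurs more than once,
# as the two byte-identical "las ... -out ..." command lines above did.
# Identical "sleep" commands stand in for the duplicated las workers.
sleep 737 & A=$!
sleep 737 & B=$!
sleep 739 & C=$!
sleep 1    # let the workers start
ps -o args= -p "$A,$B,$C" | sort | uniq -c |
    awk '$1 > 1 { sub(/^ *[0-9]+ /, ""); print "duplicated:", $0 }'
kill "$A" "$B" "$C"
```

On a Linux procps system this prints the repeated command line (`duplicated: sleep 737`), mirroring how the two identical las invocations would stand out.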
- [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- <Possible follow-up(s)>
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Zachary Harris, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
- Re: [Cado-nfs-discuss] Duplicate las sieve processes, Emmanuel Thomé, 12/19/2011
Archive powered by MHonArc 2.6.19+.