cado-nfs - Re: [Cado-nfs-discuss] Duplicate las sieve processes

From: Zachary Harris <zacharyharris@hotmail.com>
To: cado-nfs-discuss@lists.gforge.inria.fr
Subject: Re: [Cado-nfs-discuss] Duplicate las sieve processes
Date: Mon, 19 Dec 2011 14:09:28 -0500
List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss>
List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

OK, so I basically ended up doing what it says to do in mach_desc in a
situation where "you cannot wait for the job to be finished".
1) I set cores=0 in mach_desc for the machines with duplicate processes.
2) Killed off ALL the duplicate processes on all hosts that had them.
3) Restarted cadofactor.
   a) It noted the dead processes.
   b) It rsynced and read the partially completed relations files, noting
      that the data was truncated at such and such a point.
   c) I now had a couple of hosts cleanly sitting idle with their partial
      information taken off them.
4) I put back cores=8 on the hosts that were now sitting idle.
5) Restarted cadofactor.
   a) It sent out jobs to the now/again available cores to fill in the
      gaps of the remainder of the partially completed sieve sections (e.g.
      32250000-33000000, 33250000-34000000, 34250000-35000000, etc.).
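
To illustrate the gap filling in step 5, here is a minimal Python sketch of
how the leftover part of each interrupted range could be computed. It is not
cadofactor's actual code: the function name, the data layout, and the
assumption that each job originally covered a 1000000-wide range are all
hypothetical, chosen only so that the result matches the example gaps above.

```python
def remaining_ranges(assigned, truncated_at):
    """Return the unsieved tail of each assigned range.

    assigned:     list of (start, end) ranges handed out to workers
                  (hypothetical bookkeeping, not cadofactor's own format)
    truncated_at: dict mapping a range to the point where its partial
                  relations file stopped; ranges absent from the dict are
                  treated as having produced nothing usable
    """
    gaps = []
    for start, end in assigned:
        cut = truncated_at.get((start, end), start)
        if cut < end:  # a fully completed range leaves no gap
            gaps.append((cut, end))
    return gaps


# Hypothetical numbers: three 1000000-wide jobs, each killed a quarter
# of the way through.
assigned = [
    (32000000, 33000000),
    (33000000, 34000000),
    (34000000, 35000000),
]
truncated_at = {
    (32000000, 33000000): 32250000,
    (33000000, 34000000): 33250000,
    (34000000, 35000000): 34250000,
}

for lo, hi in remaining_ranges(assigned, truncated_at):
    print(f"{lo}-{hi}")
```

Each printed range is the part that would have to be handed out again once
the partial relations have been collected from the killed jobs.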

In summary, I'd say it recovered quite well. So again, while I know
that with software projects there is always more that could be done, I
want to encourage you all that I'm impressed with cadofactor's
robustness. I consider that to be a top desired feature for
computationally demanding applications like factorization. Nobody wants
to spend hours, days, weeks, or more of computing time, only to have
something go wrong and lose all that progress. Cado's checkpointing,
fault tolerance, partial information processing ability, and automated
recovery make this (in my opinion) a professionally done product. Could
some things be even more user friendly, even more stable, faster? I'm
sure they could, but based on my experience so far, my primary feedback
would be, "Good job with version 1.1!"

If things go well and I get through this (looking good so far), perhaps
I'll be able to provide you with some statistics at the end, if that
would be helpful.

-Zach
