- From: Emmanuel Thomé <Emmanuel.Thome@inria.fr>
- To: Philip Pemberton <philpem@philpem.me.uk>
- Cc: cado-nfs-discuss@lists.gforge.inria.fr
- Subject: Re: [Cado-nfs-discuss] Too many failed work units
- Date: Mon, 13 Jan 2020 00:45:43 -0500
Yes, it's definitely recoverable. At worst, you'd have to pretend you're
starting over and reimport your relation set. There's documentation for
that in cado-nfs (see scripts/cadofactor/README).
Re-doing expired WUs is done automatically by cado-nfs.py. However,
expired WUs and errored WUs are not the same thing. If a WU fails more
than [[maxwuerror]] times, it is neither resubmitted nor re-done.
If more WUs return with an error than the bound [[maxfailed]] allows,
then cado-nfs.py aborts (on purpose) and expects you to go fix the
underlying issues.
(in your example below, quite probably a binary that was compiled for the
wrong architecture, given the "Illegal instruction" crash).
maxwuerror and maxfailed can be set in the parameter file as
tasks.maxwuerror and tasks.maxfailed respectively (defaults are 2 and 100
-- for a large computation you might want something larger, e.g. 5
and 2000).
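For concreteness, the corresponding lines in a parameter file would look
roughly like this (the values are just the "large computation" suggestions
above, not a universal recommendation):

    # give up on a single WU only after it has failed 5 times
    tasks.maxwuerror = 5
    # abort the whole run only after 2000 WUs have come back with errors
    tasks.maxfailed = 2000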
One thing that should work in your situation is to raise maxfailed and
try to resume by restarting cado-nfs.py; the exact steps depend on how you
started cado-nfs.py for this computation, but it should just be a matter
of editing your parameter file.
(another option is to use the command line that the script outputs at the
beginning of its output log -- beware that this example command line
contains a parameter file itself, so you should use your brain to
elaborate on that one)
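To give a rough idea (the file names below are made up, and the exact
syntax may differ slightly between cado-nfs versions): restarting with the
same parameter file and a raised bound would look something like

    ./cado-nfs.py ./my_c150.params tasks.maxfailed=2000

or, if your run left a parameters snapshot in its work directory,
something like

    ./cado-nfs.py /tmp/cado.p1h1zthl/c150.parameters_snapshot.<n>

where the exact snapshot file name is the one printed near the top of your
output log.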
If your second message is about an error that popped up after doing just
what I describe above, then it's a bug that definitely needs to be
addressed.
Going further, it is also possible to fix your database and have the
computation resume gracefully, ignoring all the bad things that happened
before. I do it every now and then, but I have nothing completely
integrated. Tell me if that matters. I won't have time to integrate this
into the cado-nfs main code, but I'm happy to share examples of what can
be done (including all the blinking danger signs that come with it,
"you're tinkering with a database and that can bite", and so on).
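To make that a little more concrete, here is the sort of (untested, purely
illustrative) script I have in mind -- it pokes directly at the sqlite
database that cado-nfs keeps in its work directory. The database file
name, the table name ("workunits") and the numeric status codes are
assumptions that you must check against scripts/cadofactor/wudb.py for
your version, and you should only ever work on a copy of the .db file:

    #!/usr/bin/env python3
    # Illustrative sketch only: inspect the cado-nfs workunit database and
    # put errored WUs back in the "available" pool so they get redone.
    # ASSUMPTIONS: the db file location, table name and status codes below
    # must be checked against scripts/cadofactor/wudb.py; back up the file
    # before touching it.
    import sqlite3

    DBFILE = "/tmp/cado.p1h1zthl/c150.db"   # assumed: <workdir>/<name>.db
    STATUS_AVAILABLE = 0                    # assumed numeric status codes
    STATUS_ERROR = 3

    conn = sqlite3.connect(DBFILE)
    cur = conn.cursor()

    # Look at how the workunits are distributed over the status values.
    for status, count in cur.execute(
            "SELECT status, COUNT(*) FROM workunits GROUP BY status"):
        print("status %d: %d workunits" % (status, count))

    # If you decide to, make the errored WUs available again.
    cur.execute("UPDATE workunits SET status = ? WHERE status = ?",
                (STATUS_AVAILABLE, STATUS_ERROR))
    print("reset %d workunits" % cur.rowcount)
    conn.commit()
    conn.close()

Run anything like this only with cado-nfs.py stopped, and keep in mind
that if the errored WUs failed for a real reason (like the
wrong-architecture binary above), they will simply fail again.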
E.
On Mon, Jan 13, 2020 at 01:40:57AM +0000, Philip Pemberton wrote:
> Hi,
>
> I've been running CADO-NFS on a distributed cluster where several machines
> have failed due to hardware issues. In summary, I've tried to add more nodes
> during the CADO-NFS run and sadly some of them have failed.
>
> I now get this error about having too many failed workunits.
>
> Is there any way to discard these failed/incomplete work units and pick up
> where things left off using the remaining nodes?
>
> This has been a long run and I'd like to salvage as much as possible.
>
> I've included the output from cado-nfs.py below.
>
> Thanks,
> Phil.
>
>
> Info:Lattice Sieving: Starting
> Info:Lattice Sieving: We want 43267462 relation(s)
> Error:Lattice Sieving: Program run on mint.5425b52 failed with exit code 132
> Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file /tmp/cado.p1h1zthl/c150.upload/c150_sieving_0000-43630000.urxwfbs6.stderr0):
> Error:Lattice Sieving: # redoing q=43620013, rho=4777421 because 1s buckets are full
> Error:Lattice Sieving: # Fullest level-1s bucket #173, wrote 8282/8272
> Error:Lattice Sieving: Illegal instruction (core dumped)
> Error:Lattice Sieving:
> Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
> Traceback (most recent call last):
> File "./cado-nfs.py", line 122, in <module>
> factors = factorjob.run()
> File "./scripts/cadofactor/cadotask.py", line 5914, in run
> last_status, last_task = self.run_next_task()
> File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
> return [task.run(), task.title]
> File "./scripts/cadofactor/cadotask.py", line 3241, in run
> self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
> File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
> self.wait()
> File "./scripts/cadofactor/cadotask.py", line 1627, in wait
> if not self.send_request(Request.GET_WU_RESULT):
> File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
> return super().send_request(request)
> File "./scripts/cadofactor/patterns.py", line 66, in send_request
> return self.__mediator.answer_request(request)
> File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
> result = self.request_map[key]()
> File "./scripts/cadofactor/wudb.py", line 1589, in send_result
> was_received = self.notifyObservers(message)
> File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
> if observer.updateObserver(message):
> File "./scripts/cadofactor/cadotask.py", line 3257, in updateObserver
> if self.handle_error_result(message):
> File "./scripts/cadofactor/cadotask.py", line 1701, in handle_error_result
> raise Exception("Too many failed work units")
> Exception: Too many failed work units
>
>
> --
> Phil.
> philpem@philpem.me.uk
> http://www.philpem.me.uk