- From: Phil Pemberton <philpem@philpem.me.uk>
- To: Emmanuel Thomé <Emmanuel.Thome@inria.fr>
- Cc: cado-nfs-discuss@lists.gforge.inria.fr
- Subject: Re: [Cado-nfs-discuss] Too many failed work units
- Date: Mon, 13 Jan 2020 11:05:41 +0000 (GMT)
Hi Emmanuel,
Thanks for the help. I've already tried to use the wudb tool but can't get it to run:
philpem@syrys:~/cado-nfs/cado-nfs/scripts/cadofactor$ ./wudb.py -dbfile /tmp/cado.p1h1zthl/c150.db
Traceback (most recent call last):
  File "./wudb.py", line 1855, in <module>
    db_pool = WuAccess(dbname)
  File "./wudb.py", line 1132, in __init__
    raise ValueError("unexpected")
ValueError: unexpected
Exception ignored in: <bound method WuAccess.__del__ of <__main__.WuAccess object at 0x7f9a42c55710>>
Traceback (most recent call last):
  File "./wudb.py", line 1149, in __del__
    if self._ownconn:
AttributeError: 'WuAccess' object has no attribute '_ownconn'
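Since wudb.py bails out before it even opens the database, the only other way I can think of to look at the workunit states is to query the SQLite file directly, something along these lines (just a sketch; the table and column names -- workunits, wuid, status -- are my guess at the schema and need checking against the real one first):

import sqlite3

db = sqlite3.connect("/tmp/cado.p1h1zthl/c150.db")
cur = db.cursor()

# Show the actual table layout before trusting any of the names below.
for name, sql in cur.execute("SELECT name, sql FROM sqlite_master WHERE type='table'"):
    print(name)

# Per-status counts of workunits (read-only, so it cannot make things worse).
for status, count in cur.execute(
        "SELECT status, COUNT(*) FROM workunits GROUP BY status ORDER BY status"):
    print("status", status, ":", count, "workunit(s)")

db.close()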
I'm starting to suspect there's been some database corruption too, because I'm seeing this when I try to restart the server:
philpem@syrys:~/cado-nfs/cado-nfs$ ./cado-nfs.py /tmp/cado.p1h1zthl/c150.parameters_snapshot.9
(snip)
Info:Lattice Sieving: We want 43267462 relation(s)
Info:Lattice Sieving: Found 13405 relations in '/tmp/cado.p1h1zthl/c150.upload/c150.19100000-19110000.wcylm
34509044/43267462
Traceback (most recent call last):
  File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
  File "./scripts/cadofactor/cadotask.py", line 5914, in run
    last_status, last_task = self.run_next_task()
  File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
    return [task.run(), task.title]
  File "./scripts/cadofactor/cadotask.py", line 3241, in run
    self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
  File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
    self.wait()
  File "./scripts/cadofactor/cadotask.py", line 1627, in wait
    if not self.send_request(Request.GET_WU_RESULT):
  File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
    return super().send_request(request)
  File "./scripts/cadofactor/patterns.py", line 66, in send_request
    return self.__mediator.answer_request(request)
  File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
    result = self.request_map[key]()
  File "./scripts/cadofactor/wudb.py", line 1589, in send_result
    was_received = self.notifyObservers(message)
  File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
    if observer.updateObserver(message):
  File "./scripts/cadofactor/cadotask.py", line 3265, in updateObserver
    self.verification(message.get_wu_id(), ok, commit=True)
  File "./scripts/cadofactor/cadotask.py", line 1595, in verification
    assert self.get_number_outstanding_wus() >= 1
AssertionError
I've had a quick look in the database file with sqlitebrowser and it looks like there are a bunch of duplicate work units with names of the form "<workunit>#2" and "<workunit>#3" with a variety of status codes.
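If it helps, those duplicates should be easy to pull out with a direct query along these lines (same caveat as above: the table and column names are guesses that need checking against the actual schema):

import sqlite3

db = sqlite3.connect("/tmp/cado.p1h1zthl/c150.db")
# List every "<workunit>#N" style entry together with its status code.
for wuid, status in db.execute(
        "SELECT wuid, status FROM workunits WHERE wuid LIKE '%#%' ORDER BY wuid"):
    print(wuid, status)
db.close()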
Thanks
Phil.
On Mon, 13 Jan 2020, Emmanuel Thomé wrote:
Yes, it's definitely recoverable. At worst you'd have to pretend you're starting
over and reimport your relation set. There's documentation for that in
cado-nfs (see scripts/cadofactor/README).
Re-doing expired WUs is done automatically by cado-nfs.py. However,
expired WUs and error'd WUs are not the same thing. If a WU fails more
than [[maxwuerror]] times, it is not resubmitted, and not re-done.
If more WUs return with an error than the bound [[maxfailed]] allows,
then cado-nfs.py aborts (on purpose) and wants you to
go fix the issues you have.
(in your example below, quite probably a binary that was compiled for the
wrong architecture).
maxwuerror and maxfailed can be set in the parameter file as
tasks.maxwuerror and tasks.maxfailed respectively (defaults are 2 and 100
-- for a large computation you might want something larger, e.g. 5
and 2000).
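For example, the corresponding lines in the parameter file would look something like this (same "name = value" style as the rest of the file; the values are just the ones suggested above):

tasks.maxwuerror = 5
tasks.maxfailed = 2000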
One thing that should work in your situation is to raise maxfailed and
try to resume by restarting cado-nfs.py; it depends exactly on how you
started cado-nfs.py for this computation, but it should be just a matter
of fixing your parameter file.
(another option is to use the command line that the script outputs at the
beginning of its output log -- beware that this example command line
contains a parameter file itself, so you should use your brain to
elaborate on that one)
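Concretely, resuming from the latest snapshot parameter file with a raised bound would be something along the lines of the command below. I believe cado-nfs.py also accepts name=value overrides on the command line, but double-check that with your version, and use the highest-numbered snapshot you have:

./cado-nfs.py /tmp/cado.p1h1zthl/c150.parameters_snapshot.9 tasks.maxfailed=2000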
If your second message is about an error that popped up after doing just
what I describe above, then it's a bug that definitely needs to be
addressed.
Going further, it is also possible to fix your database and have it
resume gracefully, ignoring all the bad things that happened before. I do it
every now and then, but I have nothing completely integrated. Tell me if
that matters. I won't have time to integrate it into the cado-nfs main code,
but I'm happy to share examples of what can be done (including all the
blinking danger signs that come with it, "you're tinkering with a
database and that can bite" and so on).
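To give a rough idea of the kind of thing involved (very much a sketch, not integrated anywhere: the table and column names are assumptions, the right status value has to be taken from the constants defined in wudb.py, and the server must be stopped and the database backed up before touching anything):

import shutil, sqlite3

# Stop the server and keep a copy of the database before doing anything else.
shutil.copy("/tmp/cado.p1h1zthl/c150.db", "/tmp/cado.p1h1zthl/c150.db.backup")

db = sqlite3.connect("/tmp/cado.p1h1zthl/c150.db")
# Example of the kind of surgery meant above: re-mark the duplicated "#N"
# workunits so the server stops counting them against maxfailed. The correct
# status value has to come from wudb.py, which is why this stays commented out:
# db.execute("UPDATE workunits SET status = ? WHERE wuid LIKE '%#%'", (STATUS,))
# db.commit()
db.close()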
E.
On Mon, Jan 13, 2020 at 01:40:57AM +0000, Philip Pemberton wrote:
Hi,
I've been running CADO-NFS on a distributed cluster where several machines
have failed due to hardware issues. In summary, I tried to add more machines
during the CADO-NFS run, and sadly some of those nodes have failed.
I now get this error about having too many failed workunits.
Is there any way to discard these failed/incomplete work units and pick up
where things left off using the remaining nodes?
This has been a long run and I'd like to salvage as much as possible.
I've included the output from cado-nfs.py below.
Thanks,
Phil.
Info:Lattice Sieving: Starting
Info:Lattice Sieving: We want 43267462 relation(s)
Error:Lattice Sieving: Program run on mint.5425b52 failed with exit code 132
Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file /tmp/cado.p1h1zthl/c150.upload/c150_sieving_0000-43630000.urxwfbs6.stderr0):
Error:Lattice Sieving: # redoing q=43620013, rho=4777421 because 1s buckets are full
Error:Lattice Sieving: # Fullest level-1s bucket #173, wrote 8282/8272
Error:Lattice Sieving: Illegal instruction (core dumped)
Error:Lattice Sieving:
Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
Traceback (most recent call last):
  File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
  File "./scripts/cadofactor/cadotask.py", line 5914, in run
    last_status, last_task = self.run_next_task()
  File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
    return [task.run(), task.title]
  File "./scripts/cadofactor/cadotask.py", line 3241, in run
    self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
  File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
    self.wait()
  File "./scripts/cadofactor/cadotask.py", line 1627, in wait
    if not self.send_request(Request.GET_WU_RESULT):
  File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
    return super().send_request(request)
  File "./scripts/cadofactor/patterns.py", line 66, in send_request
    return self.__mediator.answer_request(request)
  File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
    result = self.request_map[key]()
  File "./scripts/cadofactor/wudb.py", line 1589, in send_result
    was_received = self.notifyObservers(message)
  File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
    if observer.updateObserver(message):
  File "./scripts/cadofactor/cadotask.py", line 3257, in updateObserver
    if self.handle_error_result(message):
  File "./scripts/cadofactor/cadotask.py", line 1701, in handle_error_result
    raise Exception("Too many failed work units")
Exception: Too many failed work units
--
Phil.
philpem@philpem.me.uk
http://www.philpem.me.uk
--
Phil.
philpem@philpem.me.uk
http://www.philpem.me.uk/