cado-nfs - Re: [Cado-nfs-discuss] Too many failed work units

Subject: Discussion related to cado-nfs

From: Philip Pemberton <philpem@philpem.me.uk>
To: cado-nfs-discuss@lists.gforge.inria.fr
Subject: Re: [Cado-nfs-discuss] Too many failed work units
Date: Mon, 13 Jan 2020 03:50:43 +0000
List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss/>
List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

I had a read through the documentation; now it starts and quickly fails :(

Anyone know if this is recoverable? I don't mind rerunning a few WUs.

Info:Generate Factor Base: Starting
Info:Generate Factor Base: Total cpu/real time for makefb: 44.99/17.1804
Info:Generate Free Relations: Starting
Info:Generate Free Relations: Total cpu/real time for freerel: 687.31/206.264
Info:Lattice Sieving: Starting
Info:Lattice Sieving: We want 43267462 relation(s)
Info:Lattice Sieving: Found 13405 relations in '/tmp/cado.p1h1zthl/c150.upload/c150.19100000-19110000.wcylm29r.gz', total is now 34509044/43267462
Traceback (most recent call last):
File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
File "./scripts/cadofactor/cadotask.py", line 5914, in run
    last_status, last_task = self.run_next_task()
File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
    return [task.run(), task.title]
File "./scripts/cadofactor/cadotask.py", line 3241, in run
    self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
    self.wait()
File "./scripts/cadofactor/cadotask.py", line 1627, in wait
    if not self.send_request(Request.GET_WU_RESULT):
File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
    return super().send_request(request)
File "./scripts/cadofactor/patterns.py", line 66, in send_request
    return self.__mediator.answer_request(request)
File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
    result = self.request_map[key]()
File "./scripts/cadofactor/wudb.py", line 1589, in send_result
    was_received = self.notifyObservers(message)
File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
    if observer.updateObserver(message):
File "./scripts/cadofactor/cadotask.py", line 3265, in updateObserver
    self.verification(message.get_wu_id(), ok, commit=True)
File "./scripts/cadofactor/cadotask.py", line 1595, in verification
    assert self.get_number_outstanding_wus() >= 1
AssertionError
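
For a sense of how much sieving has already been done, the "total is now" figure in the log above can be turned into a percentage. This is only a rough sketch: the regular expression and the function name `sieving_progress` are my own and are not part of CADO-NFS, and the log line is assumed to be on a single line.

```python
import re

# Log line copied from the output above, joined onto one line.
LOG_LINE = (
    "Info:Lattice Sieving: Found 13405 relations in "
    "'/tmp/cado.p1h1zthl/c150.upload/c150.19100000-19110000.wcylm29r.gz', "
    "total is now 34509044/43267462"
)

# Hypothetical pattern for the "found/wanted" pair; not taken from CADO-NFS.
PROGRESS_RE = re.compile(r"total is now (\d+)/(\d+)")


def sieving_progress(line):
    """Return (found, wanted, remaining, percent), or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    found, wanted = int(match.group(1)), int(match.group(2))
    return found, wanted, wanted - found, 100.0 * found / wanted


found, wanted, remaining, percent = sieving_progress(LOG_LINE)
print(f"{found}/{wanted} relations ({percent:.1f}%), {remaining} still needed")
# prints: 34509044/43267462 relations (79.8%), 8758418 still needed
```

So roughly four fifths of the wanted relations were already collected when the run stopped.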

On 13/01/2020 01:40, Philip Pemberton wrote:

Hi,

I've been running CADO-NFS on a distributed cluster where several machines have failed due to hardware issues. In short, I tried adding more machines during the CADO-NFS run, and sadly some nodes have failed.

I now get this error about having too many failed workunits.

Is there any way to discard these failed/incomplete work units and pick up where things left off using the remaining nodes?

This has been a long run and I'd like to salvage as much as possible.

I've included the output from cado-nfs.py below.

Thanks,
Phil.

Info:Lattice Sieving: Starting
Info:Lattice Sieving: We want 43267462 relation(s)
Error:Lattice Sieving: Program run on mint.5425b52 failed with exit code 132
Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file /tmp/cado.p1h1zthl/c150.upload/c150_sieving_
0000-43630000.urxwfbs6.stderr0):
Error:Lattice Sieving: # redoing q=43620013, rho=4777421 because 1s buckets are full
Error:Lattice Sieving: # Fullest level-1s bucket #173, wrote 8282/8272
Error:Lattice Sieving: Illegal instruction (core dumped)
Error:Lattice Sieving:
Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
Traceback (most recent call last):
File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
File "./scripts/cadofactor/cadotask.py", line 5914, in run
    last_status, last_task = self.run_next_task()
File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
    return [task.run(), task.title]
File "./scripts/cadofactor/cadotask.py", line 3241, in run
    self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
    self.wait()
File "./scripts/cadofactor/cadotask.py", line 1627, in wait
    if not self.send_request(Request.GET_WU_RESULT):
File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
    return super().send_request(request)
File "./scripts/cadofactor/patterns.py", line 66, in send_request
    return self.__mediator.answer_request(request)
File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
    result = self.request_map[key]()
File "./scripts/cadofactor/wudb.py", line 1589, in send_result
    was_received = self.notifyObservers(message)
File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
    if observer.updateObserver(message):
File "./scripts/cadofactor/cadotask.py", line 3257, in updateObserver
    if self.handle_error_result(message):
File "./scripts/cadofactor/cadotask.py", line 1701, in handle_error_result
    raise Exception("Too many failed work units")
Exception: Too many failed work units
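
A side note on "exit code 132" in the log above: by the usual shell convention, an exit status above 128 means the process was killed by signal number (status - 128). The sketch below decodes that, assuming the reported code follows this convention; `describe_exit_code` is a name made up for illustration and is not a CADO-NFS function.

```python
import signal


def describe_exit_code(code):
    """Describe an exit status, assuming the shell's 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(132))
# prints: killed by SIGILL
```

That matches the "Illegal instruction (core dumped)" line in the stderr output. SIGILL is often seen when a binary built with instruction-set extensions for one CPU runs on a machine that lacks them, though nothing in this thread confirms that this is the cause here.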

--
Phil.
philpem@philpem.me.uk
http://www.philpem.me.uk

[Cado-nfs-discuss] Too many failed work units, Philip Pemberton, 01/13/2020
- Re: [Cado-nfs-discuss] Too many failed work units, Philip Pemberton, 01/13/2020
- Re: [Cado-nfs-discuss] Too many failed work units, Emmanuel Thomé, 01/13/2020
  - Re: [Cado-nfs-discuss] Too many failed work units, Phil Pemberton, 01/13/2020
    - Re: [Cado-nfs-discuss] Too many failed work units, paul zimmermann, 01/13/2020
