cado-nfs - [Cado-nfs-discuss] Too many failed work units

Subject: Discussion related to cado-nfs

From: Philip Pemberton <philpem@philpem.me.uk>
To: cado-nfs-discuss@lists.gforge.inria.fr
Subject: [Cado-nfs-discuss] Too many failed work units
Date: Mon, 13 Jan 2020 01:40:57 +0000
List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss/>
List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

Hi,

I've been running CADO-NFS on a distributed cluster where several machines have failed due to hardware issues. I tried adding more machines during the run, and sadly some nodes have failed.

I now get this error about having too many failed workunits.

Is there any way to discard these failed/incomplete work units and pick up where things left off using the remaining nodes?

This has been a long run and I'd like to salvage as much as possible.

I've included the output from cado-nfs.py below.

Thanks,
Phil.

Info:Lattice Sieving: Starting
Info:Lattice Sieving: We want 43267462 relation(s)
Error:Lattice Sieving: Program run on mint.5425b52 failed with exit code 132
Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file /tmp/cado.p1h1zthl/c150.upload/c150_sieving_
0000-43630000.urxwfbs6.stderr0):
Error:Lattice Sieving: # redoing q=43620013, rho=4777421 because 1s buckets are full
Error:Lattice Sieving: # Fullest level-1s bucket #173, wrote 8282/8272
Error:Lattice Sieving: Illegal instruction (core dumped)
Error:Lattice Sieving:
Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
Traceback (most recent call last):
  File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
  File "./scripts/cadofactor/cadotask.py", line 5914, in run
    last_status, last_task = self.run_next_task()
  File "./scripts/cadofactor/cadotask.py", line 6006, in run_next_task
    return [task.run(), task.title]
  File "./scripts/cadofactor/cadotask.py", line 3241, in run
    self.submit_command(p, "%d-%d" % (q0, q1), commit=False)
  File "./scripts/cadofactor/cadotask.py", line 1551, in submit_command
    self.wait()
  File "./scripts/cadofactor/cadotask.py", line 1627, in wait
    if not self.send_request(Request.GET_WU_RESULT):
  File "./scripts/cadofactor/cadotask.py", line 1412, in send_request
    return super().send_request(request)
  File "./scripts/cadofactor/patterns.py", line 66, in send_request
    return self.__mediator.answer_request(request)
  File "./scripts/cadofactor/cadotask.py", line 6074, in answer_request
    result = self.request_map[key]()
  File "./scripts/cadofactor/wudb.py", line 1589, in send_result
    was_received = self.notifyObservers(message)
  File "./scripts/cadofactor/patterns.py", line 32, in notifyObservers
    if observer.updateObserver(message):
  File "./scripts/cadofactor/cadotask.py", line 3257, in updateObserver
    if self.handle_error_result(message):
  File "./scripts/cadofactor/cadotask.py", line 1701, in handle_error_result
    raise Exception("Too many failed work units")
Exception: Too many failed work units
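
A side note on the log above: exit code 132 fits the common shell convention of 128 plus a signal number, and signal 4 is SIGILL, which agrees with the "Illegal instruction (core dumped)" line. The following is only a minimal sketch for seeing which clients fail and with which signal. It assumes that convention applies here and that every failure line has the same shape as the "Program run on ... failed with exit code ..." line above. The helper names `describe_exit_code` and `tally_failures` are hypothetical and are not part of CADO-NFS.

```python
import re
import signal
from collections import Counter

# Pattern modelled on the single failure line in the log above (an assumption:
# other failure lines may be worded differently).
FAILED_RE = re.compile(r"Program run on (\S+) failed with exit code (\d+)")


def describe_exit_code(code):
    """Interpret a shell-style exit status; values above 128 are read as 128 + signal."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return "unknown signal %d" % (code - 128)
    return "exit status %d" % code


def tally_failures(lines):
    """Count failures per (client, decoded exit code) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            client, code = match.group(1), int(match.group(2))
            counts[(client, describe_exit_code(code))] += 1
    return counts


sample = [
    "Error:Lattice Sieving: Program run on mint.5425b52 failed with exit code 132",
]
for (client, reason), n in tally_failures(sample).items():
    print(client, reason, n)
```

Fed the full log instead of `sample`, this would show whether the failures come from one machine or several, which may help decide which nodes to drop before resuming.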

--
Phil.
philpem@philpem.me.uk
http://www.philpem.me.uk

