Subject: Discussion related to cado-nfs
- From: Sudarshan Muralidhar <smural@seas.upenn.edu>
- To: Sudarshan Muralidhar <smural@seas.upenn.edu>, paul zimmermann <Paul.Zimmermann@inria.fr>
- Cc: cado-nfs-discuss@lists.gforge.inria.fr
- Subject: Re: [Cado-nfs-discuss] CADO-NFS Issue - potential bug?
- Date: Wed, 10 Dec 2014 11:32:14 +0000
- List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss/>
- List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>
Thanks, that fixed it.
One more issue that I've been running into for a few days:
Sometimes the Linear Algebra step fails with a long error message (reproduced below). In most of these cases, simply restarting the job (with the same parameters, so that the old caches for polynomial selection and pairs are retained) seems to solve the issue. Any ideas what might be causing this?
Error message:
Info:Linear Algebra: Starting
Warning:Command: Process with PID 39735 finished with return code 1
Error:Linear Algebra: Program run on server failed with exit code 1
Error:Linear Algebra: Command line was: /home/cado-nfs/build/linalg/bwc/bwc.pl :complete 'thr=4' 'mn=64' 'nullspace=left' 'interval=1000' 'matrix=/job14/c120.merge.sparse.bin' 'wdir=/job14/c120.bwc' 'interleaving=0' 'shuffled_product=1' > /job14/c120.bwc.bwc.stdout.1 2> /job14/c120.bwc.bwc.stderr.1
Error:Linear Algebra: Stderr output follows (stored in file /job14/c120.bwc.bwc.stderr.1):
b'readlink(/usr/bin//mpiexec)->/etc/alternatives/mpiexec
readlink(/etc/alternatives/mpiexec)->/usr/bin/mpiexec.openmpi
Auto-detecting openmpi based on alternatives
Using openmpi-1.4.3, MPI_BINDIR=/usr/bin/
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 mkdir -p /job14/c120.bwc
No bfile found, we need a new one
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/mf_bal --shuffled-product mfile=/job14/c120.merge.sparse.bin out=/job14/c120.bwc/ 2 2
Using 0 padding cols to obtain 2 blocks of 2*146661=293322 cols
/job14/c120.merge.sparse.bin: 586644 rows 586451 cols (193 extra rows) weight 58664427
read /job14/c120.merge.sparse.cw.bin in 0.0 s (4343.9 MB / s)
586451 cols ; avg 100.0 sdev 1299.0 [scan time 0.0 s]
sort time 0.1 s
heap fill time 0.0
Writing balancing data to /job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/dispatch nullspace=left wdir=/job14/c120.bwc thr=2x2 interval=1000 mpi=1x1 mn=64 interleaving=0 prime=2 matrix=/job14/c120.merge.sparse.bin balancing=/job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin ys=0..64 export_cachelist=/job14/c120.bwc/cachelist.4e7c3101.txt
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/dispatch nullspace=left wdir=/job14/c120.bwc thr=2x2 interval=1000 mpi=1x1 mn=64 interleaving=0 prime=2 matrix=/job14/c120.merge.sparse.bin balancing=/job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin ys=0..64 sequential_cache_build=1 sanity_check_vector=H1
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/prep nullspace=left wdir=/job14/c120.bwc thr=2x2 interval=1000 mpi=1x1 mn=64 interleaving=0 prime=2 matrix=/job14/c120.merge.sparse.bin balancing=/job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/split wdir=/job14/c120.bwc mn=64 prime=2 splits=0,64 --ifile Y.0 --ofile-fmt V%u-%u.0
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/secure nullspace=left wdir=/job14/c120.bwc thr=2x2 interval=1000 mpi=1x1 mn=64 interleaving=0 prime=2 matrix=/job14/c120.merge.sparse.bin balancing=/job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/krylov nullspace=left wdir=/job14/c120.bwc thr=2x2 interval=1000 mpi=1x1 mn=64 interleaving=0 prime=2 matrix=/job14/c120.merge.sparse.bin balancing=/job14/c120.bwc/c120.merge.sparse.2x2.4e7c3101.bin ys=0..64 start=0
Target iteration is 18348 ; going to 19000
#############################################################################
/usr/bin//mpiexec --mca plm_rsh_agent ssh -n 1 /home/cado-nfs/build/linalg/bwc/acollect nullspace=left wdir=/job14/c120.bwc interval=1000 mn=64 prime=2 --remove-old
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 39780 on
node master exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
/usr/bin//mpiexec: exited with status 1
aborted on subprogram error at /home/cado-nfs/build/linalg/bwc/bwc.pl line 500, <F> line 2.
\t...propagated at /home/cado-nfs/build/linalg/bwc/bwc.pl line 1277, <F> line 2.
'
Traceback (most recent call last):
File "/home/cado-nfs/scripts/cadofactor/cadofactor.py", line 81, in <module>
factors = factorjob.run()
File "/home/cado-nfs/scripts/cadofactor/cadotask.py", line 4886, in run
last_status, last_task = self.run_next_task()
File "/home/cado-nfs/scripts/cadofactor/cadotask.py", line 4954, in run_next_task
return [task.run(), task.title]
File "/home/cado-nfs/scripts/cadofactor/cadotask.py", line 3873, in run
raise Exception("Program failed")
Exception: Program failed
FAILED ; data left in /tmp/cado.3CTQWhAbH7
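In the stderr output above, the last subprogram launched before mpiexec exited with status 1 is acollect. A minimal sketch for pulling that name out of a saved stderr file, assuming the layout seen in this log (a line of '#' characters followed by the mpiexec command line, with the program path right after "-n 1"). The helper name last_subprogram is hypothetical and not part of CADO-NFS:

```python
import os
import re

# A separator line as it appears in the bwc.pl stderr above: only '#' characters.
SEPARATOR = re.compile(r"^#{10,}\s*$")


def last_subprogram(stderr_text):
    """Return the basename of the last program launched through mpiexec.

    Assumes each launch is logged as a '#' separator line followed by
    '<mpiexec> ... -n <count> <program> <args>'. Returns None if no such
    line is found.
    """
    last = None
    previous = ""
    for line in stderr_text.splitlines():
        tokens = line.split()
        if (SEPARATOR.match(previous) and tokens
                and tokens[0].endswith("mpiexec") and "-n" in tokens):
            i = tokens.index("-n")
            # tokens[i + 1] is the process count, tokens[i + 2] the program.
            if i + 2 < len(tokens):
                last = os.path.basename(tokens[i + 2])
        previous = line
    return last
```

Applied to the contents of /job14/c120.bwc.bwc.stderr.1 (read with open(...).read()), this should report which bwc step was running when the failure occurred. It only identifies the step; it does not explain the cause.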
Thank you,
Sudarshan Muralidhar
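The restart workaround described in this message (rerunning the job with the same parameters) can be sketched as a small wrapper. This is only an illustration under stated assumptions: run_with_retries is a hypothetical helper, the command passed to it is whatever cadofactor.py command line was used for the original run, and retrying hides the intermittent failure rather than fixing it.

```python
import subprocess
import sys


def run_with_retries(command, max_attempts=3, runner=subprocess.call):
    """Run `command` until it exits with code 0, at most `max_attempts` times.

    `command` is a list of arguments, passed unchanged on every attempt so
    that each restart uses the same parameters. `runner` defaults to
    subprocess.call and can be replaced for testing. Returns the exit code
    of the last attempt.
    """
    code = 1
    for attempt in range(1, max_attempts + 1):
        code = runner(command)
        if code == 0:
            break
        print("attempt %d/%d failed with exit code %d"
              % (attempt, max_attempts, code), file=sys.stderr)
    return code
```

Keeping the attempt limit small is deliberate: if the step fails every time, the stderr file from the failing run is more useful than further restarts.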
On Wed Dec 10 2014 at 5:21:11 AM paul zimmermann <Paul.Zimmermann@inria.fr> wrote:
Dear Sudarshan,
> From: Sudarshan Muralidhar <smural@seas.upenn.edu>
> Date: Wed, 10 Dec 2014 10:13:05 +0000
>
> Hello,
>
> I'm using CADO-NFS to update Nadia Heninger's Factoring as a Service
> Project.
>
> When using CADO in distributed mode to factor numbers, I have no issue for
> numbers below ~100 digits. However, when I try larger digits (RSA-120, for
> example), I get failed workunits, eventually leading to an aborted job.
> My error message is listed below - does it lend any insight to something I
> may be doing wrong?
>
> Thanks,
> Sudarshan Muralidhar
>
> Error:Polynomial Selection (size optimized): Program run on node015 failed
> with exit code 134
> Error:Polynomial Selection (size optimized): Stderr output follows (stored
> in file /job10/c120.upload/c120_polyselect1_220000-224000.u2mm_t.stderr0):
> b'# Warning: this code is experimental, and not thread-safe.
> code BUG() : condition newsize == size failed in insert_hash at
> /home/cado-nfs/polyselect/auxiliary.c:3073 -- Abort
> Aborted (core dumped)
> Error:Polynomial Selection (size optimized): Exceeded maximum number of
> failed workunits, maxfailed=100
This is a known issue: the new polynomial selection code is not thread-safe.
You should add tasks.polyselect.threads=1 to the cadofactor.py command line.
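As a sketch of the suggestion above, the following builds a cadofactor.py argument list with tasks.polyselect.threads=1 appended. The helper name and the parameter file name params.c120 are hypothetical placeholders; only the cadofactor.py path (taken from the traceback earlier in the thread) and the tasks.polyselect.threads=1 setting come from this discussion.

```python
def with_single_threaded_polyselect(command):
    """Return a copy of `command` that sets tasks.polyselect.threads=1.

    Any existing tasks.polyselect.threads=... argument is dropped first so
    the setting appears exactly once.
    """
    key = "tasks.polyselect.threads="
    kept = [arg for arg in command if not arg.startswith(key)]
    return kept + [key + "1"]


# Hypothetical invocation: "params.c120" stands in for the real parameter file.
base_command = ["/home/cado-nfs/scripts/cadofactor/cadofactor.py", "params.c120"]
print(" ".join(with_single_threaded_polyselect(base_command)))
```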
Best regards,
Paul Zimmermann