Re: [Cado-nfs-discuss] interleaving queries


  • From: Emmanuel Thomé <emmanuel.thome@gmail.com>
  • To: Junyi <9jhzguy@gmail.com>
  • Cc: "cado-nfs-discuss@lists.gforge.inria.fr" <cado-nfs-discuss@lists.gforge.inria.fr>
  • Subject: Re: [Cado-nfs-discuss] interleaving queries
  • Date: Thu, 11 Apr 2013 10:32:47 +0200
  • List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss>
  • List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

Please use the git version. IIRC, at least the ``last split do not
coincide'' message rings a bell w.r.t. a bug I fixed a few months ago.
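
Something like the following should get you an up-to-date tree (the
repository URL here is from memory, so please double-check it on the
project page before use):

    git clone git://scm.gforge.inria.fr/cado-nfs/cado-nfs.git
    cd cado-nfs
    make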

Your "Segmentation Fault in the dispatch stage" calls for more
information. The tail of the log file would help.

Be aware that mpi=8x8 thr=2x2 hosts=n001,n002 means that you are going
to start 32 jobs with 4 threads each on each of the two machines. I
don't know your hardware specifics, but I imagine that you may be
boldly overloading your CPUs. The OS scheduler is then hard at work,
and timings don't really mean much in the end.
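
To spell out the arithmetic (assuming the two-node, 64-core testbed
you mention below, i.e. 32 cores per node):

    mpi=8x8          -> 8*8 = 64 MPI jobs in total
    thr=2x2          -> 2*2 = 4 threads per job
    hosts=n001,n002  -> 64 jobs / 2 nodes = 32 jobs per node
                     -> 32 jobs * 4 threads = 128 threads on 32 cores

That is a 4x oversubscription of each node.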

Tell me what you get with the git version. Don't hesitate to post
(compressed) log files; this does help in getting an idea of what
happens.

Best,

E.



On Thu, Apr 11, 2013 at 9:50 AM, Junyi <9jhzguy@gmail.com> wrote:
> Apologies for the terribly unhelpful debugging request previously, and
> I appreciate the really speedy response; I just wanted to make sure I
> got the syntax right.
>
> I started with something like ./bwc.pl :complete matrix=rsa640.bin
> nullspace=left wdir=lustre/bwc.split thr=2x2 mpi=8x8 mn=64 ys=0..64
> interleaving=0 hosts=n001,n002 interval=100. This ran without issue.
> However, the COMMS time was taking 3x longer than the CPU time (about
> 4 s/iter, still on 10GbE unfortunately), so I wanted to enable
> interleaving.
>
> I cleaned the wdir and ran the same command, but with interleaving=1.
> This resulted in a Segmentation Fault in the dispatch stage. The README
> indicates that I should change ys such that each krylov instance gets a
> 64-bit width, and use the splits parameter accordingly.
>
> I then ran it with interleaving=1, ys=0..128, splits=0,64,128, and it
> ran well until the split stage, which indicated that "last split do not
> coincide with configured n".
>
> Checking the source, I understood it as having to set n=128, while leaving
> m=64. This reaches the krylov stage, but throws an "[err]
> event_queue_remove: 0x26b2bd0 (fd 11) not on queue 8" several times before
> exiting.
>
> Following your suggested syntax for interleaving, I ran it as such:
> bwc.pl :complete matrix=rsa640.bin nullspace=left wdir=lustre/bwc.split
> thr=2x2 mpi=8x8 mn=128 ys=0..128 interleaving=0 hosts=n001,n002
> interval=100. However, this meets an untimely death at the secure
> stage, with
> message: abase_u64kl_set_ui_at: Assertion 'k < 64' failed.
>
> ---
>
> OpenMPI is 1.5.4; the version of cado-nfs used is 1.1, not the latest
> via git. The non-interleaved krylov currently appears to be churning
> along happily on a two-node, 64-core infiniband testbed (2.04 s/iter,
> N = 1197000). The size of the matrix appears to be around 38M x 38M,
> based on merge.log.
>
>
> On Wed, Apr 10, 2013 at 8:43 PM, Emmanuel Thomé <emmanuel.thome@gmail.com>
> wrote:
>>
>> FYI, here is a command line which successfully computes a kernel using
>> bwc's interleaving:
>>
>> ./linalg/bwc/../.././build/fondue.mpi/linalg/bwc/bwc.pl :complete
>> matrix=/local/rsa768/mats/rsa100.bin nullspace=left wdir=/tmp/bwc
>> thr=2x2 mpi=2x2 mn=128 ys=0..128 interleaving=1
>> hosts=fondue,raclette,tartiflette,berthoud mpi_extra_args="--mca
>> btl_tcp_if_exclude lo,virbr0" interval=200
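>>
>> (For what it's worth: btl_tcp_if_exclude tells OpenMPI's TCP
>> transport which network interfaces to ignore; here lo is the loopback
>> and virbr0 a libvirt bridge on these machines. On another cluster the
>> interface names to exclude would of course differ.)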
>>
>> (MPI is OpenMPI 1.6.1 here.)
>>
>> E.
>>
>> On Wed, Apr 10, 2013 at 1:07 PM, Emmanuel Thomé
>> <emmanuel.thome@gmail.com> wrote:
>> > Can you expand on "crashes out"?
>> >
>> > E.
>> >
>> > On Wed, Apr 10, 2013 at 12:35 PM, Junyi <a0032547@nus.edu.sg> wrote:
>> >> I'm running the cado-1.1-released version on a cluster, and have been
>> >> trying
>> >> to enable interleaving at the krylov stage to mitigate the high comms
>> >> overhead.
>> >>
>> >> Interleaving = 0, mn = 64, ys=0..64, splits=0,64 currently runs
>> >> well, but
>> >> Interleaving = 1, m = 64, n = 128, ys=0..128, splits=0,64,128
>> >> crashes out.
>> >>
>> >> Have I misused the parameters? Please assist, thanks!
>> >>
>>
>
>




