cado-nfs - Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark

Subject: Discussion related to cado-nfs

From: Emmanuel Thomé <Emmanuel.Thome@inria.fr>
To: ƦOB COASTN <robertpancoast77@gmail.com>
Cc: cado-nfs-discuss@lists.gforge.inria.fr
Subject: Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark
Date: Thu, 7 Jan 2016 17:22:31 +0100
List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss/>
List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

On Thu, Jan 07, 2016 at 11:00:28AM -0500, ƦOB COASTN wrote:
> Hello,
>
> Regarding efficiency, keep in mind that 512-bit SIMD intrinsics are
> available for the MIC, but the binary encoding differs from AVX-512.

That pays little if the code doesn't use them...

Using the Xeon Phi in a better way, taking advantage of the hardware ops
it offers, is a nice project, but not a simple task. It is not something
on our roadmap right now.

> Also there exists a 512-bit fused multiply-add operation available for
> exploitation.

Hardly worth anything for what we're doing. (performance much more driven
by integer arithmetic, as well as memory and interconnect bandwidth).

> The logs show CPU time is much higher, but there are
> many more parallel threads per compute node. In my opinion, per
> thread performance is worse; but more importantly, the total system
> schedule for a single job is improved. I predict with the launch of
> Knights Landing comes a significant drop in cost of co-processor
> systems. Colfax International presents a strong use case scenario for
> 1x performance on Xeon PHI. "...ratio of performance to consumed
> electrical power and the ratio of performance to purchasing system
> cost, both under the assumption of linear parallel scalability of the
> application." I think it's safe to assume that future cloud systems
> will operate with co-processor nodes. Nadia Heninger presents an
> interesting point in her paper: HPC hardware can essentially be leased
> by the hour.

Depends on what you're talking about. Amazon won't let me rent a (say)
128-node, 32-cores-per-node InfiniBand EDR full-bandwidth cluster, you
see. Not for an affordable price, at least (and as it turns out, if I
mean to use such big iron, it's not exactly for a few hours).

> If MIC accelerated cado-nfs scales linearly,

This sort of assumption leads me to temper any enthusiasm in the
following statements pretty abruptly. There is absolutely no reason to
hope for such a linear speedup, because the bottleneck is more I/O than
crunching power! Offered twice as many cores or half the RAM (off-cache)
load latency, I'll take the latter.

MICs are not as bad as GPUs in this regard, but should still be considered
memory-deprived, unless I'm wholly mistaken.
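
To illustrate the point about memory versus cores, here is a toy
roofline-style sketch. It is not taken from cado-nfs: the function name
and all numbers are made up for illustration, and it lumps every memory
cost (latency and bandwidth alike) into a single rate. Under that
assumption, a step takes as long as the slower of its compute part and
its memory part.

```python
def step_time(work_ops, ops_per_sec, bytes_moved, bytes_per_sec):
    """Toy estimate: a step is limited by the slower of compute and memory."""
    compute_time = work_ops / ops_per_sec
    memory_time = bytes_moved / bytes_per_sec
    return max(compute_time, memory_time)

# Hypothetical memory-bound workload (illustrative numbers only).
WORK_OPS = 1e9
BYTES_MOVED = 1e9

baseline = step_time(WORK_OPS, 1e10, BYTES_MOVED, 1e9)
twice_the_cores = step_time(WORK_OPS, 2e10, BYTES_MOVED, 1e9)
twice_the_memory_rate = step_time(WORK_OPS, 1e10, BYTES_MOVED, 2e9)

print("baseline:             ", baseline)
print("twice the cores:      ", twice_the_cores)
print("twice the memory rate:", twice_the_memory_rate)
```

In this model, doubling the compute rate leaves a memory-bound step
unchanged, while doubling the effective memory rate halves it. Real
behaviour depends on caches, access patterns, and the interconnect, none
of which this sketch captures.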

> what
> happens with cloud systems that include the MIC co-processors? The
> current state of the software is highly parallel, and there are many
> factors ready for a DoE. I want to get some allocations on the Chameleon
> Cloud to test this hypothesis. However, the process bottleneck at the
> linear algebra stage must first be resolved.

...which is certainly not an easy job.

E.

> https://chameleoncloud.org/about/chameleon/
> "experimentation with high-memory, large-disk, low-power, GPU, and
> co-processor units."
>
> Below is some timing data for c140, illustrating the bottleneck.
>
> E5-2603v3-MIC5110P-20160106195503.txt
> started script @ Wed 6th: 7:55pm
>
> start          end            minutes  task
> 201601061955 - 201601062001         6  Client Launcher
> 201601062001 - 201601062101        60  polyselect1
> 201601062101 - 201601062154        53  polyselect2
> 201601062154 - 201601062154         1  makefb
> 201601062154 - 201601062155         1  freerel
> 201601062155 - 201601070506       431  las
> 201601070506 - 201601070533        27  filter
> 201601070533 - estimated         1800  linalg
>                estimated           60  sqrt
>                estimated         2439  c140 Total (40.65 hrs)
>
> Thanks,
> Rob
>
> On Thu, Jan 7, 2016 at 7:48 AM, Emmanuel Thomé <Emmanuel.Thome@inria.fr>
> wrote:
> > On Thu, Jan 07, 2016 at 09:01:12AM +0100, Paul Zimmermann wrote:
> >> Rob,
> >>
> >> > Using bridged networking and 8 MIC co-processors, hybrid Xeon and Xeon
> >> > Phi execution is feasible. Ten factor trials of RSA-120 finished in
> >> > under 2 hours.
> >> > (1.6337+1.6400+1.6357+1.7245+1.6255+1.7062+1.6105+1.6214+1.6021+1.6321)/10=1.64317
> >> > hrs
> >>
> >> I've just added timings on http://cado-nfs.gforge.inria.fr/ with 2.2.0:
> >> on a dual 8-core Intel(R) Xeon(R) CPU E5-2650 at 2.00GHz, factoring
> >> RSA-120 with CADO-NFS 2.2.0 takes about 2.2 hours of wall clock time
> >> (32.2 hours of cpu time). You used about 1200 hours cpu.
> >>
> >> > 1.) Will MPI enable utilization of the 1920 MIC cores for linalg?
> >>
> >> I'll let Emmanuel answer that.
> >
> > Probably, yes. It depends on which MPI implementation you intend to
> > use. But one may imagine that, with the proper MPI incantations, you
> > could run one thread per MIC core.
> >
> > This would mean passing mpi=48x40 to bwc.pl, and passing in
> > mpi_extra_args all the stuff needed to address the MIC cores, which I
> > have absolutely no clue about.
> >
> > I doubt this will be very efficient, though.
> >
> > E.
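
The figures quoted in this thread can be cross-checked with a few lines
of Python. This is only a sketch of the arithmetic: the variable names are
invented here, the linalg and sqrt minutes are the estimates from the c140
table, and the reading of mpi=48x40 as a 48 by 40 grid of MPI jobs is an
assumption based on how the option is described above.

```python
# Ten RSA-120 wall-clock trials (hours) on the hybrid Xeon + Xeon Phi setup.
trials_hours = [1.6337, 1.6400, 1.6357, 1.7245, 1.6255,
                1.7062, 1.6105, 1.6214, 1.6021, 1.6321]
mean_hours = sum(trials_hours) / len(trials_hours)

# Reference run on the dual 8-core E5-2650 versus the MIC run.
wall_ratio = 2.2 / mean_hours   # reference wall clock / MIC wall clock
cpu_ratio = 1200 / 32.2         # MIC cpu hours / reference cpu hours

# c140 stage timings in minutes (linalg and sqrt are estimates).
stage_minutes = {
    "Client Launcher": 6, "polyselect1": 60, "polyselect2": 53,
    "makefb": 1, "freerel": 1, "las": 431, "filter": 27,
    "linalg": 1800, "sqrt": 60,
}
total_minutes = sum(stage_minutes.values())
linalg_share = stage_minutes["linalg"] / total_minutes

# Assumed meaning of mpi=48x40: a 48 x 40 grid of MPI jobs.
mpi_jobs = 48 * 40

print(f"mean RSA-120 wall clock: {mean_hours:.5f} h")
print(f"wall-clock ratio: {wall_ratio:.2f}x, cpu-time ratio: {cpu_ratio:.1f}x")
print(f"c140 total: {total_minutes} min ({total_minutes / 60:.2f} h)")
print(f"linalg share of c140 total: {linalg_share:.1%}")
print(f"MPI jobs for mpi=48x40: {mpi_jobs}")
```

On these numbers the MIC run finishes somewhat sooner in wall-clock terms
while consuming far more CPU time, and the estimated linear algebra stage
accounts for roughly three quarters of the c140 total.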
