cado-nfs - Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark

Subject: Discussion related to cado-nfs

From: ƦOB COASTN <robertpancoast77@gmail.com>
To: Emmanuel Thomé <Emmanuel.Thome@inria.fr>
Cc: cado-nfs-discuss@lists.gforge.inria.fr
Subject: Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark
Date: Thu, 7 Jan 2016 11:49:10 -0500
List-archive: <http://lists.gforge.inria.fr/pipermail/cado-nfs-discuss/>
List-id: A discussion list for Cado-NFS <cado-nfs-discuss.lists.gforge.inria.fr>

Excellent insight! Can you please explain further what "the bottleneck is
more I/O than crunching power" means? I want to understand it.
Which stages of the algorithm are most affected? When it's crunching,
the MIC utilization tool, micsmc, shows ~8GB RAM usage, 100% user
CPU utilization, and 0% system CPU utilization for polyselect and las.
It appears that cado-nfs is actually using ALL available compute
resources on each of the 8 MICs. I wonder, how does the performance of
eight 5110P MICs compare with that of eight E5-2650s?
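
As a rough starting point for that comparison, the RSA-120 figures quoted further down in this thread can be put side by side. This is only a minimal sketch: the variable names are made up for illustration, the "about 1200 hours cpu" figure is approximate, and the MIC run was a hybrid Xeon plus Xeon Phi setup, so it compares two whole systems rather than MIC against E5-2650 in isolation.

```python
# RSA-120 with CADO-NFS 2.2.0, figures as quoted in this thread.
# Hybrid Xeon + 8 MIC system: wall-clock hours for ten factoring trials.
mic_wall_hours = [1.6337, 1.6400, 1.6357, 1.7245, 1.6255,
                  1.7062, 1.6105, 1.6214, 1.6021, 1.6321]
mic_cpu_hours = 1200.0   # "about 1200 hours cpu" (approximate)

# Dual 8-core E5-2650 at 2.00GHz.
xeon_wall_hours = 2.2
xeon_cpu_hours = 32.2

mic_mean_wall = sum(mic_wall_hours) / len(mic_wall_hours)

# Wall clock: how much sooner the MIC system finishes.
wall_speedup = xeon_wall_hours / mic_mean_wall

# CPU time: how much more total thread time the MIC system spends.
cpu_cost_ratio = mic_cpu_hours / xeon_cpu_hours

print(f"mean MIC wall clock : {mic_mean_wall:.3f} h")
print(f"wall-clock speedup  : {wall_speedup:.2f}x")
print(f"CPU-time cost ratio : {cpu_cost_ratio:.1f}x")
```

With these inputs the wall clock improves by roughly a third while the total CPU time grows by more than an order of magnitude, which is consistent with much weaker per-thread performance on the MIC.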

On Thu, Jan 7, 2016 at 11:22 AM, Emmanuel Thomé <Emmanuel.Thome@inria.fr>
wrote:
> On Thu, Jan 07, 2016 at 11:00:28AM -0500, ƦOB COASTN wrote:
>> Hello,
>>
>> Regarding efficiency, keep in mind that 512-bit SIMD intrinsics are
>> available for the MIC, but its binary encoding differs from AVX-512.
>
> that pays little if the code doesn't use them...
>
> Using Xeon Phi in a better way, taking advantage of the hardware ops it
> offers, is a nice project. But not a simple task. Not something on our
> roadmap right now.
>
>> Also there exists a 512-bit fused multiply-add operation available for
>> exploitation.
>
> Hardly worth anything for what we're doing. (performance much more driven
> by integer arithmetic, as well as memory and interconnect bandwidth).
>
>> The logs show CPU time is much higher, but there are
>> many more parallel threads per compute node. In my opinion, per-thread
>> performance is worse; but more importantly, the total wall-clock
>> schedule for a single job is improved. I predict that the launch of
>> Knights Landing will bring a significant drop in the cost of
>> co-processor systems. Colfax International presents a strong use case
>> for 1x performance on Xeon Phi: "...ratio of performance to consumed
>> electrical power and the ratio of performance to purchasing system
>> cost, both under the assumption of linear parallel scalability of the
>> application." I think it's safe to assume that future cloud systems
>> will operate with co-processor nodes. Nadia Heninger makes an
>> interesting point in her paper: HPC hardware can essentially be leased
>> by the hour.
>
> Depends on what you're talking about. Amazon won't let me rent a (say) 128-node
> 32 cores-per-node infiniband EDR full bandwidth cluster, you see. Not for
> an affordable price, at least (and as it turns out, if I mean to use such
> big iron, it's not exactly for a few hours).
>
>> If MIC accelerated cado-nfs scales linearly,
>
> This sort of assumption leads me to temper any enthusiasm in the
> following statements pretty abruptly. There is absolutely no reason to
> hope for such a linear speedup, because the bottleneck is more I/O than
> crunching power!! Offer me twice as many cores or half the RAM
> (off-cache) load latency, and I'll take the latter.
>
> MICs are not as bad as GPUs in this regard, but should still be considered
> memory deprived, unless I'm wholly mistaken.
>
>> what
>> happens with cloud systems that include MIC co-processors? The
>> current software is highly parallel, and there are many factors
>> ready for a DoE. I want to get some allocations on the Chameleon Cloud
>> to test this hypothesis. However, the process bottleneck at the linear
>> algebra stage must first be resolved.
>
> ...which is certainly not an easy job.
>
>
> E.
>
>> https://chameleoncloud.org/about/chameleon/
>> "experimentation with high-memory, large-disk, low-power, GPU, and
>> co-processor units."
>>
>> Below is some timing data for c140, illustrating the bottleneck.
>>
>> E5-2603v3-MIC5110P-20160106195503.txt
>> started script @ Wed 6th: 7:55pm
>>
>> start        - end            minutes  task
>> 201601061955 - 201601062001         6  Client Launcher
>> 201601062001 - 201601062101        60  polyselect1
>> 201601062101 - 201601062154        53  polyselect2
>> 201601062154 - 201601062154         1  makefb
>> 201601062154 - 201601062155         1  freerel
>> 201601062155 - 201601070506       431  las
>> 201601070506 - 201601070533        27  filter
>> 201601070533 - (estimated)       1800  linalg
>>                (estimated)         60  sqrt
>> estimated c140 total             2439  (40.65 hrs)
>>
>> Thanks,
>> Rob
>>
>> On Thu, Jan 7, 2016 at 7:48 AM, Emmanuel Thomé <Emmanuel.Thome@inria.fr>
>> wrote:
>> > On Thu, Jan 07, 2016 at 09:01:12AM +0100, Paul Zimmermann wrote:
>> >> Rob,
>> >>
>> >> > Using bridged networking and 8 MIC co-processors, hybrid Xeon and Xeon
>> >> > Phi execution is feasible. Ten factor trials of RSA-120 finished in
>> >> > under 2 hours.
>> >> > (1.6337+1.6400+1.6357+1.7245+1.6255+1.7062+1.6105+1.6214+1.6021+1.6321)/10=1.64317
>> >> > hrs
>> >>
>> >> I've just added timings on http://cado-nfs.gforge.inria.fr/ with 2.2.0:
>> >> on a dual 8-core Intel(R) Xeon(R) CPU E5-2650 at 2.00GHz, factoring
>> >> RSA-120 with CADO-NFS 2.2.0 takes about 2.2 hours of wall clock time
>> >> (32.2 hours of cpu time). You used about 1200 hours cpu.
>> >>
>> >> > 1.) Will MPI enable utilization of the 1920 MIC cores for linalg?
>> >>
>> >> I'll let Emmanuel answer that.
>> >
>> > probably, yes. Depends on which mpi implementation you intend to use. But
>> > one may imagine that with proper mpi incantations, you may run one thread
>> > per MIC core.
>> >
>> > This would mean passing mpi=48x40 to bwc.pl, and passing in mpi_extra_args
>> > all the stuff needed to address the MIC cores. Which I have absolutely no
>> > clue about.
>> >
>> > I doubt this will be very efficient, though.
>> >
>> > E.
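
The c140 stage timings quoted earlier in this thread can be tallied to see how much of the run the linear algebra stage accounts for. The following is a minimal sketch, not part of cado-nfs: the dictionary and the helper name `overall_speedup` are made up for illustration, the linalg and sqrt durations are the estimates given in the original message, and the model assumes (Amdahl-style) that linalg time stays fixed while every other stage is accelerated by the same factor.

```python
# Stage durations in minutes, from the c140 timing data quoted in this thread.
# linalg and sqrt are estimates in the original message, not measurements.
stage_minutes = {
    "Client Launcher": 6,
    "polyselect1": 60,
    "polyselect2": 53,
    "makefb": 1,
    "freerel": 1,
    "las": 431,
    "filter": 27,
    "linalg": 1800,
    "sqrt": 60,
}

total = sum(stage_minutes.values())
linalg = stage_minutes["linalg"]
linalg_share = linalg / total


def overall_speedup(factor):
    """Whole-job speedup if every stage except linalg runs `factor` times faster."""
    other = total - linalg
    return total / (linalg + other / factor)


# Upper bound: even if all other stages took no time at all.
max_speedup = total / linalg

print(f"total          : {total} min ({total / 60:.2f} h)")
print(f"linalg share   : {linalg_share:.1%}")
print(f"2x other stages: {overall_speedup(2.0):.2f}x overall")
print(f"upper bound    : {max_speedup:.2f}x overall")
```

Under these assumptions, speeding up polyselect and las alone cannot shorten the job by much, which is why the linear algebra stage has to be addressed first.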
