Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark


  • From: ƦOB COASTN <robertpancoast77@gmail.com>
  • To: Emmanuel Thomé <Emmanuel.Thome@inria.fr>
  • Cc: cado-nfs-discuss@lists.gforge.inria.fr
  • Subject: Re: [Cado-nfs-discuss] cado-nfs-2.2.0 RSA-120 MIC Benchmark
  • Date: Thu, 7 Jan 2016 11:00:28 -0500

Hello,

Regarding efficiency, keep in mind that 512-bit SIMD intrinsics are
available for the MIC, but it is a different binary target from AVX-512.
There is also a 512-bit fused multiply-add operation available for
exploitation. The logs show that CPU time is much higher, but there are
many more parallel threads per compute node. In my opinion, per-thread
performance is worse; more importantly, though, the total system
schedule for a single job is improved.
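To make the FMA point concrete, here is a minimal sketch of my own (not
cado-nfs code): the _mm512_fmadd_pd intrinsic is exposed under the same
name for both the first-generation MIC target (icc -mmic) and AVX-512,
even though the two targets produce incompatible binaries.

  #include <immintrin.h>
  #include <stdio.h>

  int main(void)
  {
      __m512d a = _mm512_set1_pd(2.0);   /* 8 lanes of 2.0 */
      __m512d b = _mm512_set1_pd(3.0);   /* 8 lanes of 3.0 */
      __m512d c = _mm512_set1_pd(1.0);   /* 8 lanes of 1.0 */

      /* d = a*b + c over 8 doubles in one fused multiply-add */
      __m512d d = _mm512_fmadd_pd(a, b, c);

      double out[8] __attribute__((aligned(64)));
      _mm512_store_pd(out, d);

      printf("%f\n", out[0]);            /* prints 7.000000 */
      return 0;
  }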
I predict that the launch of Knights Landing will bring a significant
drop in the cost of co-processor systems. Colfax International presents
a strong use case for Xeon Phi even at 1x application performance:
"...ratio of performance to consumed electrical power and the ratio of
performance to purchasing system cost, both under the assumption of
linear parallel scalability of the application." I think it's safe to
assume that future cloud systems will operate with co-processor nodes.
Nadia Heninger makes an interesting point in her paper: HPC hardware can
essentially be leased by the hour. If MIC-accelerated cado-nfs scales
linearly, what happens with cloud systems that include MIC
co-processors? The current software is highly parallel, and there are
many factors ready for a design of experiments (DoE); I want to get some
allocations on the Chameleon Cloud to test this hypothesis. However, the
bottleneck at the linear algebra stage must first be resolved.

https://chameleoncloud.org/about/chameleon/
"experimentation with high-memory, large-disk, low-power, GPU, and
co-processor units."

Below is some timing data for c140, illustrating the bottleneck.

E5-2603v3-MIC5110P-20160106195503.txt
script started Wed Jan 6 @ 7:55 pm

  start          end            minutes   task
  201601061955 - 201601062001         6   Client Launcher
  201601062001 - 201601062101        60   polyselect1
  201601062101 - 201601062154        53   polyselect2
  201601062154 - 201601062154         1   makefb
  201601062154 - 201601062155         1   freerel
  201601062155 - 201601070506       431   las
  201601070506 - 201601070533        27   filter
  201601070533 - (estimated)       1800   linalg
                 (estimated)         60   sqrt

  estimated c140 total: 2439 minutes (40.65 hrs)

Thanks,
Rob

On Thu, Jan 7, 2016 at 7:48 AM, Emmanuel Thomé <Emmanuel.Thome@inria.fr>
wrote:
> On Thu, Jan 07, 2016 at 09:01:12AM +0100, Paul Zimmermann wrote:
>> Rob,
>>
>> > Using bridged networking and 8 MIC co-processors, hybrid Xeon and Xeon
>> > Phi execution is feasible. Ten factor trials of RSA-120 finished in
>> > under 2 hours.
>> > (1.6337+1.6400+1.6357+1.7245+1.6255+1.7062+1.6105+1.6214+1.6021+1.6321)/10=1.64317
>> > hrs
>>
>> I've just added timings on http://cado-nfs.gforge.inria.fr/ with 2.2.0:
>> on a dual 8-core Intel(R) Xeon(R) CPU E5-2650 at 2.00GHz, factoring
>> RSA-120 with CADO-NFS 2.2.0 takes about 2.2 hours of wall clock time
>> (32.2 hours of cpu time). You used about 1200 hours cpu.
>>
>> > 1.) Will MPI enable utilization of the 1920 MIC cores for linalg?
>>
>> I let Emmanuel answer to that.
>
> probably, yes. Depends on which mpi implementation you intend to use. But
> one may imagine that with proper mpi incantations, you may run one thread
> per MIC core.
>
> This would mean passing mpi=48x40 to bwc.pl, and pass in mpi_extra_args
> all the stuff needed to address the MIC cores. Which I have absolutely no
> clue about.
>
> I doubt this will be very efficient, though.
>
> E.



