math-atlas-devel Mailing List for Automatically Tuned Linear Algebra Soft.

Brought to you by: rwhaley, tonyc040457

math-atlas-devel — ATLAS developers' list

You can subscribe to this list here.

2001	_Jan	_Feb	_Mar	_Apr	_May	_Jun	_Jul	_Aug	_Sep (8)	_Oct (17)	_Nov (29)	_Dec (30)
2002	_Jan (19)	_Feb (19)	_Mar (29)	_Apr (3)	_May (38)	_Jun (14)	_Jul (6)	_Aug (7)	_Sep (12)	_Oct (6)	_Nov (9)	_Dec
2003	_Jan (6)	_Feb (5)	_Mar (8)	_Apr (10)	_May (4)	_Jun (11)	_Jul (5)	_Aug (3)	_Sep (12)	_Oct (1)	_Nov (9)	_Dec (45)
2004	_Jan (7)	_Feb (6)	_Mar (4)	_Apr (7)	_May (7)	_Jun (30)	_Jul (7)	_Aug (6)	_Sep (1)	_Oct (4)	_Nov (18)	_Dec (25)
2005	_Jan (11)	_Feb (10)	_Mar (3)	_Apr (7)	_May	_Jun	_Jul (1)	_Aug (29)	_Sep (6)	_Oct (8)	_Nov (2)	_Dec (5)
2006	_Jan	_Feb (16)	_Mar (2)	_Apr (9)	_May (15)	_Jun (24)	_Jul (10)	_Aug (39)	_Sep (20)	_Oct (8)	_Nov (30)	_Dec (28)
2007	_Jan (1)	_Feb (19)	_Mar (11)	_Apr (3)	_May (12)	_Jun (7)	_Jul (20)	_Aug (9)	_Sep (7)	_Oct (7)	_Nov (8)	_Dec (6)
2008	_Jan (3)	_Feb (8)	_Mar	_Apr	_May (7)	_Jun (16)	_Jul (38)	_Aug (11)	_Sep (6)	_Oct (2)	_Nov	_Dec (4)
2009	_Jan (6)	_Feb (25)	_Mar (13)	_Apr (5)	_May	_Jun	_Jul (1)	_Aug (8)	_Sep (16)	_Oct (17)	_Nov (2)	_Dec (1)
2010	_Jan (3)	_Feb (3)	_Mar (2)	_Apr (5)	_May	_Jun (2)	_Jul	_Aug	_Sep	_Oct (16)	_Nov (53)	_Dec (7)
2011	_Jan (10)	_Feb (37)	_Mar (30)	_Apr (12)	_May (5)	_Jun (14)	_Jul (7)	_Aug (8)	_Sep (37)	_Oct (3)	_Nov (5)	_Dec (60)
2012	_Jan (25)	_Feb (5)	_Mar (4)	_Apr (7)	_May (12)	_Jun (28)	_Jul (28)	_Aug (2)	_Sep (5)	_Oct (6)	_Nov	_Dec (17)
2013	_Jan (18)	_Feb (10)	_Mar (30)	_Apr (21)	_May	_Jun (10)	_Jul (8)	_Aug	_Sep (39)	_Oct (54)	_Nov (8)	_Dec (6)
2014	_Jan (17)	_Feb (14)	_Mar (16)	_Apr (67)	_May (2)	_Jun (8)	_Jul (7)	_Aug (9)	_Sep (6)	_Oct (9)	_Nov (12)	_Dec
2015	_Jan (5)	_Feb (9)	_Mar (1)	_Apr (2)	_May	_Jun (1)	_Jul (2)	_Aug (6)	_Sep (1)	_Oct (1)	_Nov	_Dec (3)
2016	_Jan	_Feb	_Mar	_Apr	_May	_Jun (3)	_Jul (22)	_Aug	_Sep (1)	_Oct	_Nov (21)	_Dec
2017	_Jan (20)	_Feb	_Mar (2)	_Apr	_May	_Jun (8)	_Jul	_Aug (1)	_Sep	_Oct	_Nov	_Dec
2018	_Jan	_Feb	_Mar	_Apr	_May	_Jun	_Jul	_Aug	_Sep	_Oct (3)	_Nov	_Dec

Flat | Threaded

1 2 3 .. 81 > >> (Page 1 of 81)

Re: [atlas-devel] 3.11.41

From: Fulton, B. <bef...@iu...> - 2018年10月10日 13:30:53

Attachments: smime.p7s

I've built this for IU's Carbonate cluster. I'll test it more later, but the
"make install" appeared to want a recursive copy when installing the include
files, so I added that flag. I also tried to run "make time" on a couple of
nodes with slightly different configurations, but it appeared to return the
exact same values - is there a "make timeclean" or some equivalent I could
run?
--
Ben Fulton
Research Technologies
Scientific Applications and Performance Tuning
Indiana University
E-Mail: bef...@iu...
-----Original Message-----
From: R. Clint Whaley <rcw...@iu...> 
Sent: Friday, October 5, 2018 3:30 AM
To: List for developer discussion, NOT SUPPORT.
<mat...@li...>
Subject: [atlas-devel] 3.11.41
I have released 3.11.41. It is a bugfix release, fixing rotmg, assembly
errors on POWER, and a performance regression in small triangle TRMM.
Cheers,
Clint
ATLAS 3.11.41 released 10/05/18, highlights of changes from 3.11.40:
 * Fixed bug in drotmg: https://sourceforge.net/p/math-atlas/bugs/256/
 * Fixed assembly errors for POWER9 (failure to save correct regs)
 * Fixed performance regression for small triangle TRMM
--
******************************************
** R. Clint Whaley, PhD, Assoc Prof, IU
** http://homes.soic.indiana.edu/rcwhaley/
******************************************
_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

[atlas-devel] 3.11.41

From: R. C. W. <rcw...@iu...> - 2018年10月05日 07:30:36

I have released 3.11.41. It is a bugfix release, fixing rotmg, assembly 
errors on POWER, and a performance regression in small triangle TRMM.
Cheers,
Clint
ATLAS 3.11.41 released 10/05/18, highlights of changes from 3.11.40:
 * Fixed bug in drotmg: https://sourceforge.net/p/math-atlas/bugs/256/
 * Fixed assembly errors for POWER9 (failure to save correct regs)
 * Fixed performance regression for small triangle TRMM
-- 
******************************************
** R. Clint Whaley, PhD, Assoc Prof, IU
** http://homes.soic.indiana.edu/rcwhaley/
******************************************

[atlas-devel] 3.11.40: I am not dead

From: R. C. W. <rcw...@iu...> - 2018年10月03日 00:53:40

Guys,
Sorry to spam both lists and any dups that causes, but since it has 
looked like I've retired, I'm sending this to atlas-devel & announce.
3.11.40 has finally been released. I have actually been working on it 
for most of this time, but, with the move to Indiana factored in, it has 
taken me this long to get the framework working again!
The reason is that we have essentially rewritten the entire way 
microkernels are tuned and accessed in the library. Therefore, the 
majority of tuning code has been touched or rewritten, and since this 
includes all the generation, etc, it took a long while to get things at 
all reliable.
The end goal is that increased microkernel specialization should greatly 
increase our weird-shape and parallel scaling performance.
Right now, you will hopefully see much better serial non-GEMM BLAS 
performance (eg., small-triangle TRSM or TRMM, for instance). Very 
large problems aren't likely to have a huge difference, if prior 
releases supported your architecture well (eg., we've added AVX-512 to 
the code generators, which obviously will hugely improve SkylakeX 
asymptotic performance).
The installs have gone from long to endless, unfortunately. I will fix 
this before stable, but right now searches are all brute-force and 
ignorance while we concentrate on getting the last of the microkernel 
handling solidified. I will attempt to speed up search later, and allow 
for a "no-timing" install from archdefs, so that people on 
already-supported platforms can skip most or all of the tuning (a 
feature many maintainers have long wanted).
For now, terrible install times will just be a feature until we finish 
debugging and publish the new BLAS approach.
The major weakness in the install when ran on arbitrary machines right 
now (other than time) is in some new cache detection code that creates a 
file called atlas_cache.h. This code dies on several machines, and I 
haven't had time to track down details. However, if it fails for you, 
open up a tracker item and I can tell you how to proceed beyond it even 
before fixing the code in question.
Hopefully, this release should be purely faster than any other that came 
before, but if you spot performance regressions, please let us know. We 
are not yet always using the correct microkernel (even when the library 
has built it), because our selection algorithm work is awaiting the 
finishing of the new tuning strategy.
Eventually, ATLAS will be able to not only tune microkernels to make the 
BLAS/LAPACK, but specialized operations for people wanting to avoid BLAS 
overheads (at cost of calling messy microkernels; think of things like 
tensor algebra with very small shapes that need to scale, perhaps 
machine learning, etc.). This allows you to have detailed cache control 
necessary to scale when the problem size isn't large enough to dominate 
low-order terms, and thus make BLAS API OK.
ChangeLog (which has almost no detail on massive changes) is below.
Cheers,
Clint
ATLAS 3.11.40 released 10/02/18, highlights of changes from 3.11.39:
 * Basically a rewrite of all L3BLAS and LAPACK tuning framework:
 + Complete rewrite of all searches to allow different "views" of 
kernels
 for maximum performance for all-BLAS usage; present 
implementation very
 slow even with archdefs, will need to be speed up before stable
 + Complete rewrite of gemm kernel choice mechanism
 + Complete rewrite of all BLAS handling for much improved 
small/medium perf
 via greater use of microkernels
 * Addition of core count to archdefs, because this usually increases 
block
 factors when maximizing performance
 * Addition of -ansi flag to avoid C changes borking include files
 * Archdef support for host of modern Intel/AMD + POWER9:
 - Corei264AVXp16, Corei3EP64AVXMACp36, Corei4X64AVXZp18,
 - AMD64K10h64SSE3p32, AMDRyzen64AVXMACp[8,16,64]
 - ARM64xgene164p8, ARM64thund64p48
 - POWER964LEVSXp8
 * Addition of cpuid-based cache detection for Intel & AMD x86 machines
 - Presently gets wrong answer on some machines, where shared caches
 are either multiplied or divided by P inappropriately
 * Beginning of rewrite of generic cache detection
 * Fixed bug where names like "c99-gcc" preferred over "gcc"
 * Added -Si indthr 1 option to autoprobe for aliased thread IDs
 + Presently, only supported on ARM64 & x86 with at least SSE2
 * Complete rewrite of gemm kernel indexing to compact data structures
 and minimize cache pollution

[atlas-devel] move reason for no reply

From: R. C. W. <rcw...@iu...> - 2017年08月17日 13:15:37

Guys,
I am now at Indiana University, having just completed my move, and am 
presently preparing to teach next week. This is a reason for the delay 
in responding to the several 3.10 patches/questions just recently. I am 
keeping your e-mails, and will respond as soon as I get on top of the 
new place and its processes.
The recent delay in developer releases is because I rewrote my 
microkernel handling for greater efficiency, and it has taken a *long* 
time to get it working again. We are presently working on greatly 
improving our non-GEMM small-case performance, which I think is going to 
be worth the wait when I get it out.
Anyway, I'm still working on both stable & developer, and will respond 
as soon as I can.
Cheers,
Clint
-- 
******************************************
** R. Clint Whaley, PhD, Assoc Prof, IU
** http://homes.soic.indiana.edu/rcwhaley/
******************************************

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: R. C. W. <wh...@my...> - 2017年06月30日 00:14:15

>
> The implementation of HT has improved over the years, so please don't 
> assume results obtained on older processors are applicable to the 
> current ones. I used to be a HT skeptic but almost everything runs 
> faster with them on Haswell and later, particularly the client parts 
> (i.e. Core series as opposed to Xeon).
Unless they have changed the definition of what HT does, I do not see a 
theoretical way to avoid the cache problem.
>
> You might try running an actual application, where you get a mix
> of kernels. This tends to stress the cache more, and can
> sometimes expose the downside of HT.
>
>
> On the other hand, idle HTs help with OS interrupts and other stuff 
> that happens quite a bit in an HPC environment once one starts using 
> MPI etc. This is one of the reasons I encourage everyone to enable HT 
> in the BIOS even if their applications don't use them.
If the OS interrupts, its interrupting all threads, so I don't think I'm 
following this line of thought. Maybe you mean that if you have a huge 
stack of threads to be run, using HT you have 2 or 4 slots to round 
robin into once interrupted?
>
> I remember finding slight speedup in some case leading me to think
> HT was helpful, but then I had performance collapses other places,
> which led to me to recommend turning it off (or using affinity to
> avoid it, like MKL is doing, if you can't turn it off) to maximize
> performance.
>
>
> If nothing else, HT doubles the number of threads, which hurts any 
> part of a code that scales poorly, and it makes it harder to manage 
> affinity. I had to spend quite a bit of time helping users with SMT 
> (2-4 HW threads per core) on Blue Gene/Q in my old job.
>
> So, for instance, take LAPACK or ATLAS LU or QR (or your own
> version) and hook them up to the two BLAS. Does the non-MKL
> HT-liking kernel get anywhere close to MKL performance despite
> it's gemm looking as good with HT, or does it collapse its
> performance while MKL maintains?
>
>
> I don't have test driver for those already so I'm afraid I'm not going 
> to punt on those experiments. However, if somebody else posts the 
> code, I'll certainly run it and post results for generally available 
> hardware.
ATLAS comes with timers for any or all of these. They are built to time 
other's libs too.
For instance, set BLASlib to MKL, set FLAPACKlib to your f77 lapack, and 
"make xdtlatime_fl_sb" will time using MKL + LAPACK. Switch BLASlib to 
bliss now, remake, voila.
> My guess is the MKL group got the same "HT not-reliable, non-HT
> is" results, and that's why its behaving in this way.
>
>
> Maybe. In any case, it simplifies the design space to not have to 
> think about >1 threads sharing an L1.
L1 is not the problem on modern machines. As you scale like with Xeon-E 
series you need to use every scrap of cache, including shared. If you 
use the full scale of something like 12-cores per shared cache, I 
believe you will see substantial slowdowns from HT.
Cheers,
Clint
>
> Jeff
>
> Thanks for results!
> Clint
>
> On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
>
> Good catch. strace shows only 35 calls to clone in both cases
> with MKL. I didn’t know that MKL was doing these tricks.
>
> However, I tested another DGEMM implementation that supports
> AVX2 and it uses all of the HTs and it performs on par with
> MKL, but only when HT is used.
>
> Jeff
>
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
> KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x
> 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
> KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x
> 2>&1 | head -n5000 | grep -c clone
> 35
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
> KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep
> -v "%"
> blis_dgemm_nn_rrr 384 384 384 204.027 
> 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 650.820 
> 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 816.355 
> 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 835.650 
> 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 832.179 
> 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 863.123 
> 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 844.502 
> 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 860.262 
> 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 851.694 
> 5.80e-18 PASS
> blis_dgemm_nn_rrr 3840 3840 3840 856.526 
> 6.79e-18 PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
> KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep
> -v "%"
> blis_dgemm_nn_rrr 384 384 384 161.331 
> 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 437.967 
> 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 545.498 
> 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 616.338 
> 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 606.650 
> 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 611.153 
> 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 603.314 
> 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 631.292 
> 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 625.833 
> 5.80e-18 PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
> KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep
> -v "%"
> blis_dgemm_nn_rrr 384 384 384 159.789 
> 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 443.810 
> 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 536.077 
> 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 596.069 
> 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 595.763 
> 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 616.531 
> 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 591.823 
> 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 615.153 
> 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 621.714 
> 5.80e-18 PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
> KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep
> -v "%"
> blis_dgemm_nn_rrr 384 384 384 189.615 
> 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 423.504 
> 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 445.424 
> 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 444.830 
> 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 442.893 
> 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 445.979 
> 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 445.694 
> 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 451.026 
> 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 454.909 
> 5.80e-18 PASS
>
>
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley
> <rcw...@ls...
> <mailto:rcw...@ls...><mailto:rcw...@ls...
> <mailto:rcw...@ls...>>> wrote:
> Jeff,
>
> Have you run a thread monitor to see if MKL is simply not
> using the hyperthreading regardless of whether it is on or off
> in BIOS?
>
> You also may want to try something like LU.
>
> Cheers,
> Clint
>
>
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not
> using HT, at
> least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5%
> gain here is
> irrelevant and may be due to thermal effects (this box is in
> my cubicle,
> not an air-conditioned machine room).
>
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8582699 844.4612765
> Intel MKL (parallel) 15360 15360 1536 0.8627163 840.1089930
>
> HT on
>
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8636520 839.1988073
> Intel MKL (parallel) 15360 15360 1536 0.8644268 838.4465853
>
> I would be interested to see folks post data to support the
> argument
> against HT.
>
> Jeff
>
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <
> mat...@li...
> <mailto:mat...@li...><mailto:mat...@li...
> <mailto:mat...@li...>>> wrote:
>
> Thank you very much for quick response. Just to check if my
> understanding
> is correct :
>
> 1. By turning off cpuid in bios, I only need to use -t N to
> build Atlas
> right?
>
> 2. The N in -t N is the total number of threads on the
> machine, not per
> Cpu right ?
>
> 3. One more question I have is, how to set the correct -t N
> for mpi based
> application.
> Let's say on the 2-cpu machine with 4 cores per CPU,
> should I use -t
> 4 or -t 8 if I rum my application with 2 mpi processes :
> mpirun -n 2 myprogram
>
> Many thanks !
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley
> <wh...@my...
> <mailto:wh...@my...><mailto:wh...@my...
> <mailto:wh...@my...>>> wrote:
> Hyperthreading is an optimization aimed at addressing poorly
> optimized
> code. The idea is that most codes cannot drive the backend
> hardware
> (ALU/FPU, etc) at the maximal rate, so if you duplicate
> registers you
> can, amongst several threads, find enough work to keep the
> backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs
> the FPU at
> its maximal rate supported by the cache architecture after
> cache blocking.
>
> If you can already drive the backend at >90% of peak, then
> hyperthreading can actually *lose* you performance, as the
> threads bring
> conflicting data in the cache.
>
> It's usually not a night and day difference, but I haven't
> measured it
> in the huge blocking era used by recent developer releases (it
> may be
> worse there).
>
> My general recommendation is turn off hyperthreading for highly
> optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that
> varies by
> machine. On x86, you can use CPUID to determine that if you are
> super-knowledgeable. I usually just turn it off in the BIOS,
> because I
> don't like something that may thrash my cache running, even if
> it might
> occasionally help :)
>
> Cheers,
> Clint
>
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,Would like go check if my understanding is correct for
> compiling
> Atlas on a machine that has multiple CPUs and hyperthreading.
> I have two types of machine:
> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core-
> 2 CPU,
> each with 8 Cores, hyperthreaded, 2 threads per core
> So when I compile Atlas, is it correct that I should use:
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the
> affinity ID
> is from 0-7 and 0-15).
> That means the number 8 or 16 is the total cores on the
> machine, not
> number of cores per CPU. Am I correct ?
> I also read somewhere saying that Atlas supports
> Hyperthreading. What
> does this mean ?
> Does this mean:1. I do not need to disable hyperthreading in
> BIOS (no
> performance difference whether it is enabled or disabled, as
> long as the
> number of threads and affinity IDs are set correctly when
> compiling
> Atlas)2. Or I can make use of the hyperthread, that is, -tl 16
> and -tl 32 ?
> Thank you very much,
> lixin
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
>
>
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> <mailto:Mat...@li...><mailto:Mat...@li...
> <mailto:Mat...@li...>>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> <https://lists.sourceforge.net/lists/listinfo/math-atlas-devel>
>
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> <mailto:Mat...@li...><mailto:Mat...@li...
> <mailto:Mat...@li...>>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> <https://lists.sourceforge.net/lists/listinfo/math-atlas-devel>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> <mailto:Mat...@li...><mailto:Mat...@li...
> <mailto:Mat...@li...>>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> <https://lists.sourceforge.net/lists/listinfo/math-atlas-devel>
>
>
>
> --
> Jeff Hammond
> jef...@gm...
> <mailto:jef...@gm...><mailto:jef...@gm...
> <mailto:jef...@gm...>>
> http://jeffhammond.github.io/
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
>
>
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> <mailto:Mat...@li...><mailto:Mat...@li...
> <mailto:Mat...@li...>>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> <https://lists.sourceforge.net/lists/listinfo/math-atlas-devel>
>
> --
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU *
> www.csc.lsu.edu/~whaley
> <http://www.csc.lsu.edu/%7Ewhaley><http://www.csc.lsu.edu/~whaley
> <http://www.csc.lsu.edu/%7Ewhaley>> **
> **********************************************************************
>
>
>
>
> --
> Jeff Hammond
> jef...@gm...
> <mailto:jef...@gm...><mailto:jef...@gm...
> <mailto:jef...@gm...>>
> http://jeffhammond.github.io/
>
>
>
> -- 
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU *
> www.csc.lsu.edu/~whaley <http://www.csc.lsu.edu/%7Ewhaley> **
> **********************************************************************
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> <mailto:Mat...@li...>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> <https://lists.sourceforge.net/lists/listinfo/math-atlas-devel>
>
>
>
>
> -- 
> Jeff Hammond
> jef...@gm... <mailto:jef...@gm...>
> http://jeffhammond.github.io/
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
>
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: Jeff H. <jef...@gm...> - 2017年06月29日 23:33:40

On Thu, Jun 29, 2017 at 4:10 PM, R. Clint Whaley <rcw...@ls...> wrote:
> Yeah, if it can't get that perf w/o hyperthreading, its not fully tuned.
>
>
Agreed. BLIS is just a framework and I'm using the default blocking
parameters. I know from discussions with Greg Henry that scaling all the
way out on the high-core-count Xeon processors requires some algorithm
changes. I expect that if I play around with the knobs of BLIS, it will
perform optimally with 1 HT per core.
> Back in day when I investigated HT, the problem really is in cache
> stomping, as two threads compete for the same cache. This makes the
> effects unpredictable (if the cache wasn't being fully utilized, maybe no
> effect, if you get lucky on the replacement, maybe tiny effect, and if you
> get unlucky, an truly bad dropoff).
>
>
The implementation of HT has improved over the years, so please don't
assume results obtained on older processors are applicable to the current
ones. I used to be a HT skeptic but almost everything runs faster with
them on Haswell and later, particularly the client parts (i.e. Core series
as opposed to Xeon).
> You might try running an actual application, where you get a mix of
> kernels. This tends to stress the cache more, and can sometimes expose the
> downside of HT.
>
>
On the other hand, idle HTs help with OS interrupts and other stuff that
happens quite a bit in an HPC environment once one starts using MPI etc.
This is one of the reasons I encourage everyone to enable HT in the BIOS
even if their applications don't use them.
> I remember finding slight speedup in some case leading me to think HT was
> helpful, but then I had performance collapses other places, which led to me
> to recommend turning it off (or using affinity to avoid it, like MKL is
> doing, if you can't turn it off) to maximize performance.
>
>
If nothing else, HT doubles the number of threads, which hurts any part of
a code that scales poorly, and it makes it harder to manage affinity. I
had to spend quite a bit of time helping users with SMT (2-4 HW threads per
core) on Blue Gene/Q in my old job.
> So, for instance, take LAPACK or ATLAS LU or QR (or your own version) and
> hook them up to the two BLAS. Does the non-MKL HT-liking kernel get
> anywhere close to MKL performance despite it's gemm looking as good with
> HT, or does it collapse its performance while MKL maintains?
>
>
I don't have test driver for those already so I'm afraid I'm not going to
punt on those experiments. However, if somebody else posts the code, I'll
certainly run it and post results for generally available hardware.
> My guess is the MKL group got the same "HT not-reliable, non-HT is"
> results, and that's why its behaving in this way.
>
>
Maybe. In any case, it simplifies the design space to not have to think
about >1 threads sharing an L1.
Jeff
> Thanks for results!
> Clint
>
> On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
>
>> Good catch. strace shows only 35 calls to clone in both cases with MKL.
>> I didn’t know that MKL was doing these tricks.
>>
>> However, I tested another DGEMM implementation that supports AVX2 and it
>> uses all of the HTs and it performs on par with MKL, but only when HT is
>> used.
>>
>> Jeff
>>
>>
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
>> KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 |
>> head -n5000 | grep -c clone
>> 71
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
>> KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 |
>> head -n5000 | grep -c clone
>> 35
>>
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
>> KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
>> blis_dgemm_nn_rrr 384 384 384 204.027 8.27e-18
>> PASS
>> blis_dgemm_nn_rrr 768 768 768 650.820 5.36e-18
>> PASS
>> blis_dgemm_nn_rrr 1152 1152 1152 816.355 4.40e-18
>> PASS
>> blis_dgemm_nn_rrr 1536 1536 1536 835.650 7.02e-18
>> PASS
>> blis_dgemm_nn_rrr 1920 1920 1920 832.179 9.96e-18
>> PASS
>> blis_dgemm_nn_rrr 2304 2304 2304 863.123 6.28e-18
>> PASS
>> blis_dgemm_nn_rrr 2688 2688 2688 844.502 8.28e-18
>> PASS
>> blis_dgemm_nn_rrr 3072 3072 3072 860.262 9.92e-18
>> PASS
>> blis_dgemm_nn_rrr 3456 3456 3456 851.694 5.80e-18
>> PASS
>> blis_dgemm_nn_rrr 3840 3840 3840 856.526 6.79e-18
>> PASS
>>
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
>> KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
>> blis_dgemm_nn_rrr 384 384 384 161.331 8.27e-18
>> PASS
>> blis_dgemm_nn_rrr 768 768 768 437.967 5.36e-18
>> PASS
>> blis_dgemm_nn_rrr 1152 1152 1152 545.498 4.40e-18
>> PASS
>> blis_dgemm_nn_rrr 1536 1536 1536 616.338 7.02e-18
>> PASS
>> blis_dgemm_nn_rrr 1920 1920 1920 606.650 9.96e-18
>> PASS
>> blis_dgemm_nn_rrr 2304 2304 2304 611.153 6.28e-18
>> PASS
>> blis_dgemm_nn_rrr 2688 2688 2688 603.314 8.28e-18
>> PASS
>> blis_dgemm_nn_rrr 3072 3072 3072 631.292 9.92e-18
>> PASS
>> blis_dgemm_nn_rrr 3456 3456 3456 625.833 5.80e-18
>> PASS
>>
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72
>> KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
>> blis_dgemm_nn_rrr 384 384 384 159.789 8.27e-18
>> PASS
>> blis_dgemm_nn_rrr 768 768 768 443.810 5.36e-18
>> PASS
>> blis_dgemm_nn_rrr 1152 1152 1152 536.077 4.40e-18
>> PASS
>> blis_dgemm_nn_rrr 1536 1536 1536 596.069 7.02e-18
>> PASS
>> blis_dgemm_nn_rrr 1920 1920 1920 595.763 9.96e-18
>> PASS
>> blis_dgemm_nn_rrr 2304 2304 2304 616.531 6.28e-18
>> PASS
>> blis_dgemm_nn_rrr 2688 2688 2688 591.823 8.28e-18
>> PASS
>> blis_dgemm_nn_rrr 3072 3072 3072 615.153 9.92e-18
>> PASS
>> blis_dgemm_nn_rrr 3456 3456 3456 621.714 5.80e-18
>> PASS
>>
>> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36
>> KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
>> blis_dgemm_nn_rrr 384 384 384 189.615 8.27e-18
>> PASS
>> blis_dgemm_nn_rrr 768 768 768 423.504 5.36e-18
>> PASS
>> blis_dgemm_nn_rrr 1152 1152 1152 445.424 4.40e-18
>> PASS
>> blis_dgemm_nn_rrr 1536 1536 1536 444.830 7.02e-18
>> PASS
>> blis_dgemm_nn_rrr 1920 1920 1920 442.893 9.96e-18
>> PASS
>> blis_dgemm_nn_rrr 2304 2304 2304 445.979 6.28e-18
>> PASS
>> blis_dgemm_nn_rrr 2688 2688 2688 445.694 8.28e-18
>> PASS
>> blis_dgemm_nn_rrr 3072 3072 3072 451.026 9.92e-18
>> PASS
>> blis_dgemm_nn_rrr 3456 3456 3456 454.909 5.80e-18
>> PASS
>>
>>
>> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...
>> <mailto:rcw...@ls...>> wrote:
>> Jeff,
>>
>> Have you run a thread monitor to see if MKL is simply not using the
>> hyperthreading regardless of whether it is on or off in BIOS?
>>
>> You also may want to try something like LU.
>>
>> Cheers,
>> Clint
>>
>>
>> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
>> I don't see any negative impact from using HT relative to not using HT, at
>> least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5% gain here is
>> irrelevant and may be due to thermal effects (this box is in my cubicle,
>> not an air-conditioned machine room).
>>
>> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine
>> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
>> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
>> Intel MKL (parallel) 15360 15360 1536 0.8582699 844.4612765
>> Intel MKL (parallel) 15360 15360 1536 0.8627163 840.1089930
>>
>> HT on
>>
>> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine
>> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
>> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
>> Intel MKL (parallel) 15360 15360 1536 0.8636520 839.1988073
>> Intel MKL (parallel) 15360 15360 1536 0.8644268 838.4465853
>>
>> I would be interested to see folks post data to support the argument
>> against HT.
>>
>> Jeff
>>
>> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <
>> mat...@li...<mailto:math-atlas-
>> de...@li...>> wrote:
>>
>> Thank you very much for quick response. Just to check if my understanding
>> is correct :
>>
>> 1. By turning off cpuid in bios, I only need to use -t N to build Atlas
>> right?
>>
>> 2. The N in -t N is the total number of threads on the machine, not per
>> Cpu right ?
>>
>> 3. One more question I have is, how to set the correct -t N for mpi based
>> application.
>> Let's say on the 2-cpu machine with 4 cores per CPU, should I use
>> -t
>> 4 or -t 8 if I rum my application with 2 mpi processes :
>> mpirun -n 2 myprogram
>>
>> Many thanks !
>>
>> Sent from Yahoo Mail on Android
>>
>> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley
>> <wh...@my...<mailto:wh...@my...>> wrote:
>> Hyperthreading is an optimization aimed at addressing poorly optimized
>> code. The idea is that most codes cannot drive the backend hardware
>> (ALU/FPU, etc) at the maximal rate, so if you duplicate registers you
>> can, amongst several threads, find enough work to keep the backend busy.
>>
>> ATLAS (or any optimized linear algebra library) already runs the FPU at
>> its maximal rate supported by the cache architecture after cache blocking.
>>
>> If you can already drive the backend at >90% of peak, then
>> hyperthreading can actually *lose* you performance, as the threads bring
>> conflicting data in the cache.
>>
>> It's usually not a night and day difference, but I haven't measured it
>> in the huge blocking era used by recent developer releases (it may be
>> worse there).
>>
>> My general recommendation is turn off hyperthreading for highly
>> optimized codes, and turn it on for relatively unoptimized codes.
>>
>> As to which core IDs correspond to the physical cores, that varies by
>> machine. On x86, you can use CPUID to determine that if you are
>> super-knowledgeable. I usually just turn it off in the BIOS, because I
>> don't like something that may thrash my cache running, even if it might
>> occasionally help :)
>>
>> Cheers,
>> Clint
>>
>> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
>> Hello,Would like go check if my understanding is correct for compiling
>> Atlas on a machine that has multiple CPUs and hyperthreading.
>> I have two types of machine:
>> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU,
>> each with 8 Cores, hyperthreaded, 2 threads per core
>> So when I compile Atlas, is it correct that I should use:
>> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID
>> is from 0-7 and 0-15).
>> That means the number 8 or 16 is the total cores on the machine, not
>> number of cores per CPU. Am I correct ?
>> I also read somewhere saying that Atlas supports Hyperthreading. What
>> does this mean ?
>> Does this mean:1. I do not need to disable hyperthreading in BIOS (no
>> performance difference whether it is enabled or disabled, as long as the
>> number of threads and affinity IDs are set correctly when compiling
>> Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32
>> ?
>> Thank you very much,
>> lixin
>>
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>
>>
>>
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...<mailto:Math-atlas-
>> de...@li...>
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>>
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...<mailto:Math-atlas-
>> de...@li...>
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...<mailto:Math-atlas-
>> de...@li...>
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>>
>>
>> --
>> Jeff Hammond
>> jef...@gm...<mailto:jef...@gm...>
>> http://jeffhammond.github.io/
>>
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>
>>
>>
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...<mailto:Math-atlas-
>> de...@li...>
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>> --
>> **********************************************************************
>> ** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley<
>> http://www.csc.lsu.edu/~whaley> **
>> **********************************************************************
>>
>>
>>
>>
>> --
>> Jeff Hammond
>> jef...@gm...<mailto:jef...@gm...>
>> http://jeffhammond.github.io/
>>
>>
>>
> --
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
> **********************************************************************
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>
-- 
Jeff Hammond
jef...@gm...
http://jeffhammond.github.io/

[atlas-devel] Fwd: Re: Compiling Atlas with hyperthreading

From: R. C. W. <rcw...@ls...> - 2017年06月29日 23:14:53

just realized my reply only went to Jeff.
-------- Forwarded Message --------
Subject: Re: [atlas-devel] Compiling Atlas with hyperthreading
Date: 2017年6月29日 17:22:05 -0500
From: R. Clint Whaley <rcw...@ls...>
To: Jeff Hammond <jef...@gm...>
Jeff,
Have you run a thread monitor to see if MKL is simply not using the 
hyperthreading regardless of whether it is on or off in BIOS?
You also may want to try something like LU.
Cheers,
Clint
On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using HT, at
> least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5% gain here is
> irrelevant and may be due to thermal effects (this box is in my cubicle,
> not an air-conditioned machine room).
> 
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8582699 844.4612765
> Intel MKL (parallel) 15360 15360 1536 0.8627163 840.1089930
> 
> HT on
> 
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8636520 839.1988073
> Intel MKL (parallel) 15360 15360 1536 0.8644268 838.4465853
> 
> I would be interested to see folks post data to support the argument
> against HT.
> 
> Jeff
> 
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <
> mat...@li...> wrote:
>>
>> Thank you very much for quick response. Just to check if my understanding
> is correct :
>>
>> 1. By turning off cpuid in bios, I only need to use -t N to build Atlas
> right?
>>
>> 2. The N in -t N is the total number of threads on the machine, not per
> Cpu right ?
>>
>> 3. One more question I have is, how to set the correct -t N for mpi based
> application.
>> Let's say on the 2-cpu machine with 4 cores per CPU, should I use -t
> 4 or -t 8 if I rum my application with 2 mpi processes :
>> mpirun -n 2 myprogram
>>
>> Many thanks !
>>
>> Sent from Yahoo Mail on Android
>>
>> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley
>> <wh...@my...> wrote:
>> Hyperthreading is an optimization aimed at addressing poorly optimized
>> code. The idea is that most codes cannot drive the backend hardware
>> (ALU/FPU, etc) at the maximal rate, so if you duplicate registers you
>> can, amongst several threads, find enough work to keep the backend busy.
>>
>> ATLAS (or any optimized linear algebra library) already runs the FPU at
>> its maximal rate supported by the cache architecture after cache blocking.
>>
>> If you can already drive the backend at >90% of peak, then
>> hyperthreading can actually *lose* you performance, as the threads bring
>> conflicting data in the cache.
>>
>> It's usually not a night and day difference, but I haven't measured it
>> in the huge blocking era used by recent developer releases (it may be
>> worse there).
>>
>> My general recommendation is turn off hyperthreading for highly
>> optimized codes, and turn it on for relatively unoptimized codes.
>>
>> As to which core IDs correspond to the physical cores, that varies by
>> machine. On x86, you can use CPUID to determine that if you are
>> super-knowledgeable. I usually just turn it off in the BIOS, because I
>> don't like something that may thrash my cache running, even if it might
>> occasionally help :)
>>
>> Cheers,
>> Clint
>>
>> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
>>> Hello,Would like go check if my understanding is correct for compiling
> Atlas on a machine that has multiple CPUs and hyperthreading.
>>> I have two types of machine:
>>> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU,
> each with 8 Cores, hyperthreaded, 2 threads per core
>>> So when I compile Atlas, is it correct that I should use:
>>> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID
> is from 0-7 and 0-15).
>>> That means the number 8 or 16 is the total cores on the machine, not
> number of cores per CPU. Am I correct ?
>>> I also read somewhere saying that Atlas supports Hyperthreading. What
> does this mean ?
>>> Does this mean:1. I do not need to disable hyperthreading in BIOS (no
> performance difference whether it is enabled or disabled, as long as the
> number of threads and affinity IDs are set correctly when compiling
> Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
>>> Thank you very much,
>>> lixin
>>>
>>>
>>>
>>>
> ------------------------------------------------------------------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>
>>>
>>>
>>> _______________________________________________
>>> Math-atlas-devel mailing list
>>> Mat...@li...
>>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>>>
>>
>>
>>
> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
>>
>>
> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Math-atlas-devel mailing list
>> Mat...@li...
>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>>
> 
> 
> 
> --
> Jeff Hammond
> jef...@gm...
> http://jeffhammond.github.io/
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
-- 
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: R. C. W. <rcw...@ls...> - 2017年06月29日 23:10:56

Yeah, if it can't get that perf w/o hyperthreading, its not fully tuned.
Back in day when I investigated HT, the problem really is in cache 
stomping, as two threads compete for the same cache. This makes the 
effects unpredictable (if the cache wasn't being fully utilized, maybe 
no effect, if you get lucky on the replacement, maybe tiny effect, and 
if you get unlucky, an truly bad dropoff).
You might try running an actual application, where you get a mix of 
kernels. This tends to stress the cache more, and can sometimes expose 
the downside of HT.
I remember finding slight speedup in some case leading me to think HT 
was helpful, but then I had performance collapses other places, which 
led to me to recommend turning it off (or using affinity to avoid it, 
like MKL is doing, if you can't turn it off) to maximize performance.
So, for instance, take LAPACK or ATLAS LU or QR (or your own version) 
and hook them up to the two BLAS. Does the non-MKL HT-liking kernel get 
anywhere close to MKL performance despite it's gemm looking as good with 
HT, or does it collapse its performance while MKL maintains?
My guess is the MKL group got the same "HT not-reliable, non-HT is" 
results, and that's why its behaving in this way.
Thanks for results!
Clint
On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
> Good catch. strace shows only 35 calls to clone in both cases with MKL. I didn’t know that MKL was doing these tricks.
> 
> However, I tested another DGEMM implementation that supports AVX2 and it uses all of the HTs and it performs on par with MKL, but only when HT is used.
> 
> Jeff
> 
> 
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 35
> 
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr 384 384 384 204.027 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 650.820 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 816.355 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 835.650 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 832.179 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 863.123 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 844.502 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 860.262 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 851.694 5.80e-18 PASS
> blis_dgemm_nn_rrr 3840 3840 3840 856.526 6.79e-18 PASS
> 
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr 384 384 384 161.331 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 437.967 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 545.498 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 616.338 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 606.650 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 611.153 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 603.314 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 631.292 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 625.833 5.80e-18 PASS
> 
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr 384 384 384 159.789 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 443.810 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 536.077 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 596.069 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 595.763 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 616.531 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 591.823 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 615.153 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 621.714 5.80e-18 PASS
> 
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr 384 384 384 189.615 8.27e-18 PASS
> blis_dgemm_nn_rrr 768 768 768 423.504 5.36e-18 PASS
> blis_dgemm_nn_rrr 1152 1152 1152 445.424 4.40e-18 PASS
> blis_dgemm_nn_rrr 1536 1536 1536 444.830 7.02e-18 PASS
> blis_dgemm_nn_rrr 1920 1920 1920 442.893 9.96e-18 PASS
> blis_dgemm_nn_rrr 2304 2304 2304 445.979 6.28e-18 PASS
> blis_dgemm_nn_rrr 2688 2688 2688 445.694 8.28e-18 PASS
> blis_dgemm_nn_rrr 3072 3072 3072 451.026 9.92e-18 PASS
> blis_dgemm_nn_rrr 3456 3456 3456 454.909 5.80e-18 PASS
> 
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...<mailto:rcw...@ls...>> wrote:
> Jeff,
> 
> Have you run a thread monitor to see if MKL is simply not using the hyperthreading regardless of whether it is on or off in BIOS?
> 
> You also may want to try something like LU.
> 
> Cheers,
> Clint
> 
> 
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using HT, at
> least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5% gain here is
> irrelevant and may be due to thermal effects (this box is in my cubicle,
> not an air-conditioned machine room).
> 
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8582699 844.4612765
> Intel MKL (parallel) 15360 15360 1536 0.8627163 840.1089930
> 
> HT on
> 
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine
> ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
> Intel MKL (parallel) 15360 15360 1536 0.8636520 839.1988073
> Intel MKL (parallel) 15360 15360 1536 0.8644268 838.4465853
> 
> I would be interested to see folks post data to support the argument
> against HT.
> 
> Jeff
> 
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <
> mat...@li...<mailto:mat...@li...>> wrote:
> 
> Thank you very much for quick response. Just to check if my understanding
> is correct :
> 
> 1. By turning off cpuid in bios, I only need to use -t N to build Atlas
> right?
> 
> 2. The N in -t N is the total number of threads on the machine, not per
> Cpu right ?
> 
> 3. One more question I have is, how to set the correct -t N for mpi based
> application.
> Let's say on the 2-cpu machine with 4 cores per CPU, should I use -t
> 4 or -t 8 if I rum my application with 2 mpi processes :
> mpirun -n 2 myprogram
> 
> Many thanks !
> 
> Sent from Yahoo Mail on Android
> 
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley
> <wh...@my...<mailto:wh...@my...>> wrote:
> Hyperthreading is an optimization aimed at addressing poorly optimized
> code. The idea is that most codes cannot drive the backend hardware
> (ALU/FPU, etc) at the maximal rate, so if you duplicate registers you
> can, amongst several threads, find enough work to keep the backend busy.
> 
> ATLAS (or any optimized linear algebra library) already runs the FPU at
> its maximal rate supported by the cache architecture after cache blocking.
> 
> If you can already drive the backend at >90% of peak, then
> hyperthreading can actually *lose* you performance, as the threads bring
> conflicting data in the cache.
> 
> It's usually not a night and day difference, but I haven't measured it
> in the huge blocking era used by recent developer releases (it may be
> worse there).
> 
> My general recommendation is turn off hyperthreading for highly
> optimized codes, and turn it on for relatively unoptimized codes.
> 
> As to which core IDs correspond to the physical cores, that varies by
> machine. On x86, you can use CPUID to determine that if you are
> super-knowledgeable. I usually just turn it off in the BIOS, because I
> don't like something that may thrash my cache running, even if it might
> occasionally help :)
> 
> Cheers,
> Clint
> 
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,Would like go check if my understanding is correct for compiling
> Atlas on a machine that has multiple CPUs and hyperthreading.
> I have two types of machine:
> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU,
> each with 8 Cores, hyperthreaded, 2 threads per core
> So when I compile Atlas, is it correct that I should use:
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID
> is from 0-7 and 0-15).
> That means the number 8 or 16 is the total cores on the machine, not
> number of cores per CPU. Am I correct ?
> I also read somewhere saying that Atlas supports Hyperthreading. What
> does this mean ?
> Does this mean:1. I do not need to disable hyperthreading in BIOS (no
> performance difference whether it is enabled or disabled, as long as the
> number of threads and affinity IDs are set correctly when compiling
> Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
> Thank you very much,
> lixin
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...<mailto:Mat...@li...>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...<mailto:Mat...@li...>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...<mailto:Mat...@li...>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
> 
> 
> --
> Jeff Hammond
> jef...@gm...<mailto:jef...@gm...>
> http://jeffhammond.github.io/
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...<mailto:Mat...@li...>
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
> --
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley<http://www.csc.lsu.edu/~whaley> **
> **********************************************************************
> 
> 
> 
> 
> --
> Jeff Hammond
> jef...@gm...<mailto:jef...@gm...>
> http://jeffhammond.github.io/
> 
> 
-- 
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: Jeff H. <jef...@gm...> - 2017年06月29日 22:16:03

I don't see any negative impact from using HT relative to not using HT, at
least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5% gain here is
irrelevant and may be due to thermal effects (this box is in my cubicle,
not an air-conditioned machine room).
$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine
./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
 BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
Intel MKL (parallel) 15360 15360 1536 0.8582699 844.4612765
Intel MKL (parallel) 15360 15360 1536 0.8627163 840.1089930
HT on
$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine
./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
 BLAS_NAME dim1 dim2 dim3 seconds Gflop/s
Intel MKL (parallel) 15360 15360 1536 0.8636520 839.1988073
Intel MKL (parallel) 15360 15360 1536 0.8644268 838.4465853
I would be interested to see folks post data to support the argument
against HT.
Jeff
On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <
mat...@li...> wrote:
>
> Thank you very much for quick response. Just to check if my understanding
is correct :
>
> 1. By turning off cpuid in bios, I only need to use -t N to build Atlas
right?
>
> 2. The N in -t N is the total number of threads on the machine, not per
Cpu right ?
>
> 3. One more question I have is, how to set the correct -t N for mpi based
application.
> Let's say on the 2-cpu machine with 4 cores per CPU, should I use -t
4 or -t 8 if I rum my application with 2 mpi processes :
> mpirun -n 2 myprogram
>
> Many thanks !
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley
> <wh...@my...> wrote:
> Hyperthreading is an optimization aimed at addressing poorly optimized
> code. The idea is that most codes cannot drive the backend hardware
> (ALU/FPU, etc) at the maximal rate, so if you duplicate registers you
> can, amongst several threads, find enough work to keep the backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs the FPU at
> its maximal rate supported by the cache architecture after cache blocking.
>
> If you can already drive the backend at >90% of peak, then
> hyperthreading can actually *lose* you performance, as the threads bring
> conflicting data in the cache.
>
> It's usually not a night and day difference, but I haven't measured it
> in the huge blocking era used by recent developer releases (it may be
> worse there).
>
> My general recommendation is turn off hyperthreading for highly
> optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that varies by
> machine. On x86, you can use CPUID to determine that if you are
> super-knowledgeable. I usually just turn it off in the BIOS, because I
> don't like something that may thrash my cache running, even if it might
> occasionally help :)
>
> Cheers,
> Clint
>
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> > Hello,Would like go check if my understanding is correct for compiling
Atlas on a machine that has multiple CPUs and hyperthreading.
> > I have two types of machine:
> > - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU,
each with 8 Cores, hyperthreaded, 2 threads per core
> > So when I compile Atlas, is it correct that I should use:
> > -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID
is from 0-7 and 0-15).
> > That means the number 8 or 16 is the total cores on the machine, not
number of cores per CPU. Am I correct ?
> > I also read somewhere saying that Atlas supports Hyperthreading. What
does this mean ?
> > Does this mean:1. I do not need to disable hyperthreading in BIOS (no
performance difference whether it is enabled or disabled, as long as the
number of threads and affinity IDs are set correctly when compiling
Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
> > Thank you very much,
> > lixin
> >
> >
> >
> >
------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> >
> >
> >
> > _______________________________________________
> > Math-atlas-devel mailing list
> > Mat...@li...
> > https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>
> >
>
>
>
------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>
>
>
------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>
--
Jeff Hammond
jef...@gm...
http://jeffhammond.github.io/

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: lixin c. <lix...@ya...> - 2017年06月29日 15:12:02

Thank you very much for quick response. Just to check if my understanding is correct :
1. By turning off cpuid in bios, I only need to use -t N to build Atlas right?
2. The N in -t N is the total number of threads on the machine, not per Cpu right ?
3. One more question I have is, how to set the correct -t N for mpi based application.   Let's say on the 2-cpu machine with 4 cores per CPU, should I use -t 4 or -t 8 if I rum my application with 2 mpi processes :   mpirun -n 2 myprogram 
Many thanks !
Sent from Yahoo Mail on Android 
 
 On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley<wh...@my...> wrote: Hyperthreading is an optimization aimed at addressing poorly optimized 
code. The idea is that most codes cannot drive the backend hardware 
(ALU/FPU, etc) at the maximal rate, so if you duplicate registers you 
can, amongst several threads, find enough work to keep the backend busy.
ATLAS (or any optimized linear algebra library) already runs the FPU at 
its maximal rate supported by the cache architecture after cache blocking.
If you can already drive the backend at >90% of peak, then 
hyperthreading can actually *lose* you performance, as the threads bring 
conflicting data in the cache.
It's usually not a night and day difference, but I haven't measured it 
in the huge blocking era used by recent developer releases (it may be 
worse there).
My general recommendation is turn off hyperthreading for highly 
optimized codes, and turn it on for relatively unoptimized codes.
As to which core IDs correspond to the physical cores, that varies by 
machine. On x86, you can use CPUID to determine that if you are 
super-knowledgeable. I usually just turn it off in the BIOS, because I 
don't like something that may thrash my cache running, even if it might 
occasionally help :)
Cheers,
Clint
On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,Would like go check if my understanding is correct for compiling Atlas on a machine that has multiple CPUs and hyperthreading.
> I have two types of machine:
> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU, each with 8 Cores, hyperthreaded, 2 threads per core
> So when I compile Atlas, is it correct that I should use:
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID is from 0-7 and 0-15).
> That means the number 8 or 16 is the total cores on the machine, not number of cores per CPU. Am I correct ?
> I also read somewhere saying that Atlas supports Hyperthreading. What does this mean ?
> Does this mean:1. I do not need to disable hyperthreading in BIOS (no performance difference whether it is enabled or disabled, as long as the number of threads and affinity IDs are set correctly when compiling Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
> Thank you very much,
> lixin
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
> 
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

Re: [atlas-devel] Compiling Atlas with hyperthreading

From: R. C. W. <wh...@my...> - 2017年06月29日 14:20:34

Hyperthreading is an optimization aimed at addressing poorly optimized 
code. The idea is that most codes cannot drive the backend hardware 
(ALU/FPU, etc) at the maximal rate, so if you duplicate registers you 
can, amongst several threads, find enough work to keep the backend busy.
ATLAS (or any optimized linear algebra library) already runs the FPU at 
its maximal rate supported by the cache architecture after cache blocking.
If you can already drive the backend at >90% of peak, then 
hyperthreading can actually *lose* you performance, as the threads bring 
conflicting data in the cache.
It's usually not a night and day difference, but I haven't measured it 
in the huge blocking era used by recent developer releases (it may be 
worse there).
My general recommendation is turn off hyperthreading for highly 
optimized codes, and turn it on for relatively unoptimized codes.
As to which core IDs correspond to the physical cores, that varies by 
machine. On x86, you can use CPUID to determine that if you are 
super-knowledgeable. I usually just turn it off in the BIOS, because I 
don't like something that may thrash my cache running, even if it might 
occasionally help :)
Cheers,
Clint
On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,Would like go check if my understanding is correct for compiling Atlas on a machine that has multiple CPUs and hyperthreading.
> I have two types of machine:
> - 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU, each with 8 Cores, hyperthreaded, 2 threads per core
> So when I compile Atlas, is it correct that I should use:
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID is from 0-7 and 0-15).
> That means the number 8 or 16 is the total cores on the machine, not number of cores per CPU. Am I correct ?
> I also read somewhere saying that Atlas supports Hyperthreading. What does this mean ?
> Does this mean:1. I do not need to disable hyperthreading in BIOS (no performance difference whether it is enabled or disabled, as long as the number of threads and affinity IDs are set correctly when compiling Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
> Thank you very much,
> lixin
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Math-atlas-devel mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
>

[atlas-devel] Compiling Atlas with hyperthreading

From: lixin c. <lix...@ya...> - 2017年06月29日 03:32:37

Hello,Would like go check if my understanding is correct for compiling Atlas on a machine that has multiple CPUs and hyperthreading.
I have two types of machine:
- 2 CPU, each with 4 Core, hyperthreaded, 2 threads per core- 2 CPU, each with 8 Cores, hyperthreaded, 2 threads per core
So when I compile Atlas, is it correct that I should use:
-tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,....15 (assuming the affinity ID is from 0-7 and 0-15).
That means the number 8 or 16 is the total cores on the machine, not number of cores per CPU. Am I correct ?
I also read somewhere saying that Atlas supports Hyperthreading. What does this mean ?
Does this mean:1. I do not need to disable hyperthreading in BIOS (no performance difference whether it is enabled or disabled, as long as the number of threads and affinity IDs are set correctly when compiling Atlas)2. Or I can make use of the hyperthread, that is, -tl 16 and -tl 32 ?
Thank you very much,
lixin

Re: [atlas-devel] Number of threads selection

From: R. C. W. <rcw...@ls...> - 2017年03月20日 13:47:19

So far, it still must be compile-time chosen. We need it for affinity, 
which is necessary when OS does a poor job of managing the threads.
Eventually I may be able to support run-time choice for the OpenMP 
implementation, which has its own scheduler (though in the cases where 
ATLAS used affinity in past it got horrible performance). Right now, I 
have not yet gotten time to look at that part of the threading package, 
as I'm in the middle of big kernel redesign still.
Regards,
Clint
On 03/19/2017 12:20 PM, José Luis García Pallero wrote:
> Hello:
>
> I've not used ATLAS for a while and I would like to ask if the library
> has yet the ability to select the number of execution thread at
> execution time instead of at compilation time. I remember that this
> feature was discussed in the past, but I'm not sure if finally it was
> considered for the future
>
> Thanks
>
-- 
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************

[atlas-devel] Number of threads selection

From: José L. G. P. <jgp...@gm...> - 2017年03月19日 17:21:06

Hello:
I've not used ATLAS for a while and I would like to ask if the library
has yet the ability to select the number of execution thread at
execution time instead of at compilation time. I remember that this
feature was discussed in the past, but I'm not sure if finally it was
considered for the future
Thanks
-- 
*****************************************
José Luis García Pallero
jgp...@gm...
(o<
/ / \
V_/_
Use Debian GNU/Linux and enjoy!
*****************************************