
math-atlas-results — List specifically for timing results


From: <rw...@cs...> - 2004-07-24 21:33:38
Guys,
I just released 3.7.8, where I get a little faster for double precision on
the Efficeon. Here's an updated table:
 PEAK SSE2 dMM-ic dMM-oc dGEMM
 ==== ========= =========== =========== ===========
1.6Ghz Ham64 3200 3051(98%) 2984(93/98%) 2937(92/98%) 2805(88/96%)
2.8Ghz P4E 5600 5178(92%) 4492(80/87%) 4425(79/99%) 4303(77/97%)
1.0Ghz PIII 1000 -------- 933(93%) 840(84/90%) 760(76/90%)
1.0Ghz Eff3.7.8 2000 1790(90%) 1595(80/89%) 1371(69/86%) 1280(64/93%)
 " asymptotic 995(49/72%)
1.0Ghz Eff3.7.7 2000 1790(90%) 1514(76/85%) 1309(65/86%) 1201(60/92%)
 " asymptotic 970(49/74%)
As you can see, the in-cache numbers are now quite good. Out-of-cache,
as might be expected on this arch, goes from bad to terrible.
As before, the main problem on this arch is the fact that large problems
do not perform as well as small problems. For all other archs, the peak
dGEMM number reported above is essentially the same as the asymptotic DGEMM
speed, but for the Efficeon, as you see, it is way under.
Cheers,
Clint
From: <rw...@cs...> - 2004-07-18 14:32:16
Guys,
In trying to understand the Efficeon results, I built the following
table people might find interesting. It shows the performance of an
SSE2 all-register code (SSE2), ATLAS's best matmul kernel run in-cache (dMM-ic),
the same kernel run out-of-cache (dMM-oc), and full DGEMM times (dGEMM).
Then, I have % of peak, and % of the preceding column.
 PEAK SSE2 dMM-ic dMM-oc dGEMM
 ==== ========= =========== =========== ===========
1.6Ghz Ham64 3200 3051(98%) 2984(93/98%) 2937(92/98%) 2805(88/96%)
2.8Ghz P4E 5600 5178(92%) 4492(80/87%) 4425(79/99%) 4303(77/97%)
1.0Ghz PIII 1000 -------- 933(93%) 840(84/90%) 760(76/90%)
1.0Ghz Eff 2000 1790(90%) 1514(76/85%) 1309(65/86%) 1201(60/92%)
In building a table like this, you are looking for which pieces you are
lacking. In the Efficeon case, it looks like every step other than
the last is slightly low. Unfortunately, the strongest correlation I found
was with the time I have spent tuning the particular arch :)
Cheers,
Clint
P.S.: Efficeon timings at:
 http://math-atlas.sourceforge.net/timing/Efficeon/
From: <rw...@cs...> - 2004-03-21 01:50:18
I've posted P4E/P4 SSE3/SSE2 timings at:
 http://math-atlas.sourceforge.net/timing/3_7_3/index.html
Cheers,
Clint
From: <rw...@cs...> - 2003-12-23 15:45:53
Are available at:
 http://math-atlas.sourceforge.net/timing/
Cheers,
Clint
From: R C. W. <rw...@cs...> - 2003-09-14 02:33:43
Guys,
I've posted 3.5.10 timings comparing all precisions at:
 http://math-atlas.sourceforge.net/timing/3_5_10/index.html
Regards,
Clint
From: R C. W. <rw...@cs...> - 2003-07-16 16:08:38
Guys,
ATLAS 3.5.7 is out. The big news in this release is a *much* improved
[D,S]SYRK. In the past, LU ran much faster than Cholesky, due to our
poor SYRK performance. With the beefed up SYRK, Cholesky now runs at
roughly the same speed as LU. So, if you are using Cholesky or SYRK, you
will want to grab 3.5.7.
I have posted some timings to:
 http://math-atlas.sourceforge.net/timing/Syrk3.5.7.html
Regards,
Clint
From: R C. W. <rw...@cs...> - 2003-07-11 20:48:16
From: R C. W. <rw...@cs...> - 2003-06-28 03:56:43
After an ugly time spent wrestling with our good friend the GUI, I have
some pretty Opteron performance graphs available at:
 http://math-atlas.sourceforge.net/timing/OptPerf.html
The quick overview is a sustained 88% of theoretical peak for uniprocessor
DGEMM, and 85% for threaded. This machine is a monster.
Cheers,
Clint
From: R C. W. <rw...@cs...> - 2003-06-17 23:10:08
I've done some timings on my 600MHz Athlon classic, comparing 3.5.4 with 3.4.1,
and once again I underwhelm the crowds. In these timings I did not see any
greater SGEMM performance, but at least sLU is no longer _slower_.
Theoretically, 3.5.4 should at least have better cleanup, so small-problem
computations *should* benefit, but you don't really see it here. All in all,
3.5.4 is not much of an improvement for Athlons :< Not that surprising,
considering how good Julian's stuff was in the first place, I guess . . .
Cheers,
Clint
VERS OP 100 200 300 400 500 600 700 800 900 1000
===== === ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
3.4.1 sMM 857 884 900 931 909 960 966 966 979 962
3.5.4 sMM 740 884 958 931 943 1054 966 975 985 971
3.4.1 sLU 362 551 629 680 748 749 761 775 783 803
3.5.4 sLU 370 541 636 686 724 757 770 775 809 803
3.4.1 dMM 727 784 879 873 893 960 915 914 941 935
3.5.4 dMM 727 784 900 873 893 982 915 948 947 930
3.4.1 dLU 325 464 558 591 612 664 682 689 736 732
3.5.4 dLU 328 475 565 600 603 654 703 682 725 716
3.4.1 cMM 870 900 939 948 925 954 952 942 955 947
3.5.4 cMM 857 900 939 931 935 965 953 959 969 966
3.4.1 cLU 492 618 727 749 784 794 809 827 838 844
3.5.4 cLU 477 630 727 741 784 805 816 822 856 846
3.4.1 zMM 779 800 847 867 870 872 888 883 900 893
3.5.4 zMM 750 835 882 853 862 900 897 890 906 894
3.4.1 zLU 435 523 610 620 680 694 709 718 736 757
3.5.4 zLU 435 542 622 650 666 706 709 722 741 764
From: R C. W. <rw...@cs...> - 2003-05-04 15:10:02
Guys,
In order to allow for dynamically linked libs, I've been trying to get
performance similar to Julian's athlon code with code written in gas
assembler. After a whole lot of work, I got a kernel that is not
quite as efficient in-cache as Julian's kernel, but seemed to tie or
beat it out-of-cache for all precisions except double. Even for double,
however, my new kernels could be used for cleanup.
This took an entire week of evenings and two weekends of intensive effort,
capped by the posting of 3.5.2 to sourceforge. I then ran the timing
to see what kind of speedup I got. The short answer: not really any :-{
I ran problems between 100-1000 on my 600Mhz Athlon classic, for all four
precisions, using matrix multiply and LU factorization.
For double, the addition of cleanup did help slightly. I was most excited
about single, where use of a larger blocking factor (60 vs. 30) allowed
me to obtain a slightly higher peak matmul. However, the increase in NB
caused LU to run *slower*, so maybe I'll back out this change for 3.5.3.
For single precision complex, again there was a larger NB for increased 
matmul peak. The LU times appear to run within clock speed of each other.
For double precision complex, there appears to have been a microscopic
speedup. 
Yay,
Clint
Timings for a 600Mhz Athlon classic, with 512K cache, but timers flushing 2MB:
VERS OP 100 200 300 400 500 600 700 800 900 1000
===== === ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
3.4.1 sMM 800 897 913 931 943 982 953 966 959 966
3.5.2 sMM 732 897 928 931 962 982 980 957 978 976
3.4.1 sLU 373 546 636 687 724 757 761 784 790 784
3.5.2 sLU 370 492 587 626 666 705 729 741 771 765
3.4.1 dMM 714 822 887 883 877 939 903 906 935 935
3.5.2 dMM 723 789 914 898 909 900 915 923 947 930
3.4.1 dLU 321 462 550 591 589 636 659 696 719 716
3.5.2 dLU 332 465 565 586 622 685 678 696 719 709
3.4.1 cMM 857 900 939 914 917 944 939 944 951 946
3.5.2 cMM 822 886 939 931 943 960 959 954 969 960
3.4.1 cLU 473 664 727 758 775 783 795 827 834 844
3.5.2 cLU 473 637 735 726 757 817 795 822 830 846
3.4.1 zMM 723 800 864 883 870 882 880 881 891 884
3.5.2 zMM 833 823 864 883 870 900 891 892 903 893
3.4.1 zLU 420 510 573 632 687 681 703 711 736 755
3.5.2 zLU 417 547 616 620 640 689 703 726 739 764
From: Mikhail K. <ku...@fr...> - 2002-08-05 18:58:03
1) I've installed atlas-3.4.1 under Linux RH 7.2 for an Athlon XP 1800+
on a Tyan S2460 2-CPU SMP motherboard, using gcc-3.1.
(I did this because the pre-built Athlon libraries do not
include multi-threaded libraries.)
 But the result of the installation looks strange:
/ATLAS/include/Linux_ATHLONSSE1_II/cacheedge.h was set to
98304! But it's known that the recommended optimal value must be
217088!
make xdfindce says that all is "up-to-date" and doesn't change
the cacheedge.h value :-(
What should I do in this strange situation?
2) On the pre-built Atlas 3.4.1 for ATHLON (I used ifc 5.0, added
the g77 run-time library at link time to resolve the references
in the libraries pre-built for g77, and simply forced some ld
errors to be suppressed; as a result the executable module works
successfully), I got more than 1.5 GFLOPS on Linpack n=1000
(with calls to the LAPACK Atlas routines).
( > 1 FLOP per cycle)
Is that possible, taking into account that the dFPU test during
installation of atlas 3.4.1 says that "Separate multiply and add
instruction w/3 cycle pipeline" gives 1670 MFLOPS max
(it's known that in the Linpack test daxpy must be the limiting factor)?
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow 
From: Andrew S. <an...@di...> - 2002-05-24 03:39:58
Hi,
I saw a previous query in the list archives about the Intel MKL. I
would be interested if there was any follow-up on this. I am also
interested if anyone knows about performance with small matrices and
ATLAS, especially compared to what I think was the SIMD-optimised
Intel small matrix library (i.e. 4x4...6x6 etc.).
many thanks,
andrew slater
p.s. I am not on this list and would appreciate it if any replies
could also be sent to my email address.
From: Michael Z. <mic...@po...> - 2002-02-28 08:43:41
Hi!
Does anybody know how the Intel MKL 5.1 for Linux compares to ATLAS on a
Pentium 4? I'm thinking about using it with MPB
(http://ab-initio.mit.edu/mpb/). Its author says
>MPB requires high performance for multiplication of special
>"tall, thin" matrices, which ATLAS has special code for, whereas many
>vendor codes are tuned for the squarish matrices (that most often appear
>in benchmarks).
So will it be worth trying mkl? Or do I simply confuse two things: optimized
source code for linear algebra and special machine code generation for the
p4-series?
Thanks in advance
Michael Zedler
From: Wilkens, T. <wi...@ui...> - 2001-10-22 16:34:43
Hi Everyone,
 I'm trying to build a matrix * matrix single precision ( AxB ) optimized
kernel for the Athlon. However, I'm having problems getting high
throughput. I thought maybe someone here could help me out. I'm using SSE.
The kernel of my code.. involves multiplying a 64x64 submatrix of A
times a 64x64 submatrix of B. The submatrices are prefetched into
cache.. and this kernel should fly at the speed of light. Both
submatrix A and B are in L1. My efforts to date are just for testing 
purposes, so the blocking factor of 64 is likely to change. But for those
interested.. I have also tested blocking factors of NB = 36 and NB = 48.
I multiply 4 rows of submatrix A at a time against a column of
submatrix B. Then I move to the next 4 rows of submatrix A... and so
on. The entire multiplication of submatrix A times a "single" column
of B is completely unrolled. Then I loop over the columns of B.
It's pivotal that I get "stellar performance" in the dot product 4
rows of submatrix A upon the 64 floats in the column of B submatrix. 
The data is arranged as such:
** register "edi" points to the first element of submatrix A
** register "esi" points to the column of submatrix B
Notes:
======
I bias the edi and esi registers by 128 bytes.. so I can sweep through
the entire 64 floats (256 bytes) of each row of A. In this format:
[edi-128] == address of first element of first row of submatrix A
[edi+112] == address of last element of first row of submatrix A
SSE uses xmm registers and each contains 4 floats.. or 16 bytes. So I
load 16 bytes at a time into the xmm registers.
Ok.. the code goes something like this:
=========================================================================
.
..
...
add edi,128
add esi,128
mov eax,256 ; size in bytes of a single row of submatrix A
mov ebx,768 ; size in bytes of 3 rows of submatrix A
xorps xmm5,xmm5
xorps xmm6,xmm6
xorps xmm7,xmm7
xorps xmm0,xmm0 ; ia32 SSE has only xmm0-xmm7, so xmm0 is the 4th accumulator
movaps xmm1,XMMWORD PTR [edi-128] ; First 4 floats of row 1 of A
movaps xmm2,XMMWORD PTR [edi+eax-128] ; First 4 floats of row 2 of A
movaps xmm3,XMMWORD PTR [edi+eax*2-128]; First 4 floats of row 3 of A
movaps xmm4,XMMWORD PTR [edi+ebx-128] ; First 4 floats of row 4 of A
mulps xmm1,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 1 with col
mulps xmm2,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 2 with col
mulps xmm3,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 3 with col
mulps xmm4,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 4 with col
addps xmm5,xmm1 ; accumulate dot product of row 1 with col
addps xmm6,xmm2 ; accumulate dot product of row 2 with col
addps xmm7,xmm3 ; accumulate dot product of row 3 with col
addps xmm0,xmm4 ; accumulate dot product of row 4 with col
; WE HAVE HANDLED 4 FLOATS now.. so we must load xmm registers 
; with data 16 bytes in front of our previous accesses
movaps xmm1,XMMWORD PTR [edi-112]
movaps xmm2,XMMWORD PTR [edi+eax-112]
movaps xmm3,XMMWORD PTR [edi+eax*2-112]
movaps xmm4,XMMWORD PTR [edi+ebx-112]
mulps xmm1,XMMWORD PTR [esi-112]
mulps xmm2,XMMWORD PTR [esi-112]
mulps xmm3,XMMWORD PTR [esi-112]
mulps xmm4,XMMWORD PTR [esi-112]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4
movaps xmm1,XMMWORD PTR [edi-96]
movaps xmm2,XMMWORD PTR [edi+eax-96]
movaps xmm3,XMMWORD PTR [edi+eax*2-96]
movaps xmm4,XMMWORD PTR [edi+ebx-96]
mulps xmm1,XMMWORD PTR [esi-96]
mulps xmm2,XMMWORD PTR [esi-96]
mulps xmm3,XMMWORD PTR [esi-96]
mulps xmm4,XMMWORD PTR [esi-96]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4
movaps xmm1,XMMWORD PTR [edi-80]
movaps xmm2,XMMWORD PTR [edi+eax-80]
movaps xmm3,XMMWORD PTR [edi+eax*2-80]
movaps xmm4,XMMWORD PTR [edi+ebx-80]
mulps xmm1,XMMWORD PTR [esi-80]
mulps xmm2,XMMWORD PTR [esi-80]
mulps xmm3,XMMWORD PTR [esi-80]
mulps xmm4,XMMWORD PTR [esi-80]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4
.
..
...
=========================================================================
I'm not getting stellar performance from each 12-SSE-instruction package
above. Each package contains 32 floating point operations, and it's
taking, I believe, 13 cycles to execute each package.
Consequently, the maximum throughput in FLOPS/CYCLE would be 32/13 =
2.46 FLOPS/CYCLE. This is much too low. Does anybody see anything
wrong with how I've set up these instructions? I do realize that the
first move instruction in each package is 4 bytes.. the other 3 are 5
bytes.. which means I cannot decode more than 1 in any given clock
cycle. Is this a problem, or can the Athlon only decode AND EXECUTE 1
movaps instruction per clock cycle?
Any and all help is greatly appreciated. I do not have a P4 and am not
familiar with its capabilities.. though I wonder how many of the following
instructions:
movaps
mulps
addps
the P4 can execute in a given clock cycle.
Thanks for any assistance...
tim wilkens
BTW.. this message has also been posted on comp.lang.asm here:
http://groups.google.com/groups?hl=en&group=comp.lang.asm.x86&selm=7b1e74d1.0110211952.146ca68a%40posting.google.com
From: R C. W. <rw...@cs...> - 2001-10-19 02:36:20
Guys,
The wheels are still turning to get out the 3.3.8 release. I'm sending some
pre-release timings, just to give hope to the Athlon users out there.
Julian Ruhe has submitted an assembly-language kernel that improves ATLAS's
double precision Athlon performance by over 25%. Just to whet your appetite,
I include some timings using his new kernel below. I'm comparing my
development tree using his kernel (mislabeled as 3.3.8) against an old release
I had sitting around on the machine, 3.3.2. 3.3.2 will have the same DGEMM
performance as the present release, 3.3.7. My development tree adds no
performance wins over 3.3.7, so the whole difference you see is Julian's kernel.
The kernels are written in nasm assembly, and will be available in source form
for the curious in the next release. This is why we can't just give you the
kernel to add to your 3.3.7 stuff: I had to add additional kernel support
for non-C contributions (our other assembly routines used gnu assembler, and
thus could be handled by gcc).
The numbers are for a 1.2Ghz Athlon (pre-Athlon4) with DDR memory. The kernel
performs similarly for older systems (you get about the same % of peak on
my 600Mhz Athlon classic, roughly 920Mflop) . . .
And before someone asks, yes, this is getting the right answer as well :->
Cheers,
Clint
3.3.2: Old ATLAS release on same machine
3.3.8: My development tree + Julian's Athlon kernel
1.2Ghz Athlon (2.4Gflop peak):
 100 200 300 400 500 600 700 800 900 1000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 1136.4 1271.8 1388.6 1280.0 1315.8 1393.5 1372.0 1383.8 1429.4 1418.4
3.3.8 dMM 1315.8 1377.8 1567.7 1600.0 1666.7 1728.0 1759.0 1735.6 1778.0 1785.7
3.3.2 dLU 676.0 841.3 914.1 982.7 970.8 1027.3 1054.3 1065.7 1116.3 1110.3
3.3.8 dLU 698.6 917.8 1052.5 1064.7 1147.7 1232.7 1202.2 1263.0 1312.4 1359.5
 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.8 dMM 1818.9 1799.3 1824.5 1842.7 1820.3 1823.3 1855.6 1832.7 1840.8 1848.7
3.3.8 dLU 1387.1 1406.4 1451.8 1472.1 1532.0 1536.0 1455.5 1446.2 1437.2 1473.8
From: Camm M. <ca...@en...> - 2001-10-12 01:30:53
Greetings! Two items:
1) In trying to clean up the warnings on the l2 SSE kernels, I'm
 finding that many of them only appear when using the 2.96 (broken)
 gcc version on torc. 2.95.x and 3.0.2 don't appear to show these
 warnings, which refer to macro redefinitions, but I have only
 tested 3.0.2 on non-i386 machines. In any case, my code includes
 the same header multiple times, between each of which a few key
 macros are changed. And certain of the macros in the header file
 thus multiply included give the redefinition warning with 2.96,
 while others adjacently defined do not. No apparent rhyme or
 reason. I can certainly work around with undef's, or some moderate
 rewriting, but I'd like to get a minimal fix in first, so I'm
 wondering whether 2.96 is faulty in this respect and should be
 ignored. As long as I've used these macros, redefining the same
 macro to the same value never produces a warning, but maybe I've
 been relying on non-standard cpp all this time.
2) I've gotten interested in band matrices recently, and am wondering
 how atlas handles these. Take the extreme case of a diagonal
 matrix, 'band packed' so that the diagonal elements are contiguous in
 memory. For s{tsg}bmv, there seems to be no way the basic atlas
 code can hand this off to a kernel without moving the memory
 around. But this would be an easily vectorizeable operation.
 Should we have a 4rth l2 kernel to deal with band matrices?
Take care,
-- 
Camm Maguire			 			ca...@en...
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah
From: R C. W. <rw...@cs...> - 2001-09-27 03:03:07
Guys,
I include below some timings on a 733 Mhz G4e (access courtesy of SourceForge
compile farm). For quite a while now, Apple's "half the Mhz, half again the
price" strategy has eluded me, but this machine ought to at least reduce the
screaming fits of its laugh-test failure to at most a few furtive chuckles.
Essentially, it is still not going heads up against either the Athlon or
P4 (and if anyone hits me with the clock-for-clock crap, I will point out that
clock for clock the original Power chip is still the champ), but I think
it is cleaning the floor with the PIII, for instance (let's not mention 
price, though, eh?).
In single precision, its results are roughly 75% of a P4 clocked at twice
its speed (before you sneer with the "easy to be fast at low Mhz", I'll remind
you it is doing this with good ol' SDRAM, so that's pretty impressive), and it
almost doubles the performance of a 933Mhz PIII . . .
These results are much crappier on an original G4. Obviously, the extra level
of cache can't be hurting, but perhaps the greater instruction bandwidth,
etc., are helping as well.
I found it interesting to compare these timings to the ones I have previously
posted for the P4 and PIII. Note that gemm timings can be compared pretty
directly (no real change from 3.3.0 till 3.3.7), but the LU timings cannot
(3.3.7 has some speedups over 3.3.0) . . .
Cheers,
Clint
ATLAS 3.3.7 on 733Mhz G4e, 256K L2, 1MB L3
 100 200 300 400 500 600 700 800 900 1000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
ATL dLU 386.8 480.7 513.0 580.7 594.3 684.9 671.8 668.7 703.8 724.1
ATL dMM 416.7 687.7 771.4 914.3 757.6 919.1 879.5 922.5 928.7 943.4
ATL sLU 437.3 631.0 897.8 982.8 1109.4 1307.5 1343.7 1482.7 1566.4 1586.1
ATL sMM 1428.6 1600.0 1800.0 2560.0 2500.0 2400.0 2450.0 3011.8 2803.8 2631.6
 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
ATL dLU 733.3 758.7 786.6 799.7 809.0 819.4 833.0 838.5 840.4 837.0
ATL sLU 1744.4 1846.8 1922.1 1993.0 2058.4 2118.3 2167.8 2206.0 2261.3 2275.0
ATL sMM 2953.8 2814.4 3022.9 2858.8 3053.4 2937.4 3061.8 2936.7 3081.0 2995.0
From: R C. W. <rw...@cs...> - 2001-09-13 21:21:51
<PRE>
Date: 2001-09-11 22:24:15
Vers: ATLAS 3.3.2 and 3.3.4
Guys,
OK, trying to set a record for number of releases, I've just posted 3.3.5.
This gets rid of trtri out of lapack, improves IA64 complex performance,
and fixes a bug in the complex Cholesky tester.
I have figured out what was going on that I got no speedup with my new
kernel on the IA64. If you recall, 3.3.3 (which started all this quick
release madness) was supposed to be a IA64-improving release, due to IA64
prefetch, but when I timed it on machines I wasn't NDAd on, I got no
performance improvement. Even though it used the same compiler as my
NDAd machine, I got strange compiler problems as well.
Turns out the problem is that on the TestDrive machine, they have two different
compilers, and my 3.3.3 build was using a mixture of RedHat's baaaad gcc, and
the much better gcc 3.0.
So, this is the first performance hint for IA64: make sure you use gcc 3.0
everywhere in your ATLAS install: change all C compilers defined in your
Make.&lt;arch&gt; to explicitly reference it, and change all gcc refs in
ATLAS/tune/blas/gemm/CASES/?cases.flg as well.
Once this was done, I got the performance shown below. What we see is that
prefetch does not make a big performance improvement (3.3.2 and 3.3.4 are
almost the same speed asymptotically), but that the improved cleanup code
I wrote definitely helps small problems.
Prefetch definitely helps the Level 1 and 2 BLAS performance; the bad news
is that even the new performance is signally poor. This is because we have
no IA64-specific kernels for Level 1/2; the improvement is simply using the
best general kernel with prefetching enabled . . .
The timings on a 800Mhz IA64 are included below, all for double precision.
I do not have access to non-NDAd MKL; if anyone does, I'd love to see some
comparisons . . .
Cheers,
Clint
Timings for double precision, comparing ATLAS 3.3.2 vs. 3.3.4, all on a
800Mhz IA64. The performance of 3.3.4 is same as 3.3.5 for double precision
(3.3.5 is faster for complex; complex timings are not shown).
 100 200 300 400 500 600 700 800 900 1000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 1024.0 1512.4 1783.7 1846.1 1896.3 2076.8 1973.2 2084.6 2102.8 2104.8
3.3.4 dMM 1061.1 1524.1 1803.1 1927.5 1969.2 2029.2 2081.6 2072.3 2126.8 2135.6
 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 2112.8 2129.5 2192.0 2222.1 2180.5 2136.3 2189.1 2159.2 2236.1 2218.9
3.3.4 dMM 2155.3 2144.9 2171.5 2206.1 2205.7 2194.5 2220.9 2218.9 2223.6 2229.9
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ===== ===== ===== ===== ===== =====
3.3.2 d100 967.9 962.4 627.4 862.9 677.2 490.4
3.3.4 d100 1019.9 1153.2 710.1 891.6 732.5 636.8
3.3.2 d500 1889.3 1723.9 1452.0 1777.8 1514.8 1245.7
3.3.4 d500 1939.4 1729.7 1590.0 1718.1 1501.5 1402.7
3.3.2 d1000 2117.9 1917.6 1653.3 1935.7 1790.2 1526.1
3.3.4 d1000 2155.8 1823.7 1677.6 1932.1 1701.0 1528.4
 GEMV SYMV TRMV TRSV GER SYR SYR2
 ====== ====== ====== ====== ====== ====== ======
3.3.2 d500 122.4 225.6 113.4 109.4 39.2 47.3 61.0
3.3.4 d500 130.1 245.2 170.1 151.5 160.1 107.1 156.9
3.3.2 d1000 166.0 231.3 101.0 97.3 37.3 37.4 52.1
3.3.4 d1000 214.1 208.7 194.3 180.0 172.2 115.3 165.7
 ROTM SWAP SCAL COPY AXPY DOT NRM2 ASUM AMAX
 ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 d500000 72.5 28.6 18.6 51.3 29.5 35.1 18.3 47.5 49.3
3.3.4 d500000 77.8 33.6 39.4 50.8 133.0 82.3 96.1 180.9 120.8
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 21:17:53
<PRE>
Date: 2001年7月30日 19:38:48
Vers: All atlas versions
Guys,
Had a user report some performance problems, saying he suspected the compiler.
Told him gcc has been doing pretty good for a long while, thought he probably
misinstalled. Tried the gcc given in RedHat 7.1 myself, got horrible results.
Same true of gcc 3.0.
You may have seen this as a feature of the new gcc 3.0:
 * New x86 back-end, generating much improved code.
What this appears to translate to for ATLAS is a roughly 50% performance
drop, at least on the Athlons. I don't have gcc 3.0 on a Pentium yet, but
the user reports a similar (though not quite as severe) degradation there.
I'd heard rumors about gcc 3.0 not doing as well for x86, but I certainly had
not imagined this . . .
By using different parameters, I managed to get code more like 65% of previous
performance. Perhaps there's a magic flag to turn all these great new
features off, but somehow I doubt it. I'm sending this out before I've got
all the facts in case others know something about it, or want to see if they
can figure it out. In the meantime, if you want performance, stick with
the old stuff, and if you play with it, any info appreciated . . .
Clint
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 21:16:11
<PRE>
Date: 2001年8月15日 11:11:07
From: co...@au...
Vers: ATLAS 3.3.2 on 733 Mhz G4e
I had a friend who owns a 733 Mhz G4e machine (older model, has 256kb 
L2 cache at full speed and 1 MB L3 cache at slower speed) run 
xsl3blastst -F 500 using the 3.3.2 developer release.
The 733 machine gets a peak rate 48% higher (2925 mflops vs. 1969 
mflops), with only a 37% higher clock rate. Now if only Motorola can 
make a 1+ Ghz G4e and make Altivec do double-precision...
First, the results on my 533 Mhz G4 (1MB L2 at 233 Mhz):
--------------------------------- GEMM ----------------------------------
TST# A B M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
 0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.01 198.4 1.00 -----
 0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 1000.0 5.04 PASS
 1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.08 196.8 1.00 -----
 1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.01 1209.8 6.15 PASS
 2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.30 181.3 1.00 -----
 2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.04 1278.9 7.05 PASS
 3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.87 147.1 1.00 -----
 3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.07 1920.0 13.05 PASS
 4 N N 500 500 500 1.0 1000 1000 1.0 1000 1.92 130.2 1.00 -----
 4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.14 1724.1 13.24 PASS
 5 N N 600 600 600 1.0 1000 1000 1.0 1000 3.42 126.3 1.00 -----
 5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.26 1661.5 13.15 PASS
 6 N N 700 700 700 1.0 1000 1000 1.0 1000 5.78 118.7 1.00 -----
 6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.42 1633.3 13.76 PASS
 7 N N 800 800 800 1.0 1000 1000 1.0 1000 8.79 116.5 1.00 -----
 7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.52 1969.2 16.90 PASS
 8 N N 900 900 900 1.0 1000 1000 1.0 1000 12.64 115.3 1.00 -----
 8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.78 1869.2 16.21 PASS
 9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 17.38 115.1 1.00 -----
 9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 1.11 1801.8 15.66 PASS
10 tests run, 10 passed
Now his machine:
--------------------------------- GEMM ----------------------------------
TST# A B M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
 0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.01 247.5 1.00 -----
 0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 1351.4 5.46 PASS
 1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.06 251.8 1.00 -----
 1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.01 1653.3 6.57 PASS
 2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.27 203.3 1.00 -----
 2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.03 1869.2 9.19 PASS
 3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.72 177.8 1.00 -----
 3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.05 2742.9 15.43 PASS
 4 N N 500 500 500 1.0 1000 1000 1.0 1000 1.63 152.9 1.00 -----
 4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.10 2500.0 16.35 PASS
 5 N N 600 600 600 1.0 1000 1000 1.0 1000 3.44 125.6 1.00 -----
 5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.17 2541.2 20.24 PASS
 6 N N 700 700 700 1.0 1000 1000 1.0 1000 6.10 112.5 1.00 -----
 6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.29 2365.5 21.03 PASS
 7 N N 800 800 800 1.0 1000 1000 1.0 1000 9.54 107.3 1.00 -----
 7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.35 2925.7 27.26 PASS
 8 N N 900 900 900 1.0 1000 1000 1.0 1000 13.92 104.7 1.00 -----
 8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.52 2803.8 26.77 PASS
 9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 19.14 104.5 1.00 -----
 9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 0.73 2739.7 26.22 PASS
10 tests run, 10 passed
-- 
Nicholas Coult, Ph.D., web: http://melby.augsburg.edu/~coult
Assistant Professor, Department of Mathematics, Augsburg College
co...@au..., phone: (612) 330-1064 office: Science Hall 137B
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 21:00:52
<PRE>
Date: 2001年4月10日 00:20:08
Vers: MKL5.0 vs. ATLAS 3.2.1 & 3.3.0 on 1.5Ghz P4
Guys,
Here are some timings comparing MKL5.0, ATLAS 3.2.1, and the ATLAS developer
release 3.3.0. All timings are on our 1.5Ghz P4 (256K L2). Note that the
developer release requires an experimental "as" to assemble the new SSE2
instructions; I was not able in the 5 minutes I spent on it to get this
rolling under cygwin/Windows 2000. Again, I'm timing under Win2K 'cause
the Linux version of MKL is under NDA. So, MKL5.0 and ATLAS 3.2.1 timings
were obtained under Win2K, while the ATLAS 3.3.0 timings were taken under
Linux **on the same machine**.
As with the PIII, MKL5.0 seg faults for 500x500 HERK and HER2K, so that's why
there are no timings for that case.
The quickest summary I could give would be: just use ATLAS.
The main difference between ATLAS 3.2 and 3.3 is, of course, support for
SSE2 using Camm and Peter's excellent kernels. I have not done full timings;
I settled for what I had time for, so perhaps MKL may be better on others,
but I think its P4 support is just too preliminary for that to be likely . . .
Cheers,
Clint
*******************************************************************************
* 1.5Ghz P4, 256K L2 *
*******************************************************************************
M50 : MKL5.0, Win2K
A32 : ATLAS 3.2.1, Win2K
A33 : ATLAS 3.3.0, Linux
 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000
 ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
M50 dLU 676.4 681.2 677.9 685.7 690.5 690.0 691.7 694.6 695.3 698.1
A32 dLU 1045.7 1073.6 1077.1 1108.8 1109.1 1124.8 1130.2 1138.9 1137.9 1161.6
A33 dLU 1514.8 1562.7 1568.6 1619.3 1645.5 1677.6 1690.4 1720.1 1715.2 1722.1
M50 sLU 1741.7 1790.7 1840.5 1883.8 1861.5 1915.3 1999.8 1995.9 1977.1 1994.4
A32 sLU 2449.5 2571.5 2699.7 2812.1 2878.7 2977.9 3036.6 3094.0 3142.3 3191.8
A33 sLU 2449.5 2504.6 2624.4 2756.3 2851.0 2944.5 3060.8 3090.8 3126.2 3173.8
 100 200 300 400 500 600 700 800 900 1000
 ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
M50 sLU 556.9 1121.7 1346.6 1245.2 1468.4 1513.9 1522.8 1696.6 1728.1 1748.5
A32 sLU 527.6 917.8 1134.0 1520.9 1560.2 1917.6 1903.5 1994.2 2207.3 2220.6
A33 sLU 514.1 917.8 1134.0 1521.0 1664.2 1917.6 1903.5 2131.3 2207.3 2220.6
M50 dLU 384.8 531.3 567.0 606.6 622.5 639.2 650.8 642.2 664.3 672.2
A32 dLU 425.7 673.0 766.8 815.8 734.2 924.9 993.1 974.3 988.9 1023.3
A33 dLU 435.8 696.2 936.8 1120.7 1134.7 1150.6 1269.0 1311.6 1387.4 1448.2
M50 cLU 769.8 998.1 1024.4 1033.4 1037.6 1086.1 1100.1 1098.8 1114.9 1109.3
A32 cLU 554.9 912.6 1239.8 1543.0 1665.4 1856.9 2077.6 2162.7 2310.6 2377.9
A33 cLU 631.0 1064.7 1438.2 1623.9 1850.5 2055.9 2229.7 2313.0 2369.7 2445.6
M50 zLU 696.2 848.3 859.5 919.1 897.8 898.0 895.4 896.6 928.4 917.9
A32 zLU 504.8 709.8 826.6 897.4 945.0 992.5 1026.0 1040.2 1071.8 1077.5
A33 zLU 438.9 734.3 980.6 1136.7 1189.6 1338.7 1451.1 1436.5 1518.1 1532.0
M50 sMM 2500.0 2380.0 2454.5 3200.0 3125.0 2880.0 2982.6 3200.0 3095.5 2980.6
A32 sMM 2142.9 2880.0 3200.0 3605.6 4166.7 3570.2 3591.6 3923.4 3940.5 3913.9
A33 sMM 2631.6 2917.6 3240.0 4266.7 3846.2 3600.0 4035.3 4096.0 3940.5 3921.6
M50 dMM 952.4 1361.7 1350.0 1600.0 1562.5 1728.0 1591.6 1762.5 1672.0 1752.8
A32 dMM 952.4 1066.7 1148.9 1163.6 1184.8 1196.7 1222.8 1217.6 1245.1 1240.7
A33 dMM 1515.2 1600.0 1675.9 1920.0 2000.0 1878.3 1854.1 1969.2 1997.3 1941.7
M50 cMM 19.4 1113.0 1136.8 1216.2 1189.1 1190.1 1191.5 1202.9 1198.3 1199.4
A32 cMM 112.0 2115.7 2700.0 3938.5 4000.0 3918.4 3859.4 4129.0 4072.6 3994.0
A33 cMM 666.7 3200.0 3085.7 3938.5 4000.0 3840.0 3920.0 4137.4 4050.0 4060.9
M50 zMM 1000.0 1010.5 981.8 1064.4 1039.5 1071.3 1045.7 1067.8 1051.2 1052.5
A32 zMM 1047.1 1010.5 1200.0 1187.9 1161.4 1223.8 1212.0 1217.2 1208.5 1199.4
A33 zMM 1538.5 1920.0 1963.6 1897.3 1923.1 2057.1 2017.6 2133.3 2046.3 2088.8
 HEMM HERK HER2K
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ====== ====== ====== ====== ====== ======
M50 s500 3125.0 1785.7 1926.9 1243.8 2500.0 1785.7
A32 s500 3571.4 3571.4 2783.3 3571.4 3125.0 2272.7
A33 s500 3846.2 3571.4 2890.4 3571.4 3125.0 2381.0
M50 d500 1087.0 803.9 1138.6 889.7 1126.1 690.6
A32 d500 1000.0 1136.4 963.5 1136.4 1041.7 1041.7
A33 d500 1923.1 1923.1 1565.6 1785.7 1666.7 1470.6
M50 c500 1062.7 1040.6 1000.0 891.3 1019.3 1085.7
A32 c500 2849.0 3436.4 2947.1 3846.2 3128.1 1467.7
A33 c500 4000.0 3703.7 2783.3 3783.3 3336.7 2085.4
M50 z500 1019.4 960.6 **SEG FAULT** 925.1 892.2
A32 z500 1203.4 1189.1 961.6 1189.1 1040.5 1062.6
A33 z500 1851.9 1785.7 1565.6 1923.1 1725.9 1352.7
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 20:58:28
<PRE>
Date: 2001年3月28日 15:11:07
Vers: MKL5.0 vs. ATLAS3.2.0 on PII & PPRO
I include below MKL5.0 and ATLAS 3.2.1 timings on PII and PPRO platforms.
There's very little difference between the two, but MKL seems to be
better overall on these architectures.
The most interesting part of the timings for me is the confirmation of my
earlier theory on Level 1 performance. If you recall, I said I thought
MKL beat us badly on Level 1 for the PIII because of two factors: prefetch
and 1-cycle ABS.
PII and PPRO do not have prefetch, and we see that ATLAS now essentially ties
MKL for all the routines except NRM2, ASUM and AMAX, all of which use ABS . . .
Cheers,
Clint
*******************************************************************************
* 300Mhz PII, 512K L2, WinNT 4.0 *
*******************************************************************************
 100 200 300 400 500 600 700 800 900 1000
 ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
MKL dMM 194.2 192.0 192.2 195.1 197.5 194.7 201.4 202.3 199.4 200.9
ATL dMM 152.2 192.0 191.5 199.7 202.6 207.9 210.0 208.0 210.6 212.6
MKL sMM 228.6 240.9 241.1 241.1 242.5 242.6 245.3 247.3 246.0 247.6
ATL sMM 194.0 219.4 230.4 234.4 235.4 236.3 239.9 240.0 240.5 242.4
MKL cMM 15.4 245.8 246.9 248.3 247.1 248.5 247.3 249.2 247.8 234.2
ATL cMM 131.9 231.9 234.5 237.5 239.8 240.4 240.6 241.8 242.1 240.8
MKL dLU 111.4 151.0 164.1 165.1 166.4 170.4 172.0 176.1 176.6 177.6
ATL dLU 100.5 129.6 143.6 151.3 161.6 164.4 170.0 171.8 176.6 178.4
MKL sLU 128.1 170.0 191.4 188.0 190.4 200.0 197.6 202.1 200.5 203.0
ATL sLU 116.5 155.4 179.6 188.0 197.2 200.0 206.0 205.9 210.0 214.3
MKL cLU 181.8 209.8 214.0 218.3 219.7 222.0 223.3 224.0 223.7 225.1
ATL cLU 142.5 176.0 191.8 198.3 205.0 209.3 213.5 215.7 217.8 221.0
MKL zLU 159.9 181.6 183.9 188.2 190.3 193.9 195.0 196.3 197.1 198.4
ATL zLU 122.6 147.4 158.7 168.0 174.8 177.9 183.4 184.6 187.9 187.7
 HEMM HERK HER2K
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ====== ====== ====== ====== ====== ======
MKL s500 231.9 205.1 205.7 190.4 228.5 222.4
ATL s500 228.5 231.9 190.9 235.4 216.3 222.4
MKL d500 190.4 172.1 167.0 156.8 186.0 195.3
ATL d500 200.0 192.8 160.4 200.0 186.0 195.0
MKL c500 241.5 221.4 215.2 212.6 222.4 230.4
ATL c500 236.2 236.2 200.4 237.9 219.3 206.6
MKL z500 219.9 196.9 185.3 178.8 212.2 206.6
ATL z500 209.2 204.5 163.6 208.5 195.4 184.1
 HEMV GERU HER HER2
 GEMV SYMV TRMV TRSV GER SYR SYR2
 ====== ====== ====== ====== ====== ====== ======
MKL s500 68.4 68.4 64.0 65.6 34.7 34.1 56.4
ATL s500 72.3 81.5 67.3 64.0 34.7 37.0 57.6
MKL d500 37.2 48.7 36.1 36.1 19.5 19.8 39.7
ATL d500 34.2 56.2 33.3 32.4 20.4 19.6 34.3
MKL c500 97.3 90.4 83.2 96.2 65.7 57.1 111.0
ATL c500 97.3 121.6 92.5 83.4 56.6 55.8 84.3
MKL z500 78.6 81.1 52.2 73.4 38.6 35.3 61.1
ATL z500 59.5 86.9 56.7 55.5 33.4 31.7 53.1
 ROTM SWAP SCAL COPY AXPY DOT NRM2 ASUM AMAX
 ====== ====== ====== ====== ====== ====== ====== ====== ====== 
MKL d500 34.6 10.5 32.0 7.5 15.1 18.6 141.8 160.0 67.3
ATL d500 33.7 10.4 33.7 7.7 14.2 18.8 9.6 31.2 25.6
*******************************************************************************
* 180Mhz PPRO, 256K L2, WinNT 4.0 *
*******************************************************************************
 100 200 300 400 500 600 700 800 900 1000
 ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
MKL sLU 77.1 97.0 104.7 113.6 118.2 119.6 120.8 124.7 125.3 126.9
ATL sLU 69.1 97.0 104.4 108.9 118.4 119.6 121.8 124.7 126.3 127.6
MKL dLU 69.1 85.0 100.0 108.9 108.8 109.6 113.3 115.5 117.7 119.8
ATL dLU 62.5 79.9 85.1 93.8 96.8 101.1 104.5 104.9 109.4 110.2
MKL cLU 107.3 123.8 128.0 130.0 130.8 133.5 135.1 135.6 136.4 138.3
ATL cLU 85.0 104.9 112.4 119.9 122.5 127.5 129.7 131.1 133.0 134.7
MKL zLU 97.2 109.2 112.4 117.3 119.8 122.4 124.2 122.0 125.6 125.7
ATL zLU 72.8 90.8 100.0 104.0 107.1 112.0 114.5 115.8 117.9 118.6
 HEMM HERK HER2K
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ====== ====== ====== ====== ====== ======
MKL s500 148.1 128.0 123.3 116.0 142.9 135.6
ATL s500 139.2 139.1 117.8 139.1 129.0 133.4
MKL d500 131.1 114.3 111.3 104.6 123.0 129.0
ATL d500 123.1 121.2 92.2 123.1 116.0 116.0
MKL c500 147.5 134.5 130.3 129.8 133.5 138.7
ATL c500 142.9 142.5 121.9 143.2 132.9 124.6
MKL z500 136.5 124.5 117.0 112.9 129.2 128.1
ATL z500 128.5 125.3 99.6 128.0 119.1 113.2
 HEMV GERU HER HER2
 GEMV SYMV TRMV TRSV GER SYR SYR2
 ====== ====== ====== ====== ====== ====== ======
MKL s500 48.0 48.1 44.2 44.2 24.5 23.5 43.2
ATL s500 48.1 56.9 45.7 42.6 24.5 23.9 40.4
MKL d500 25.5 32.9 24.2 24.6 14.5 14.1 27.3
ATL d500 27.8 35.7 25.6 24.6 14.4 14.1 23.6
MKL c500 72.1 60.7 55.4 67.7 46.2 43.7 83.0
ATL c500 60.7 76.7 61.0 55.3 41.3 42.2 55.2
MKL z500 60.7 55.0 35.8 52.8 30.3 24.4 50.3
ATL z500 33.9 60.7 33.8 32.1 25.1 24.0 38.6
 DOTU
 ROTM SWAP SCAL COPY AXPY DOT NRM2 ASUM AMAX
 ====== ====== ====== ====== ====== ====== ====== ====== ====== 
MKL s500 37.6 12.5 14.2 11.0 16.0 23.8 53.5 58.1 30.4
ATL s500 37.7 12.3 12.5 8.8 16.0 24.6 5.7 17.8 14.9
MKL d500 18.8 6.3 7.3 5.7 9.0 13.1 35.6 35.6 22.1
ATL d500 18.3 6.3 7.4 4.6 8.9 13.1 5.5 13.9 12.3
MKL c500 12.6 33.7 11.2 33.7 40.0 91.7 58.1 32.1
ATL c500 12.3 31.9 8.8 30.5 40.0 11.4 17.8 14.5
MKL z500 6.3 22.1 5.6 16.8 22.8 70.9 37.6 23.7
ATL z500 6.2 18.3 4.5 17.3 26.7 11.0 13.6 11.0
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 20:56:06
<PRE>
Date: 2001年3月22日 22:20:47
Vers: MKL5.0 vs. ATLAS3.2.0 on 933Mhz PIII
Guys,
Some guys at Intel have been asking me to publish some ATLAS vs MKL numbers,
since most of my previous graphs compared against Greg Henri's BLAS. I used
to compare against Greg's BLAS 'cause MKL wasn't available under Linux, and
it is always a pain for me to get access to a Windows platform. MKL 5.1
is presently in BETA, and it has a Linux version. Since it's a BETA,
however, Intel requires you to agree to an NDA saying you won't publish
any benchmarks using it, and the Intel people have been unable to free
me from the NDA.
I've been working on the windows stuff lately, however, and once I figured
out how to call MKL, I was able to get numbers with MKL 5.0, which does not
have a no-publish NDA. Because I agree that comparing against Greg's stuff
is not the thing to do, I tried to do a fairly wide range of timings to 
clear the air here. I include these timings below.
If I had to summarize these PIII timings, it would be that ATLAS blows chunks
for Level 1 BLAS, tends to beat MKL for Level 2 BLAS, and varies between
quite a bit slower and quite a bit faster than MKL for Level 3 BLAS, 
depending on problem size and data type.
The Level 1 results are easily explained. ATLAS's present Level 1 gets
its optimization mainly from the compiler. This gives MKL two huge
advantages: MKL can use the SSE prefetch instructions to speed up pretty
much all Level 1 ops. The second advantage is in how ABS() is done.
ABS() *should* be a 1-cycle operation, since you can just mask off the
sign bit. However, you cannot portably do bit operations on floats in
ANSI C, so ATLAS has to use an if-type construct instead. This spells
absolute doom for the performance of NRM2, ASUM and AMAX.
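A minimal C sketch of the contrast (illustrative only, not ATLAS source; the 64-bit IEEE double layout and the memcpy type-pun are assumptions about the target, and the mask version needs more than strict ANSI C89):

```c
#include <string.h>

/* Illustrative only, not ATLAS source.  abs_branch() is what portable
 * ANSI C allows; the branch is what dooms NRM2/ASUM/AMAX.  abs_mask()
 * clears the IEEE sign bit with one integer op, which an asm/SSE
 * kernel gets for free (assumes a 64-bit IEEE double). */
double abs_branch(double x) { return (x < 0.0) ? -x : x; }

double abs_mask(double x)
{
    unsigned long long u;
    memcpy(&u, &x, sizeof u);    /* well-defined type pun */
    u &= ~(1ULL << 63);          /* mask off the sign bit */
    memcpy(&x, &u, sizeof x);
    return x;
}
```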
For the Level 2 and 3, ATLAS has its usual advantage of leveraging basic
kernels to the maximum. This means that all Level 3 ops follow the performance
of GEMM, and Level 2 ops follow GER or GEMV. MKL has the usual disadvantage
of optimizing all these routines separately, leading to widely varying
performance.
For Level 2, ATLAS wins for pretty much all operations, sizes and precisions
other than DGER. ATLAS's success here is due mainly to Camm's excellent
prefetched Level 2 GEMV and GER kernels.
For the Level 3, we really have a mixed bag. ATLAS's main weakness is in its
complex TRSM. This is because TRSM cannot use the GEMM kernel as much as
the rest of the operations. Anytime TRSM runs slower than TRMM, this is
the reason. Complex is hit harder than real because I wrote a hand-tuned
kernel for real, while for complex we must recur all the way down to 1. The fix for this
poor performance requires some theory that we don't yet have: details
of the problem are posted on the developer site, if anyone is interested.
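As a hedged illustration of why TRSM tracks GEMM when it can (a sketch, not ATLAS code): recursive TRSM for the real lower/left/notrans case casts everything outside the 1x1 base case as a rank update, which a tuned GEMM kernel would perform; the triple loop below stands in for that kernel.

```c
/* Illustrative sketch, not ATLAS code: solve L*X = B for lower
 * triangular L (n x n, row-major, leading dim ldl) with nrhs
 * right-hand sides, X overwriting B.  All the work outside the
 * n==1 base case is the rank update below, which a tuned GEMM
 * kernel would perform, hence TRSM speed tracks GEMM speed. */
static void gemm_update(int m, int n, int k, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc)
{
    int i, j, p;                     /* C -= A*B */
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++) {
            double s = 0.0;
            for (p = 0; p < k; p++) s += A[i*lda+p] * B[p*ldb+j];
            C[i*ldc+j] -= s;
        }
}

void rtrsm(int n, int nrhs, const double *L, int ldl, double *B, int ldb)
{
    int n1, n2, j;
    if (n == 1) {                    /* base case: scale one row */
        for (j = 0; j < nrhs; j++) B[j] /= L[0];
        return;
    }
    n1 = n/2;  n2 = n - n1;
    rtrsm(n1, nrhs, L, ldl, B, ldb);                 /* L11*X1 = B1  */
    gemm_update(n2, nrhs, n1, L + n1*ldl, ldl,       /* B2 -= L21*X1 */
                B, ldb, B + n1*ldb, ldb);
    rtrsm(n2, nrhs, L + n1*ldl + n1, ldl,            /* L22*X2 = B2  */
          B + n1*ldb, ldb);
}
```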
ATLAS is also, in general, worse at small problems than MKL.
The main weakness of MKL in the Level 3 operations is in its handling of
single precision complex, where it doesn't look like they have SSE
optimizations yet. MKL also tends to lose to ATLAS on pretty much everything
except GEMM for large problems.
For the factorizations, ATLAS tends to lose for small problems, and win for
large. In part, this is because we recur down to 1; I am hoping to include
LU and possibly LLt variants that stop the recursion before 1 in the next developer
release. Preliminary timings show this to make a large performance difference
for small problem sizes. For complex, the poor small-size TRSM performance
also has a definite impact, and a crushing one for LLt.
Cheers,
Clint
*******************************************************************************
* NOTES *
*******************************************************************************
All timings were taken on a 933Mhz PIII, 256K L2, under Windows 2000, using
MKL 5.0 and ATLAS 3.2.0.
The ATLAS timers were used: this may mean performance is less than with
other timers, as ATLAS flushes the data caches before each call.
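A hedged sketch of what such cache flushing amounts to (the 8MB buffer size is an assumption, not ATLAS's tuned value): sweep a buffer larger than any cache before each timed call so the routine's operands start cold.

```c
#include <stddef.h>

/* Illustrative sketch of flushing the caches before a timed call (the
 * 8MB size is an assumption, not ATLAS's tuned value): sweep a buffer
 * larger than any cache so the timed routine's operands start cold. */
#define FLUSH_DOUBLES (8*1024*1024/8)
static double flushbuf[FLUSH_DOUBLES];

double cache_flush(void)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < FLUSH_DOUBLES; i++)
        s += (flushbuf[i] += 1.0);
    return s;       /* returned so the compiler cannot drop the sweep */
}
```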
For all timings, M=K=N, alpha=1.0, beta=1.0, Side='Left', Uplo='Lower', 
TRANS='Notrans', DIAG='Nonunit', except for the Level 1, where alpha=2.0 for
real, and (2.0, 2.2) for complex.
No timings are given for 500x500 HERK and HER2K for MKL, 'cause this call gave
an access violation.
MKL does not possess the Level 1 routines DSDOT and SDSDOT.
No timings are given for N=100 or 200 complex Cholesky, 'cause our timer
couldn't get enough accuracy to be repeatable.
There's a lot of other timings that could be done, but I'm unlikely to do them.
I will be posting the library I built to do these timings to the prebuilt page
(and it was just a standard ATLAS install, anyway, if you want to install
yourself), if other people would like to time further.
Timings either have problem size or operation along X axis. When problem
size is along the X axis, library (MKL for MKL 5.0, ATL, for ATLAS 3.2.0),
data type (S: single real, D: double real, C: single complex, Z: double complex)
and operation are given along Y. When operation is along the X axis, 
library, data type and problem size are given along Y.
LU is GETRF, LLT is POTRF.
Theoretical peak for double precision for this machine is 933 MFLOP. For
single precision using SSE (as both libraries do), theoretical peak is
3.732 GFLOP.
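The arithmetic behind those peaks, as a hedged check (the flops-per-cycle figures are inferred from the quoted numbers, not stated in the post):

```c
/* Hedged arithmetic check of the quoted peaks: MFLOPS = MHz * flops/cycle.
 * Inferred from the text: 1 double-precision flop per cycle on the x87
 * unit, 4 single-precision flops per cycle with SSE. */
double peak_mflops(double clock_mhz, int flops_per_cycle)
{
    return clock_mhz * (double)flops_per_cycle;
}
```

So peak_mflops(933.0, 1) reproduces the 933 MFLOP double peak, and peak_mflops(933.0, 4) the 3.732 GFLOP single/SSE peak.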
*******************************************************************************
* LEVEL 3 TIMINGS *
*******************************************************************************
 100 200 300 400 500 600 700 800 900 1000
 ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
MKL SGEMM 1327.7 1445.3 1400.6 1672.4 1584.3 1592.5 1661.6 1724.6 1675.9 1662.5
ATL SGEMM 911.6 1359.5 1347.4 1492.8 1502.4 1543.9 1544.5 1569.3 1599.9 1610.3
MKL DGEMM 640.2 648.4 648.0 664.4 680.3 673.9 697.2 704.7 691.3 699.5
ATL DGEMM 551.9 622.3 635.3 646.5 653.6 673.9 665.4 682.7 675.9 677.0
MKL CGEMM 773.8 818.8 766.0 819.2 810.4 825.6 820.6 829.5 825.7 825.8
ATL CGEMM 1094.9 1449.1 1542.9 1561.0 1524.4 1556.8 1554.7 1588.8 1595.2 1610.0
MKL ZGEMM 610.8 664.4 692.3 745.3 727.3 747.4 734.9 753.4 737.6 740.9
ATL ZGEMM 599.0 647.6 727.9 668.4 681.2 682.7 683.3 682.7 688.6 690.0
MKL SLU 477.8 751.1 846.4 839.1 810.1 837.4 812.9 909.4 887.7 906.3
ATL SLU 385.7 633.3 748.1 860.3 931.9 995.3 1019.7 1064.0 1109.9 1152.5
MKL DLU 366.5 462.0 475.6 487.3 484.7 497.6 504.2 519.0 518.2 526.6
ATL DLU 337.5 430.5 459.4 504.6 514.7 525.9 541.3 560.0 555.0 568.4
MKL CLU 606.4 667.2 644.5 641.8 646.1 669.3 664.9 682.3 690.8 696.4
ATL CLU 459.0 681.4 768.5 910.2 969.7 1052.4 1083.1 1134.4 1173.4 1201.3
MKL SLLT 288.4 459.2 568.5 644.8 683.2 753.0 763.9 782.0 779.3 821.2
ATL SLLT 244.3 407.1 530.0 632.1 730.0 808.4 833.9 887.4 953.3 970.4
MKL DLLT 298.5 416.3 428.9 442.4 461.3 461.7 473.4 775.6 486.8 496.8
ATL DLLT 256.5 348.6 403.9 428.3 445.5 478.0 505.9 508.9 501.9 534.1
MKL CLLT 585.0 613.0 629.4 616.4 639.0 635.1 642.8 648.1
ATL CLLT 585.0 686.5 715.5 840.3 862.4 912.8 959.1 983.3
MKL ZLLT 465.0 550.1 695.8 597.3 599.7 616.7 629.9 638.2
ATL ZLLT 385.9 456.5 466.3 499.3 506.4 527.8 537.8 524.7
 HEMM HERK HER2K
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ====== ====== ====== ====== ====== ======
MKL S100 1362.4 581.7 504.0 414.8 800.0 711.2
ATL S100 941.6 1049.3 688.2 912.3 598.1 542.3
MKL S500 1560.1 1000.0 1079.7 959.1 1453.5 901.7
ATL S500 1524.4 1422.5 1144.6 1500.0 1305.5 1102.5
MKL S1000 1662.5 1163.5 1256.0 1142.9 1560.1 1033.1
ATL S1000 1600.0 1524.4 1334.7 1600.0 1455.6 1333.3
MKL D100 640.0 419.7 376.3 326.5 569.0 512.2
ATL D100 556.3 543.0 400.0 541.5 473.9 465.7
MKL D500 693.4 551.9 572.8 551.9 648.8 545.1
ATL D500 666.7 615.8 522.6 666.7 600.0 600.0
MKL D1000 699.3 606.6 621.7 598.1 666.7 566.6
ATL D1000 688.2 656.4 587.4 666.7 639.8 639.8
MKL C100 771.0 533.3 487.5 492.1 522.8 607.9
ATL C100 1067.2 1033.1 608.1 1023.8 725.1 425.3
MKL C500 810.4 718.9 712.7 703.2 727.5 728.5
ATL C500 1522.1 1488.1 1187.2 1488.1 1334.7 1001.0
MKL C1000 825.8 756.3 753.6 748.6 778.4 740.2
ATL C1000 1605.1 1585.1 1370.8 1475.5 1515.3 1249.5
MKL Z100 656.8 473.9 441.0 411.3 579.3 620.9
ATL Z100 609.8 595.2 457.1 597.6 462.8 392.1
MKL Z500 718.9 646.8 681.0 653.4
ATL Z500 681.2 659.6 553.0 681.2 616.4 582.7 
MKL Z1000 725.2 683.6 692.6 679.0 719.5 672.3
ATL Z1000 689.1 678.1 625.0 681.8 660.1 638.7
*******************************************************************************
* LEVEL 2 TIMINGS *
*******************************************************************************
 HEMV GERU HER HER2
 GEMV SYMV TRMV TRSV GER SYR SYR2
 ====== ====== ====== ====== ====== ====== ======
MKL s100 253.3 178.8 230.2 223.8 155.4 96.4 164.1
ATL s100 301.7 323.2 176.8 175.9 188.2 163.3 246.2
MKL s500 211.9 183.9 175.8 227.0 165.0 101.6 191.6
ATL s500 340.6 463.8 227.0 223.8 192.8 172.1 283.1
MKL s1000 319.0 215.5 201.2 301.8 173.3 105.6 195.9
ATL s1000 414.2 358.5 340.4 333.3 185.4 174.9 273.0
MKL d100 202.5 146.8 193.9 205.2 97.9 83.8 118.9
ATL d100 186.1 145.4 89.9 86.0 74.8 70.0 108.1
MKL d500 166.7 151.7 122.6 178.8 100.9 63.6 115.1
ATL d500 203.8 192.8 157.6 159.2 77.9 67.9 119.0
MKL d1000 167.8 152.6 123.1 173.0 100.0 65.6 117.9
ATL d1000 208.4 189.8 176.8 176.8 76.3 71.6 121.0
MKL c100 381.1 266.6 200.0 228.6 301.9 177.7 271.1
ATL c100 695.4 615.7 355.6 296.2 323.2 275.9 421.2
MKL c500 414.6 275.2 249.7 363.3 312.9 190.3 282.3
ATL c500 693.7 693.7 581.5 570.9 322.4 307.4 419.9
MKL c1000 429.0 276.4 258.2 397.0 311.4 182.9 284.1
ATL c1000 706.1 676.3 635.4 622.6 314.4 300.3 408.2
MKL z100 268.9 203.8 130.6 154.6 155.3 102.2 191.6
ATL z100 380.8 304.9 205.1 189.3 192.7 169.3 225.3
MKL z500 303.9 208.7 153.7 253.7 158.8 106.2 198.1
ATL z500 375.6 301.0 316.5 307.4 186.7 178.6 234.6
MKL z1000 317.8 207.7 159.6 294.2 160.4 105.4 195.2
ATL z1000 373.8 305.5 345.1 341.5 177.5 174.8 224.2
*******************************************************************************
* LEVEL 1 TIMINGS *
*******************************************************************************
 DOTU
 ROTM SWAP SCAL COPY AXPY DOT NRM2 ASUM AMAX
 ====== ====== ====== ====== ====== ====== ====== ====== ====== 
MKL s500 246.3 118.5 76.1 114.2 106.6 168.9 276.4 267.4 357.1
ATL s500 152.4 53.3 11.8 56.1 69.6 94.2 26.0 65.3 57.1
MKL d500 168.3 59.3 71.0 54.2 59.2 145.8 145.8 290.7 213.7
ATL d500 82.1 44.4 38.1 30.8 54.2 91.4 22.7 44.4 40.0
MKL c500 110.1 21.8 110.4 188.0 320.5 641.0 320.5 400.0
ATL c500 53.3 20.8 56.1 138.9 145.3 52.5 66.7 57.1
MKL z500 57.1 118.5 60.3 103.3 127.9 454.5 228.3 228.3
ATL z500 44.4 83.2 30.8 78.0 118.5 45.7 43.8 38.1
</PRE>
From: R C. W. <rw...@cs...> - 2001年09月13日 20:53:00
<PRE>
Date: 2001年1月18日 12:56:47
Vers: ATLAS 3.2.0
OK, I include below the updated P4 timings. I went ahead and timed all 4
types/precisions, and threw in Cholesky as well. I've kept my original
timings (indicated by P40 instead of P4) for comparison. In particular, we
see that the large blocking factor for SSE provides much better GEMM
performance, but that LU is slowed down until we get to very large cases
(~2000), probably due to the inadequacy of the cleanup (if you can call
3Gflop LU inadequate :) . . .
For the larger problem sizes, we see that zMM loses performance. This is due
to running out of memory, with ATLAS having to use less and less workspace
(which causes more and more cache thrashing), until around 2400, where
swapping sets in.
The relatively terrible performance of TRSM for the SSE-enabled code is
because accuracy concerns prevent us from inverting diagonal blocks and using
a gemm-based kernel, and thus we have to drop to using the x86 FPU (with
its associated 1/4 of theoretical peak) for that part of the computation.
Cheers,
Clint
 100 200 300 400 500 600 700 800 900 1000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
P40 dMM 952.4 1010.5 1080.0 984.6 1041.7 1080.0 1055.4 1077.9 1088.1 1075.3
P4 dMM 1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
P40 dLU 435.8 611.8 718.2 788.6 805.2 821.8 878.5 874.4 882.9 888.2
P4 dLU 428.7 659.8 763.1 851.7 887.6 933.9 951.8 974.3 1033.2 1040.9
P4 dLLt 400.1 547.1 615.1 713.8 759.8 820.2 929.0 876.9 918.5 1011.6
P40 sMM 2500.0 3674.1 3240.0 3584.0 3571.4 3600.0 3811.1 3657.1 3645.0 3703.7
P4 sMM 2631.6 3100.0 3351.7 3895.7 4000.0 3756.5 3811.1 4096.0 4050.0 4000.0
P40 sLU 606.5 1153.8 1529.5 1703.5 1808.9 2054.6 2284.2 2200.1 2428.0 2467.3
P4 sLU 537.8 921.9 1182.9 1525.5 1664.2 1830.4 2003.7 2087.8 2174.3 2337.4
P4 sLLt 430.4 727.8 753.8 1162.4 1266.4 1546.7 1697.5 1832.0 1872.3 1963.7
P4 zMM 1184.0 1163.6 1200.0 1219.0 1219.5 1216.9 1219.6 1219.0 1212.5 1201.2
P4 zLU 521.0 749.2 846.0 897.4 951.7 992.5 1027.2 1041.8 1056.1 1062.1
P4 zLLt 604.5 780.1 837.1 876.0 899.6 889.1 928.0 928.0
P4 cMM 2755.6 2968.7 3085.7 4266.7 4000.0 3927.3 2976.8 4137.4 4107.0 4060.9
P4 cLU 636.8 986.8 1232.7 1598.5 1708.1 1856.9 2031.5 2201.1 2341.2 2423.3
P4 cLLt 1078.7 906.8 1716.3 1674.2 1927.2 1911.7 2013.5 2118.3 2121.2
 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000
 ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
P40 dMM 1066.7 1067.7 1066.7 1065.3 1071.0 1073.9 1073.3 1072.7 1073.4 852.3
P4 dMM 1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
P40 dLU 906.5 932.8 937.9 950.2 955.4 965.5 969.8 977.8 975.4 986.1
P4 dLU 1046.6 1075.5 1091.8 1107.2 1110.7 1128.2 1126.3 1130.7 1139.5 1158.0
P4 dLLt 961.2 1017.4 1035.3 1074.9 1050.7 1089.5 1072.3 1081.6 1097.6 1108.9
P40 sMM 3676.6 3683.2 3673.5 3679.5 3661.3 3665.4 3671.7 3657.9 3670.9 3658.5
P4 sMM 4114.3 4126.3 4137.4 4107.0 4166.7 4127.1 4182.8 4140.4 4189.3 4150.7
P40 sLU 2616.5 2688.8 2785.1 2922.1 2913.3 2994.2 3040.6 2074.5 3093.2 3118.8
P4 sLU 2398.5 2539.4 2702.4 2796.0 2961.8 2932.3 3050.7 3090.8 3153.2 3168.2
P4 sLLt 2306.9 2543.5 2484.8 2560.0 2668.7 2732.1 2828.8 2916.4 2940.3 3062.8
P4 cMM 4065.9 3991.3 4030.5 3957.3 4015.1 3806.3 3818.8 3773.7 3063.2 2544.2
P4 cLU 1712.5 1750.1 1835.3 1893.9 1930.3 1972.9 2016.3 2030.4 2064.6 2092.8
P4 zMM 1189.7 1159.6 1159.1 1157.1 1156.0 794.2 ...........................
P4 zLU 872.5 806.5 772.3 763.3 737.8 714.9 691.4 680.9 674.4 670.6
 GEMM SYMM SYRK SYR2K TRMM TRSM
 ===== ===== ===== ===== ===== =====
P40 d500 1041.7 961.5 835.0 1000.0 892.9 1041.7
P4 d500 1209.7 1171.9 1002.0 1209.7 1056.3 1056.3
P4 z500 1204.8 1190.5 1043.8 1204.8 1042.7 981.4
P40 s500 3571.4 3125.0 2636.8 3333.3 2941.2 2500.0
P4 s500 3571.4 4166.7 3006.0 3571.4 3125.0 2419.4
P4 c500 4000.0 3846.2 3131.3 4166.7 3575.0 1787.5
</PRE>
