else Mjoin(PATL,col2blk_a1)(K, M, A, lda, pA, alpha);
to:
else Mjoin(PATL,col2blk2_a1)(K, M, A, lda, pA, alpha);
(i.e., col2blk becomes col2blk2).
If you have already built the library before applying this fix, you can force a recompile by issuing make xcl3blastst xzl3blastst in ATLAS/bin after making the fix.
Large blocking factors hurt LAPACK performance for small N
If you build the combined lapack/ATLAS library, the present ILAENV always
uses ATLAS's GEMM blocking factor for all routines. This causes a slowdown
in the cases where the minimum dimension of the calling routine is small
compared to NB. For instance, on the Itanium, QR factorization will run at
the speed of GEMV until something like N > 500. To fix this problem,
save this file over the top of your present
ATLAS/interfaces/lapack/F77/src/ilaenv.f file. If you have already
built ATLAS and the combined LAPACK/ATLAS library, put this file into your
library in a fashion similar to this:
cd ATLAS/interfaces/lapack/F77/src/
f77 -O -c ilaenv.f
ar r /path/to/lapacklib/liblapack.a ilaenv.o
ATLAS threaded performance dies for large problems
On some platforms, if you time large problems, you'll see that ATLAS's
threaded library does well, and then suddenly drops to below serial
performance. This is probably an error in how we recover from lack of
memory in the threaded case, but in the meantime the easy fix is to increase
the total amount of memory ATLAS is allowed to allocate. To do this, edit
the file ATLAS/include/atlas_lvl3.h, and pump up the macro
ATL_MaxMalloc, which is the maximal size (in bytes) ATLAS is allowed to
allocate. It is presently set to be 16MB (or less, for older releases);
make it as large as you think you can afford.
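Since ATL_MaxMalloc is given in bytes, it can help to work out the byte count for a size in megabytes before editing the header. The following is only a minimal sketch: the helper name mb_to_bytes is made up for illustration (it is not part of ATLAS), and 1MB is taken to mean 1024*1024 bytes.

```sh
#!/bin/sh
# Hypothetical helper (not part of ATLAS): convert megabytes to bytes,
# assuming 1MB = 1024*1024 bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# The 16MB setting mentioned above, and a larger 64MB candidate.
mb_to_bytes 16   # prints 16777216
mb_to_bytes 64   # prints 67108864
```

The printed number is what you would put in place of the current ATL_MaxMalloc value.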
Search doesn't work with cpu throttling.
By default, newer Linuxes (and probably other OSes) have CPU throttling
turned on even for desktops
in order to save power. Since the speed of your CPU is constantly changing,
the ATLAS timing results become essentially meaningless. Therefore, to get
good performance numbers (and thus a fast ATLAS library), be sure to turn
off cpu throttling in your BIOS before installation. If you are using
the machine for high performance code a lot, you may want to leave it off.
You can also usually turn off cpu throttling in your OS, but this varies
by OS. The file ATLAS/INSTALL.txt from the latest developer release
has further details.
Problems installing on a G5.
3.6 does not have explicit support for the G5, but the developer releases,
starting with 3.7.10, have full G5 support, including 32 bit OS X, and 32 and
64 bit Linux support. I recommend you use a developer release for a G5
install.
Apple's cpp prevents install on newer G4 and G5
For newer versions of Apple's modified gcc, you need to add the flag
-no-cpp-precomp to all your various C flags. If you still have
problems during installation using Apple's modified gcc, your best bet
at the moment is to install the newest gcc from GNU and use it instead.
Some directions for installing gcc without being root are given
here, though note that unlike x86 users,
you want the newest gcc.
Misdetection of nregs causes some x86 installs to go awry
Using later gcc compilers, ATLAS's mmsearch may misdetect that you
have more than 15 registers available, which can cause it to skip searching
the optimal cases. To fix, insert the following lines at line 824 of
ATLAS/tune/blas/gemm/mmsearch.c before your install:
#if defined(ATL_GAS_x8632) || defined(ATL_GAS_x8664)
return(0);
#endif
Note that this problem will not affect you if you are using the architectural defaults.
make kill from topdir removes libraries
The documentation says "make killall" from TOPdir removes all
architecture-specific subdirectories, and "make kill" removes all
except the libraries. Actually, due to an error, they both kill everything.
To fix, edit ATLAS/Make.top, and delete line 215, which is:
rm -rf lib/$(arch)
Error in UltraSPARC cleanup
There is an error in the cleanup code for the UltraSPARC processors, that can
cause segmentation faults when [d,z]GEMM is called with K mod NB = 8. To fix,
change line 275 of ATLAS/tune/blas/gemm/CASES/ATL_dmm4x4x8_US.c from
#if (KB != 2)
to:
#if (KB != 8)
before installation. If you have already installed, make the change, and then issue the following commands (substituting your arch string for ARCH, and giving the full path where indicated):
cd ATLAS/tune/blas/gemm/ARCH
rm [d,z]install
make dinstall
make zinstall
cd ATLAS/bin/ARCH
make dmmlib zmmlib dl3lib zl3lib
-O -fomit-frame-pointer -fno-schedule-insns -fno-schedule-insns2
to:
-mcpu=ultrasparc -mtune=ultrasparc -O -fomit-frame-pointer -fno-schedule-insns -fno-schedule-insns2
atlas_prefetch.h won't compile using Sun CC v 5 or earlier.
In order to use prefetch with Sun's cc, ATLAS includes the Sun header file
sun_prefetch.h, which did not exist until Sun CC version 6.
My guess is that without prefetch on the UltraSparc III, you are better off
using gcc to compile your stuff than cc, and that the cc arch defaults are
probably not good. However, if you want to use Sun CC anyway, you will need
to modify atlas_prefetch.h. I haven't had time to scope this problem
myself (and I'm not sure I have access to a machine with an old enough
Sun CC anyway), but here's a fix the user who originally found the error
mentioned:
Change line 50 of ATLAS/include/atlas_prefetch.h from:
#elif defined(__SUNPRO_C) && defined(__sparc)
to:
#elif defined(__SUNPRO_C) && defined(__sparc) && __SUNPRO_CC >= 0x600
If you choose to do this, you will probably want to say no to arch defaults. I would also install using gcc, and compare performance to see which is better.
Assembler renaming problem for Windows machines
This problem primarily affects PIIISSE1 and ATHLONSSE1 installs under Windows,
but any Windows user should apply it. To fix,
change lines 94-95 of ATLAS/tune/blas/gemm/CASES/ATL_smm6x1x60_sse.c
from:
.global ATL_USERMM
ATL_USERMM:
to:
#ifndef Mjoin
#define Mjoin(pre, nam) my_join(pre, nam)
#define my_join(pre, nam) pre ## nam
#endif
#if defined(ATL_OS_WinNT) || defined(ATL_OS_Win9x)
#define ATL_AUSERMM Mjoin(_,ATL_USERMM)
#else
#define ATL_AUSERMM ATL_USERMM
#endif
.global ATL_AUSERMM
ATL_AUSERMM:
and then install as normal.
String overrun in config for long compiler paths
Thanks to Yozo Hida for noticing that compilers with path length ≥ 64 can
cause a memory overwrite in config. To fix, change line 3414 of
ATLAS/config.c from:
char comp[64], cflg[512], ln[512], tnam[256], archdef[256], mmdef[256];
to:
char comp[512], cflg[512], ln[512], tnam[256], archdef[256], mmdef[256];
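To see whether a given compiler path is long enough to trigger this overwrite, you can count its characters before running config. This is only a sketch: the function name path_overflows is made up for illustration, and it simply applies the "64 characters or more" rule stated above.

```sh
#!/bin/sh
# Hypothetical check (not part of ATLAS): report whether a compiler path
# has 64 or more characters, the length at which the old comp[64]
# buffer in config.c is overrun.
path_overflows() {
    if [ "${#1}" -ge 64 ]; then
        echo "affected"
    else
        echo "ok"
    fi
}

path_overflows /usr/bin/gcc   # prints ok
```

If the check reports "affected" for the compiler you plan to use, apply the config.c change before installing.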
This is normal, and not an error. Let me translate this message out of
gnu-speak:
For maximal compatibility, ATLAS hews to the ANSI/ISO 9899-1990 standard,
and so I cannot make the proposed change. Unfortunately,
this warning message is literally immortal: there exists in gcc no flag
combination that I can discover that can turn the freaking thing off.
So, every time you link one of the config programs that calls this standard
routine, the linker outputs this message, even if you turn on the strict
ANSI compatibility flag. I reported it as an error to the gcc folks,
but they point out it is the linker/glibc people that generate the "warning"
and immediately closed the tracker. Still seems wrong to me that the
strict ANSI flag with all warning messages turned off insists on printing
out a message warning about standard usage, but there appears to be little
for me to do about it. Therefore, just ignore the hectoring,
and don't worry about these immortal, bogus, annoying and repetitive
"warnings".
Examples always help. Here's the config line if I want to use Intel's compilers
to build my Pentium 4 library:
You can also override the compiler to one that ATLAS does not know about,
in which case you are likely to get garbage for your flags unless you
override them as well. You will then have to compile without architectural
defaults, which is likely to produce slower libraries, and guaranteed to
produce a much longer install time. But, hey, feel free.
If you override the flags as well as the compiler, be aware that differing
flags using a supported compiler may well decrease the performance of the
library.
Windows users have a few windows-specific options,
as explained here.
After this, you'll still get failures reported (in our last run on
Linux_PIIISSE1, we had failures in cnep.out, csep.out,
and ced.out), but these failures will happen with stock LAPACK and
the F77BLAS.
Another problem that could cause this is that ATLAS misdetected the peak
of your machine, and is thus using an inadequate timing interval. You
can see if this is happening by scoping how long each timing is taking.
If it is very quick, and thus unrepeatable, you need to tell ATLAS to pump
up the timing granularity. To do this, edit the files
ATLAS/include/ARCH/atlas_?sysinfo.h. Each of these four files
will have a quantity called ATL_nkflop. Pump this quantity up
by some significant factor until timings are regular. I usually increase
it by a factor of 5 or 10. If the individual timings are then too slow,
interrupt the process, and decrease the values.
Finally, the Level 1 timings very often display this problem even when
the timing interval is sufficient. The most likely explanation of this
non-repeatable timing problem involves inadequate cache flushing, but it
has not been tracked down for sure. Regardless, the only remedy is to keep
restarting the interrupted install until it completes, as explained
below.
Most of the time, when an install dies in this way, you can just
restart it, as outlined here. If this dies
right away in the exact same timing, but without actually running
the timing again, it means that the install process kept a record of
the bad timing, and is just rereading it. You then need to remove the
bogus timing record file. This file will be in the appropriate architecture
directory under the res/ subdirectory. For instance, if you are
dying in the level 1 tuning, the result files are stored in
/ATLAS/atlas3.4/ATLAS3.4.1/tune/blas/level1/ARCH/res,
and if you are in the gemm tuning they are in
/ATLAS/atlas3.4/ATLAS3.4.1/tune/blas/gemm/ARCH/res, etc. Your
last message from the dying install should give you most info you need
to figure out the result directory and the file. Just remove the file
(or all the files of that precision, if you cannot figure out the
specific file that is bad), and restart the install.
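The result directories described above follow one pattern, so the path can be assembled from the top of the ATLAS tree, the tuning stage, and the architecture name. The sketch below assumes that pattern holds for the stage you are in; the helper name res_dir and the $HOME/ATLAS location are illustrative only.

```sh
#!/bin/sh
# Hypothetical helper (not part of ATLAS): build the timing-result
# directory for one tuning stage, following the layout described above.
#   $1 = top of the ATLAS tree
#   $2 = tuning stage (level1, gemm, ...)
#   $3 = architecture name
res_dir() {
    echo "$1/tune/blas/$2/$3/res"
}

# List the candidate result files rather than deleting anything blindly.
# $HOME/ATLAS and ARCH are placeholders for your own tree and arch string.
dir=$(res_dir "$HOME/ATLAS" gemm ARCH)
echo "$dir"
ls "$dir" 2>/dev/null || echo "(no such directory on this machine)"
```

Listing the directory first makes it easier to pick out the one bad file instead of removing every result of that precision.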
To install with a non-default f77 compiler, simply override the
default fortran compiler and flags from the command line when running
config as explained here.
If you want to install ATLAS so it can be called from multiple,
non-interoperable Fortran compilers (or indeed, have already installed
with the wrong f77 compiler), you can do this with moderate
ease, assuming you know how C and the given F77 compiler(s) interoperate.
If you do not know this interoperational information, you must get
config to find it for you. To do this, in your ATLAS/ directory,
run config again overriding the default fortran compiler and flags from
the command line as explained here. You will
also want to use a different architecture name so as not to overwrite
your good Make.ARCH, and tell config not to create all the
unneeded architecture subdirectories (since you won't be using this
Make.ARCH anyway). You can do this from the command line
by adding something like -a F2C_BOGUS -D 0 to the ./xconfig
command. You can then look at the generated
Make.F2C_BOGUS's F2CDEFS for the appropriate settings,
and replicate them, along with the new F77 compiler/linker information,
into your original Make.ARCH.
For those users already aware of the information needed for C/F77
interoperation,
ATLAS needs three pieces of information in order to correctly handle
F77/C interoperation, and this information appears as defines to the
C compiler, set in your Make.ARCH's F2CDEFS.
The first macro controls the name space alterations necessary to make a
C routine callable from Fortran77. The options are:
The second macro provides a mapping between F77's INTEGER
and the appropriate C integral type. Options are:
The third macro deals with F77 string handling. The options are:
By default, ATLAS builds the F77 interface to the BLAS into the file pointed
at by Make.ARCH's F77BLASlib, and so changing this macro before
recompiling the interface will allow you to build multiple F77 interfaces.
For example, say on a Solaris machine I want to build the f77 interface
for both Sun's f77 and g77. First, I install ATLAS as normal, with the
default f77 compiler. Now, to get a g77 interface lib, I edit my
ATLAS/Make.SunOS_SunUS2, and I find that ATLAS has detected the C/F77
interface for Sun's f77 compiler as:
Finally, I change the f77 compiler/linker information from:
You can essentially repeat this process for the LAPACK F77 interface, but
change LAPACKlib rather than F77BLASlib, and go to
ATLAS/interfaces/lapack/F77/src/SunOS_SunUS2
rather than
ATLAS/interfaces/blas/F77/src/SunOS_SunUS2. Also, LAPACK does not
have a separate entry point for threads, so do not issue any of the additional
threading instructions.
Finally, in your ATLAS/src/testing/ARCH directory, issue:
For each platform, ATLAS defaults to using the fastest available compiler.
If the two compilers deliver roughly the same ATLAS performance, we then
pick the one we think is most standard for users of that platform, and if
we are unsure, we pick gcc/g77, since they are freely available.
Here's a small table outlining some of ATLAS's present
architectural default compiler support:
You can vary the compiler config selects as described
here.
In this table, COMP1 is the compiler ATLAS defaults to using if no arguments
are passed to config. FASTER indicates if the default choice is faster, or
roughly the same speed as the secondary choice. For some archs, we specify
ranges of acceptable compilers, where we know they get good performance.
For others, we just list what compiler version was used in generating the
defaults.
If a range is noted, make absolutely sure you use a compiler in this
range, as failure to do so may cut your performance in half.
As of this writing, gcc 3.2 or greater works well on all architectures.
gcc 2.95 works just as well for x86 archs, if that's what you have installed.
For x86 platforms, you will get pretty much the same performance if you choose
to install with Intel's icc compiler. On the Itanium 2, however,
we highly recommend using icc, as it provides a significantly faster ATLAS
library than gcc. We do not have access to an Itanium 1 any longer, so were
unable to add icc architectural defaults. It is likely it would get better
performance from icc, though.
For x86, gcc 2.95.x is substantially better than Portland Group's pgcc
(the worst compiler to compile ATLAS with), MSVC++, or Watcom C. If you wish
to use these compilers with your own code, they interoperate with gcc if the
correct flags are chosen. We suggest you compile ATLAS with gcc, and
then link with your compiler of choice.
For UltraSparcs, gcc and cc are roughly the same, with cc having a slight
edge, particularly for Level 1 and 2.
The Dec/Compaq alpha compiler is very good for most codes, but not so good
for compiling ATLAS's matmul kernels. It has some optimizations that consume
resources that cannot be turned off; in many cases, ATLAS is actually optimal
already, so these optimizations prevent ATLAS from getting good performance.
Just linking in ATLAS's liblapack.a first will not get you the best LAPACK
performance, mainly because LAPACK's untuned ILAENV will be used instead
of ATLAS's tuned one. So, if you use any LAPACK routine that is not
provided by ATLAS, it is essential that you create this hybrid LAPACK/ATLAS
library in order to get the best performance.
What you want to do is tune CacheEdge, as shown here,
but be sure to use very large problem sizes in order to find CacheEdge.
In ATLAS/tune/blas/gemm/ARCH, issue make xdfindCE. Run
You want to run
this program several times to get a consensus idea of what a good setting
would be. If a CacheEdge setting gets performance in the same range as
no CacheEdge (CacheEdge of 0 is no CacheEdge in printout of xdfindCE),
it is still recommended that you use that setting, since ATLAS with
CacheEdge set will use less memory as problem sizes grow.
Once you have gotten an idea of what to set CacheEdge to, you can change it by
editing ATLAS/include/ARCH/atlas_cacheedge.h. xdfindCE
prints out data in KB, but atlas_cacheedge.h needs bytes, so multiply
the xdfindCE result by 1024 to get the number you want to use in
atlas_cacheedge.h.
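The KB-to-bytes conversion is a single multiplication, which can be done in the shell. A minimal sketch follows; the helper name ce_kb_to_bytes is made up for illustration and is not part of ATLAS.

```sh
#!/bin/sh
# Hypothetical helper (not part of ATLAS): convert a CacheEdge value
# reported by xdfindCE (in KB) to the byte count that
# atlas_cacheedge.h expects.
ce_kb_to_bytes() {
    echo $(( $1 * 1024 ))
}

ce_kb_to_bytes 256   # prints 262144
```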
Let's take an example. Say xdfindCE printed out this:
By successively editing this file and recompiling, for instance
ATLAS/bin/ARCH/x[d,s,z,c]mmtst you can tune this value further.
Many users expect that they should set CacheEdge to the actual size of their
L2 cache. This is only rarely the best setting, mainly because L2 caches
are normally combined data/instruction, and so a smaller setting,
leaving room for instruction caching, is usually best. On some machines
with large L2 caches, things like associativity, or even TLB issues, can
make it more efficient to use a very small subset of the available cache.
Here are some CacheEdge settings that the ATLAS team has chosen:
NOTE: these are out of date!
Applying a patch to your ATLAS directory
To apply an ATLAS patch to your existing directory tree, save the patch
file (we will call it patchfile from here on out) in your
ATLAS/ directory, and then issue:
patch -p1 < patchfile
Unkillable and relentless 'warning: the use of `tmpnam' is dangerous' warnings from gcc.
During config you will get a lot (and if using a developer release, even
more) warnings of the following form:
/tmp/ccq5b8sE.o(.text+0x852): In function `CmndResults':
config.c: warning: the use of `tmpnam' is dangerous,
better use `mkstemp'
Hey, idiot, would you stop using that pesky ANSI/ISO C standard and
use this non-standard routine instead?
Overriding config's default compiler info from the command line.
For some systems, ATLAS actually has good architectural defaults for more
than one set of compilers. ATLAS usually defaults to the best performing
set, as explained here. However, if you want
to use a secondary compiler, config will fill in the flags and so on if you
specify the compiler. Config will print a brief usage message
if you issue make xconfig ; ./xconfig --help.
make xconfig ; ./xconfig -m icc -c icc -f ifort
Here's the command to get Intel fortran compiler, but gcc for all C routines:
make xconfig ; ./xconfig -m gcc -c gcc -f ifort
On UltraSparcs, ATLAS defaults to using Sun's cc/f77. If you want gcc/g77
instead:
make xconfig ; ./xconfig -m gcc -c gcc -f g77
What should I do with my ev7/21364?
During config, say it is an ev6/21264 when prompted for the architecture.
These architectural defaults get good performance on the ev7 (the ev7 uses
the same core as the ev6).
Testing ATLAS with the LAPACK testers.
It takes a bit of hoop-jumping to get ATLAS to pass the LAPACK testers.
First, the LAPACK testers have an error in them that causes them to flag the
ATLAS TRSM as bad. This is explained briefly in the
LAPACK errata.
You can either ignore the incorrect errors generated by lapack in the
xlintst? testers, or you can override ATLAS's TRSM in order to make them
go away. To override, edit LAPACK's make.inc, and link in the
F77 [c,s,d,z]trsm.o in BLASLIB before the ATLAS libs.
Threaded code incorrect on alpha systems
The threaded code hangs on both OSF1/Tru64 and Linux OSes running on
Dec/Compaq Alphas. The odd thing is that so does the previous stable release,
3.2, which used to run fine on the same machines. Considerable effort was
spent attempting to figure this out, to no avail.
Your install dies with "unable to get timings in tolerance"
This means that ATLAS could not get repeatable timings. There are several
things that could cause this to happen. This could occur if the machine is
heavily loaded or experiences a sudden surge in usage from another program,
for instance. If this is the problem, simply keep restarting the install
(as discussed below) until it finishes.
Install dies in tfc / ?Xover.h is incomplete
Change line 58 of ATLAS/tune/blas/gemm/tfc.c, from:
#define MAXALLOC (3*1024*1024*8)
to:
#define MAXALLOC (8*1024*1024*8)
and then restart your install from scratch.
ATLAS IA64 performance cut by more than 1/4 using Red Hat gcc
Using Red Hat's gcc 2.96-ia64-000717 compiler, ATLAS performance is decreased
by almost a factor of five relative to earlier or later compilers. I'm not sure
which versions of Red Hat are affected, so if you experience very poor
ATLAS performance, the best solution appears to be to install gcc 3.2.
ATLAS build dies on Red Hat 7.0 and/or gcc 2.9[6,7]
Red Hat 7.0 shipped with a version of gcc not supported by GNU (GCC 2.96 and/or
2.97). It contains error(s) causing the ATLAS build to fail. Red Hat has
released a patch fixing the problems in the RH7.0 version, available
here.
If this doesn't work for you,
the recommended fix for the problem is to install gcc 2.95.3 or 3.2.
Installing with a non-default f77 compiler
The only Fortran routines in ATLAS are the Fortran77 interface routines,
which do no computation. Therefore, the Fortran77 compiler has absolutely
no effect on ATLAS's performance, and so the only reason you should need
to use a non-default f77 compiler is if the f77 compiler you wish to use
does not interoperate with ATLAS's default compiler.
Installing additional f77 interfaces
The only Fortran routines in ATLAS are the Fortran77 interface routines,
which do no computation. Therefore, the Fortran77 compiler has absolutely
no effect on ATLAS's performance, and so the only reason you should need
to use a non-default f77 compiler is if the f77 compiler you wish to use
does not interoperate with ATLAS's default compiler.
struct {char *cp; F77_INTEGER len;};
struct {char *cp; F77_INTEGER len;};
F2CDEFS = -DAdd_ -DStringSunStyle
I then change this to match g77:
F2CDEFS = -DAdd__ -DStringSunStyle
Now, so that my Sun f77 interface will not be overwritten, I also change:
F77BLASlib = $(LIBdir)/libf77blas.a
to:
F77BLASlib = $(LIBdir)/libg77blas.a
If I had built the threaded BLAS, I would make a similar change to
PTF77BLASlib.
F77 = /opt/SUNWspro/bin/f77
F77FLAGS = -dalign -native -xarch=v8plusa -xO5
to:
F77 = /usr/local/bin/g77
F77FLAGS = -O3 -funroll-all-loops
Now, I cd ATLAS/interfaces/blas/F77/src/SunOS_SunUS2, and issue:
make clean
make lib
If you are using threads, additionally issue:
make ptlib
Now, when linking with Sun's f77, I link to
-lf77blas -latlas, and
when linking with g77 I use -lg77blas -latlas.
make clean ; make lib
How do I link with all these libraries?
The user libs created by ATLAS are:
If you have missing symbols on link, make sure you are linking in all of the
libraries you need, and remember that order *is* significant.
For instance, a code calling the Fortran77 interface to the BLAS would need:
-L$(MY_HOME)/ATLAS/lib/$(MY_ARCH)/ -lf77blas -latlas
The full LAPACK library created by merging ATLAS and netlib LAPACK requires
both C and Fortran77 interfaces, and thus that link line would be:
-L$(MY_HOME)/ATLAS/lib/$(MY_ARCH)/ -llapack -lf77blas -lcblas -latlas
If you wish to use threaded BLAS, you simply indicate those interface libs
rather than the sequential. The above line for SMP would be:
-L$(MY_HOME)/ATLAS/lib/$(MY_ARCH)/ -llapack -lptf77blas -lptcblas -latlas
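Because the order of the libraries matters, it can be convenient to keep the two full-LAPACK link lines in one place and select between them. The sketch below only repeats the library lists given above; the function name atlas_libs and the "serial"/"threaded" keywords are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper (not part of ATLAS): print the library list for the
# full LAPACK link line, serial or threaded. The order is the one given
# in the text: interface libraries first, -latlas last.
atlas_libs() {
    case "$1" in
        threaded) echo "-llapack -lptf77blas -lptcblas -latlas" ;;
        *)        echo "-llapack -lf77blas -lcblas -latlas" ;;
    esac
}

# MY_HOME and MY_ARCH are placeholders, as in the text above.
echo "-L$MY_HOME/ATLAS/lib/$MY_ARCH/ $(atlas_libs serial)"
echo "-L$MY_HOME/ATLAS/lib/$MY_ARCH/ $(atlas_libs threaded)"
```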
Basic compiler information
ATLAS has support for one or more compilers for every platform. In general,
we provide gcc/g77 for most supported architectures, since these compilers
are freely available. The only exception is IBM AIX platforms where we
provide xlc defaults only.
ARCH | COMP1 | COMP2 | FASTER? |
---|---|---|---|
PIII/P4 | 2.95 ≤ gcc ≥ 3.1 | icc 8.0 | SAME |
other x86 | 2.95 ≤ gcc ≥ 3.1 | NONE | NA |
Itanium 1 | gcc 3.x | NONE | NA |
Itanium 2 | icc 8.0 | gcc 3.3 | MUCH |
Ultra2/5 | Sun cc | gcc 3.2 | SAME |
Ultra III | Sun cc | gcc 3.2 | YES |
PPCG4 | Apple cc | gcc ≥ 3.3 | SOME |
Compilers to avoid
We have already discussed the various gcc versions, so now we turn our
attention to other compilers.
My system doesn't have the -f option to cp
If you take the following line, and put it in a file cp you make
executable, and then put it in your path before your system cp, it should
get rid of the -f option:
/bin/cp `echo $* | sed -e 's/-f / /'`
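To see what the sed expression in that wrapper does before relying on it, you can run it on a sample argument list. This sketch only echoes the rewritten arguments and copies nothing; the function name strip_f is made up for illustration.

```sh
#!/bin/sh
# Show what the sed expression in the wrapper does to an argument list:
# the first "-f " is replaced by a single space, the rest is unchanged.
strip_f() {
    printf '%s\n' "$*" | sed -e 's/-f / /'
}

strip_f -f src.c dst.c      # prints " src.c dst.c"
strip_f -p -f src.c dst.c   # prints "-p  src.c dst.c"
```

Note that only the first occurrence of "-f " is removed, and that the wrapper re-splits its arguments on whitespace, so file names containing spaces will not survive it.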
How do I install on an Intel Celeron?
Config will ask you what your hardware is. It is recommended that you
set it to the underlying type of Pentium your Celeron actually is.
For instance, very old Celerons are Pentium II, and the newest are Pentium 4s.
ATLAS fails Level 1 BLAS tester when compiled with gcc 2.95.2 on Compaq/DEC alphas
We have observed this problem, but not yet tracked it down. Since the exact
same testers and code work correctly with older gccs (eg, 2.8 or 2.7), we
suspect a compiler error. For now, the fix is to install gcc 2.8 or 2.7.
Building a complete LAPACK library
ATLAS does not provide a full LAPACK library. However, there is a simple way
to get ATLAS to provide its faster LAPACK routines to a full LAPACK library.
ATLAS's internal routines are distinct from LAPACK's, so it is safe to compile
ATLAS's LAPACK routines directly into a netlib-style LAPACK library.
First, download and install the standard LAPACK library from the
LAPACK homepage.
Then, in your ATLAS/lib/ARCH directory (where you should have a
liblapack.a), issue the following commands:
mkdir tmp
cd tmp
ar x ../liblapack.a
cp <your LAPACK path & lib> ../liblapack.a
ar r ../liblapack.a *.o
cd ..
rm -rf tmp
How do I restart an install from scratch?
From your ATLAS directory, issue:
make killall arch=ARCH
make startup arch=ARCH
make install arch=ARCH
How do I restart an interrupted install?
If your ATLAS install was interrupted, and you have fixed the problem,
you can usually safely (there are always exceptions; if the install died
in the middle of an ar command, for instance, many systems cannot recover)
restart the install by:
How do I get rid of all the .o's?
ATLAS does not have a working "make clean" that leaves the
architecture-specific directory structure in place. Issuing
"make kill arch=ARCH"
in your ATLAS directory, however, will remove all
architecture-specific subdirectories, with the exception of
ATLAS/lib/ARCH, along with all related object files. Issuing
"make killall arch=ARCH" gets rid of all architecture-specific
subdirectories.
Do NOT use the -fno-f2c flag with g77
Haven't tracked this down in a while, but it appears to break quite
a few things in fairly non-obvious ways for mixed g77/gcc libs.
What happens if I install with no Fortran compiler?
ATLAS will still install correctly, though it will obviously not create the
Fortran77 interface libraries. You will not be able to run the
testers under the ATLAS/interfaces/ directory, since these testers
are written in Fortran. Further, ATLAS expects that you will be comparing
against a Fortran77 interface BLAS, and this will obviously not be the
case, and so you will need to make the following changes if you want to
run any of the ATLAS tester/timers, even the ones written in C:
#define USE_F77_BLAS
to:
#define USE_L1_REFERENCE
#define USE_F77_BLAS
to:
#define USE_L2_REFERENCE
#define USE_F77_BLAS
to:
#define USE_L3_REFERENCE
#define TRUST_SMALL
My performance drops off for very large problems (N > 1500)
This is usually due to the normal install failing to set CacheEdge to any
value, and then eventually ATLAS winds up using memory-saving algorithms
that hurt performance. The solution is to set CacheEdge, so we use less
workspace, while improving overall performance.
Post install tuning.
Here are some tips for improving ATLAS performance after an install:
Tuning CacheEdge.
CacheEdge is a Level 2 Cache blocking parameter; because its effects are
fairly subtle on most machines, the search for it often goes wrong on machines
experiencing any kind of load, causing performance to be suboptimal. CacheEdge can
improve performance by as much as 15%, and it can reduce ATLAS's
memory usage as well.
./xdfindCE -m [N] -n [N] -k [N]
where [N] is replaced by a very large number that is a multiple
of your blocking factor. You want to make this number as large as you
can stand to wait on, and this varies a great deal from machine to machine.
A good guesstimate for most machines might be around 2000.
TA TB M N K alpha beta CacheEdge TIME MFLOPS
== == ====== ====== ====== ====== ====== ========= ========= ========
T N 1000 1000 1000 1.00 1.00 0 5.470 365.63
T N 1000 1000 1000 1.00 1.00 16 5.470 365.63
T N 1000 1000 1000 1.00 1.00 32 5.460 366.30
T N 1000 1000 1000 1.00 1.00 64 5.470 365.63
T N 1000 1000 1000 1.00 1.00 128 5.260 380.23
T N 1000 1000 1000 1.00 1.00 256 5.240 381.68
Initial CE=256KB, mflop=381.68
Best CE=256KB, mflop=381.68
So we want to set CacheEdge to 1024*256 = 262144. atlas_cacheedge.h
will look something like:
#ifndef ATLAS_CACHEEDGE_H
#define ATLAS_CACHEEDGE_H
#define CacheEdge 196608
#endif
If your initial install did not use CacheEdge, line 3 will be missing
completely. If you don't have this line, you would simply add it, using
the new value of 262144. In the above example, we would simply
replace 196608 with 262144.
Arch | L2 Cache | CacheEdge |
---|---|---|
PPRO | 256K | 147456 |
PII | 512K | 262144 |
PIII | 512K | 262144 |
PIII | 256K | 249856 |
P4 | 256K | 131072 |
Athlon | 256K | 131072 |
Athlon | 512K | 307200 |
Once you have set CacheEdge to the value you need, update all libs with the new setting by issuing make xdl3blastst xsl3blastst xcl3blastst xzl3blastst in your ATLAS/bin/ARCH directory.
Special hints for setting CacheEdge for multiprocessor machines
CacheEdge turns out to be very important to threaded performance.
Unfortunately most of the default CacheEdge settings were obtained on
single processor machines. So, you may well be able to see a substantial
speedup by changing CacheEdge for your multiprocessor system.
The basic technique for finding CacheEdge is given here. Unfortunately, xdfindCE presently operates only on uniprocessor code, so what you want to use instead is varying CacheEdge and iteratively compiling and running x[pre]l3blastst_pt until you have a number you are happy with. It is vital to use a large problem. Use the largest problem you can stand to wait on for this many timing runs.
x[pre]findCE usually takes the smallest CacheEdge setting possible, since this saves memory. For multiprocessor systems, however, it is vital to use as much of the available cache as possible so that the processors spend as little time contending for the bus as possible. Thus, you want to set CacheEdge to the largest value that gives decent results. I usually run xdfindCE a few times to get an idea of ranges, and then try the larger settings by running x[pre]l3blastst_pt. Remember that threaded timings have to use walltime, so make sure any speedup is repeatable before changing CacheEdge.
Changing ATLAS's maximum buffer space.
Another way to tune ATLAS to your system is to vary the amount of
buffer space ATLAS is allowed to allocate. In general, you want to
set this as high as you can without causing swapping. If you have
a machine with low memory, and you see dramatic slowdowns as the
problem size goes up, you should definitely choose a smaller max size.
ATLAS defaults to 4MB (except on the x86-64, where we default to 8MB). To
vary this value, edit the file:
ATLAS/include/atlas_lvl3.h
and set ATL_MaxMalloc to the maximal number of bytes that you want ATLAS to allocate internally. After editing, rebuild the libs by issuing:
make xdl3blastst xcl3blastst xzl3blastst xsl3blastst
in your ATLAS/bin/ARCH directory.
Improving ATLAS small case performance by changing malloc behavior
ATLAS allocates a buffer space for most GEMM calls. When I wrote it,
my assumption was that only the first call requires a switch to kernel space
to do the allocation, and incurs the unneeded overhead of zeroing out the
memory. However, by default Linux (as well as some other OSes, such as
OS X) allocates non-trivial sized allocations using mmap, which
means that when free is called, the memory is immediately returned
to the system. Thus all malloc calls have extremely high overheads.
This is not a big problem if you are doing a large matrix multiply, where the cubic computation disguises this square cost. For small problems, though, the O(N**2) costs are actually dominant, and this type of malloc behavior effectively doubles them (at least). You should be able to change Linux's malloc behavior by setting these environment variables:
setenv MALLOC_TRIM_THRESHOLD_ -1
setenv MALLOC_MMAP_MAX_ 0
Once this is done, malloc should be cheaper, but ATLAS was tuned with the expensive malloc. Therefore, you may be able to get better small-case performance by rerunning the crossover search with these environment variables set (don't do this unless you are going to keep these settings whenever you use this library). You can rerun the search from the ATLAS/tune/blas/gemm/ARCH directory by issuing:
make sRun_tfc pre=s
make dRun_tfc pre=d
make cRun_tfc pre=c
make zRun_tfc pre=z
This search takes a *loooong* time. Then, to build the changes into the libraries, go to ATLAS/bin/ARCH, and issue:
make xsl3blastst
make xdl3blastst
make xcl3blastst
make xzl3blastst
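If you would rather not depend on environment variables, the same two malloc settings can be requested from inside a program. The following is only a sketch, assuming the GNU C library, where mallopt with M_TRIM_THRESHOLD and M_MMAP_MAX corresponds to the MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_MAX_ variables. The helper name keep_freed_memory_in_process is hypothetical and is not part of ATLAS.

```c
#include <stdlib.h>   /* on glibc this also makes __GLIBC__ visible */
#if defined(__GLIBC__)
#include <malloc.h>   /* mallopt, M_TRIM_THRESHOLD, M_MMAP_MAX */
#endif

/*
 * Hypothetical helper: ask glibc's malloc to keep freed memory inside the
 * process instead of handing it back to the system on every free.
 * Returns 1 on success, 0 if mallopt rejected a setting, and -1 when not
 * built against glibc (in which case nothing is changed).
 */
int keep_freed_memory_in_process(void)
{
#if defined(__GLIBC__)
    /* -1 disables trimming of the top of the heap. */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);
    /* 0 disables the use of mmap for large allocation requests. */
    int ok_mmap = mallopt(M_MMAP_MAX, 0);
    return (ok_trim == 1 && ok_mmap == 1) ? 1 : 0;
#else
    return -1;
#endif
}
```

Such a call would need to run before the first BLAS call to have any effect on ATLAS's buffers. The same caveat applies as with the environment variables: a library tuned under one malloc behavior should be used under that behavior.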
When linking ATLAS's testers, I'm getting a bunch of
undefined BLAS symbols (eg. dgemm_, dgemv_, etc).
The ATLAS BLAS testers (x[s,d,c,z]l[1,2,3]blastst) expect to compare
against a F77 interface BLAS library for performance and testing purposes.
You get these missing symbols when your Make.ARCH's BLASlib
is left blank, or does not point at a complete BLAS library. If you have
a non-ATLAS BLAS built somewhere, point the BLASlib macro at it. If you
don't, the easiest fix is probably to grab the
Fortran77 reference BLAS
tarfile, and build it into the required lib. If you don't want to
do this, or don't have access to Fortran77, then you can have ATLAS
test against its own C reference as discussed
here.
I'm linking with C, and getting missing symbols
(such as w_wsfe, do_fio, w_esfe or
s_stop).
These kinds of symbols are Fortran library calls. The problem is that the
C linker does not automatically find the Fortran libraries. The most
common fix is either to link using your Fortran linker, or to rewrite your
code so that Fortran routines are not called. If you know where they are,
you can also choose to link in the Fortran libraries explicitly.
ATLAS performance is very bad using gcc 3.0 or
Red Hat 7.[1-3]'s gcc ( 2.96-85)
On Athlons, the ATLAS group has confirmed a performance drop of almost
a factor of 2 when using gcc 3.0 or 2.96-85.
A user has reported a similar (though less severe) drop on a Pentium III.
More details on this problem are given
here.
For now, the solution is to use one of the older gccs.
Any of the 2.x series previous to 2.96-85 should do; complete instructions
are given here.
As far as ATLAS is concerned, gcc 3.0 is more like a new compiler than
a new gcc version. This means that if you want to try using gcc 3.0
(only makes sense on non-x86 platforms, at this point), you will need
to say "no" to architectural defaults, since they are unlikely to be
optimal. The only other platform aside from the Athlon we've tried
gcc on is the ev6, where it beats earlier versions of gcc performance-wise,
but only if you use different settings than the current architectural
defaults.
Installing gcc under unix without being root
You do not need to be root to install a gcc that will deliver decent
performance for ATLAS. I include below the exact steps I use to install
the C compiler only in my own home area. Changing my home area path (given
in the --prefix command to configure) to yours should allow you to do the same.
These directions are for x86 users, where ATLAS needs gcc 2.95.3 for decent
performance. They work pretty much the same for gcc 3.x, which is needed
for best ev5/6 and UltraSparc performance. Note that these directions will
install g77 as well. The fortran compiler is not needed for ATLAS performance,
so if you want to use a different fortran compiler than this version of g77,
simply omit f77 from the --enable-languages step.
Help for building ATLAS under windows
I myself do not use Windows as my primary OS, and often have difficulty
getting access to various hardware/software combinations, as well as not
being able to afford the software itself. There are people out there
who use ATLAS on windows, and I'm going to provide some links here. These
links are not supported by the ATLAS team, and may come out of date, and
I cannot warrant their accuracy; however, they may provide help that is
not covered here:
Building ATLAS with a non-cygwin compiler
If you want to build ATLAS with a non-cygwin compiler (i.e., a native windows
compiler such as Intel's icl or Compaq's CVF), you will need
to perform the following steps:
Now, if you want to use Compaq Visual Fortran, you'd add something like:
If you are using the Intel C compiler, you would need:
I can't run the Intel F77 compiler, but you'd need a similar line to its
lib directory as well.
ATLAS uses two compilers for C compilation. One compiler compiles
the generated matrix multiply kernels (this compiler is called MCC, and is
set using config's -m flag). The C compiler that compiles everything
else is called CC, and is set using the -c flag.
You can pretty much set any of the three ATLAS compiler macros to any of
the supported compilers of the appropriate language. However, right now
you cannot choose gcc as the CC compiler unless you
also choose g77 as the Fortran compiler. This is a compiler bug,
and I believe the gcc folks are
already aware of it.
Along with this restriction, it should be noted that mvc is much
slower than the other C compilers if it is used for MCC. Finally,
you may want to use compilers for which there are architectural defaults. There
are arch defaults for gcc for most platforms. The Intel compiler has arch
defaults for P4 and PIII. There are arch defaults for only CC on
the Pentium 4 for mvc. Are you hopelessly lost yet? All this
discussion is just for those who want to understand exactly what's going
on. To cut to the chase, we provide a table of the more common scenarios
and some examples below.
This table shows some common Windows usages, listed by what compiler you
use to perform the link (i.e., what compiler you are using for your own
application). For a given Fortran & C interface that you use, the table then
shows what each compiler-controlling flag of config should be set to
for best performance. If a flag is set to NONE, that means you should not throw
that flag at all. If a YOU LINK is set to NONE, that means you don't use
that language interface to the BLAS in your application.
Therefore, if you are using gcc & g77 as your compilers, you take defaults
all the way, and can start the install with:
If you link from mvc only, you probably want:
If you link from cvf, you probably want:
All of these options should get roughly the same performance.
You can also vary the flags for each compiler on the command line
using -F: the argument after -F says which compiler's flags to set,
and the argument after that gives the flags themselves.
For instance, if you want to use CVF with Intel icc,
but without the no mixed string modifier that ATLAS turns on by default,
you would have to:
Next, if you are using the Intel compilers and have the older
Visual Studio 6, you need to add the flag
The reason ATLAS presently requires gcc to get you threaded support is that
the supported Windows compilers don't supply the required include file
pthread.h. My MSVC++ is version 6, and I suspect newer versions
actually do provide pthreads. If they do, you can try to install, and
it may work. If it does, however, you will still have a small problem.
I am aware of no wall-timer under Windows, and you need a wall-timer to
get meaningful timings of threaded code. ATLAS defaults to calling a
cpu timer, which will make the install work, but will not give you
any idea how fast your lib is. Therefore, you will need to
turn on ATLAS's assembler timer.
To do this, before you do 'make install', change line 110 of
ATLAS/makes/Make.sysinfo from:
Now, before install, but after config, edit the Make.ARCH created
by config, and add the definition -DPentiumCPS=[MyMhz] to the
CDEFS macro. For instance, on my 2.5Ghz P4, I would change the line:
To do this, before you run config, edit ATLAS/CONFIG/probe_SSE?.c, and
change the first executable line to be:
More generally, gcc 4 has produced slower or the same speed ATLAS code on
all architectures we've tried, so I recommend using gcc 3.x if possible.
If gcc 4 is the only compiler you've
got, you will probably be OK if on an x86 arch, as ATLAS's assembly code
will insulate you from many harmful effects. If you are on a non-x86
architecture, you should see if your results are in line with those
reported here.
You may also want to install without arch defaults, and play with flags,
and take the best library (def/nodef). All in all, installing gcc 3.4
is not that hard!
If you are on an Itanium, and cannot get gcc 3 to work, you will need to
edit ATLAS/tune/blas/gemm/CASES/?cases.flg, and change the
compiler flag lines for ATL_mm6x8x8_1p.c and
ATL_mm8x8x2.c to:
If you must use gcc 4, note that there seems to be a compiler error
(or perhaps an error in my understanding of C that isn't enforced in
any other C compiler). You need to move the prototype of ATL_L2GE
on lines 67 and 68 of ATLAS/bin/uumtst.c before the start of
the function (gcc 4 can't take static func prototypes inside functions
anymore).
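The kind of change involved can be sketched as follows. This is an illustration only, not the actual code from uumtst.c: the names scaled_sum and run_check are hypothetical and do not reflect the real signature of ATL_L2GE. For what it is worth, the C standard says that a function declaration at block scope may carry no explicit storage-class specifier other than extern, so moving the static prototype to file scope is the portable form.

```c
/*
 * File-scope prototype: 'static' is allowed here. Writing the same
 * declaration inside the body of run_check() is what gcc 4 rejects.
 */
static int scaled_sum(int n, const int *x, int alpha);

int run_check(void)
{
    const int x[3] = {1, 2, 3};

    /* The block-scope 'static int scaled_sum(...);' used to live here. */
    return scaled_sum(3, x, 2);
}

/* Definition of the file-local helper declared above. */
static int scaled_sum(int n, const int *x, int alpha)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += x[i];
    return alpha * sum;
}
```

The function body stays the same; only the location of the prototype changes.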
In order to make config detect only SSE2, rig the SSE3 probe
to fail by commenting out line 77 of
ATLAS/CONFIG/probe_SSE3.c. Line 77 is:
Now, you can just use the compiler (gcc 3.x) and arch defaults,
and install should go smoothly.
Setting your LIB environment correctly
You need to set your LIB variable to be the union of the library directories of all the Windows
compilers you will be using. All Windows compilers require the VC libs,
so start your string (modifying path & version info appropriately)
something like this:
export LIB="C:/Program Files/Microsoft Developer Studio/VC98/LIB;"
export LIB="C:/Program Files/Microsoft Developer Studio/DF98/LIB;"$LIB
export LIB="C:/Program Files/Intel/CPP/COMPILER80/Ia32/Lib;"$LIB
Telling config about your windows compilers
Config presently knows how to handle three windows-specific compilers
(in addition to the cygwin compilers g77 & gcc):
Compaq's Visual Fortran (cvf),
Microsoft Visual C (mvc), and
Intel's C compiler (icc).
Either Fortran compiler may be used to build the Fortran77 interface, with
no effect on performance. ATLAS's F77 compiler is controlled through
config's -f flag.
     You Link           ./xconfig flags
F77        C            -c     -m     -f
g77/NONE   gcc          NONE   NONE   NONE
CVF        MVC          mvc    NONE   cvf
ifort      icc          icc    icc    ifort
g77/NONE   MVC          mvc    NONE   NONE
ifort      gcc/NONE     mvc    NONE   ifort
cd ATLAS ; make
cd ATLAS ; make xconfig ; ./xconfig -c mvc
cd ATLAS ; make xconfig ; ./xconfig -c mvc -f cvf
make xconfig ;
./xconfig -m icc -c icc -f cvf -F f '-fast -assume:accuracy_sensitive -fltconsistency'
NOTE: ATLAS always calls CVF with /iface:cref. Also, the accuracy
and consistency arguments are necessary if the testers are going to pass.
Finally, the C interface testers cannot handle CVF without the
nomixed_str_len_arg, so you will die in the C interface testing
(and thus the sanity tests) if you don't keep this flag (which is on by
default). Therefore, if you override the default flags and don't specify
nomixed, realize that failing the C interface tests comes with the
territory, and reflects a tester failing, not necessarily an error in
your library.
Post-config Make.ARCH fiddling for windows
After config has run, you need to edit the Make.ARCH by hand and change one
or two settings. First, you need to change TOPdir so it has the
full cygwin-style path, including the cygdrive letter. For instance, I
changed mine from:
TOPdir = /home/Owner/ATLAS
to:
TOPdir = /cygdrive/c/cygwin/home/Owner/ATLAS
You'll need to hunt around a bit if you don't remember where you installed
cygwin.
-Qvc6
to both your MMFLAGS and CCFLAGS.
Using pthreads under Windows
Right now, the only way to use pthreads under Windows is if you compile
with gcc. Due to the aforementioned gcc bug, this currently means that your
Fortran compiler must be g77. If this is acceptable, no extra steps are
needed: just run config as normal, and say you want to use threads.
$(CC) -c $(CCFLAGS) ATL_walltime.c
to:
$(BC) -c $(BCFLAGS) ATL_walltime.c
CDEFS = $(L2SIZE) $(INCLUDES) $(F2CDEFS) $(ARCHDEFS)
to:
CDEFS = $(L2SIZE) $(INCLUDES) $(F2CDEFS) $(ARCHDEFS) -DPentiumCPS=2500
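The reason the clock rate matters can be seen from the arithmetic a cycle-counting timer has to do. The following is a sketch of that conversion only, not ATLAS's timer source; the function name cycles_to_seconds is hypothetical, and the clock rate is assumed to be given in MHz, as in the PentiumCPS definition above.

```c
#include <assert.h>

/*
 * Convert a count of CPU cycles to seconds, given the clock rate in MHz.
 * If the rate passed in does not match the real clock rate, every timing
 * is scaled by the same wrong factor.
 */
double cycles_to_seconds(unsigned long long cycles, double mhz)
{
    assert(mhz > 0.0);
    return (double)cycles / (mhz * 1.0e6);
}
```

For example, 2,500,000,000 cycles at 2500 MHz comes out as one second, while the same count with the rate mistakenly given as 1250 would be reported as two seconds.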
Forcing 3DNow! detection on SSE-enabled Athlon
If ATLAS detects SSE, it will not use 3DNow! instructions even if present
due to 3DNow's non-IEEE compliance. We highly recommend that you leave the
ATLAS behavior like this, as underflow/overflow (which 3DNow! absolutely
does not handle) never seems like a big deal until you get a completely
incorrect answer. However, if you are certain that your code
never produces under/overflow, and wish to use non-IEEE computations,
you can force ATLAS to detect your 3DNow! capabilities by artificially
causing the SSE probe to fail.
printf("FAILURE\n"); exit(-1);
Note that this makes sense to do only on 32-bit Athlons: ATLAS's SSE code runs
faster than its 3DNow! code on hammer-based machines (e.g., Athlon-64, Opteron),
while at the same time being IEEE compliant.
How about C++ header files for the C interfaces?
Since ATLAS does not provide full OO C++ interfaces, I am reluctant to raise
the expectation that it does by providing C++ specific header files. What
I have always envisioned is the C++ programmer creating his own include
files, such as:
>cat cppblas.h
extern "C" {
#include <cblas.h>
}
>cat cpplapack.h
extern "C" {
#include <clapack.h>
}
If you are a C++ programmer using ATLAS, and think differently, let me know.
Missing symbols when linking with g77 on OS X.
When linking AltiVec-enabled code under OS X using g77, I got missing symbols
such as:
/usr/bin/ld: Undefined symbols:
restFP
saveFP
Problems with linking/missing
LAPACK routines on OS X
OS X has a built-in version of ATLAS, and uses the standard names for its libraries.
They may be less up-to-date and/or include fewer libs than something you install
yourself; in particular, if you have a Fortran compiler, you can build
a full lapack library, which Apple does not currently provide, and so
many users want to install the standard ATLAS.
Unfortunately, when searching for libs the compiler looks in the
system areas where Apple keeps its ATLAS libs before looking in
directories supplied by -L. This means that if you use
-L and -l for your linking, you always get Apple's modified
ATLAS, rather than the one you installed. There are two fixes for this
problem that I know of. First, you
can just link to the full name and path, rather than using -L.
For instance, change something like:
gcc -o xtst test.c -L /home/whaley/TEST/ATLAS/build64/lib -lcblas -latlas
to:
gcc -o xtst test.c /home/whaley/TEST/ATLAS/build64/lib/libcblas.a \
/home/whaley/TEST/ATLAS/build64/lib/libatlas.a
The only other trick I'm aware of is to rename your ATLAS libraries so that
the Apple versions will not override them.
Config hangs in compiler search
Particularly on Solaris, ATLAS will sometimes hang in the search for
valid compilers. The easiest fix is to do several ctrl-C's
(breaking out of config's find call, but not config) until config comes back
and asks you for the compiler, and then you enter the full path.
Should I use the newer gcc 4.x rather than 3.x?
Gcc 4.2 and newer are good compilers. The 4.0 and 4.1 series are poor
compilers on x86 machines, as discussed
in this gcc
bugzilla report. If you are using the 3.7 series, which is highly
recommended, and should become the new stable soon, then using gcc 4.2
is recommended for all platforms. This is true even on the PowerPC,
where this
gcc bug costs you some performance.
For 3.6, you should probably stick with
gcc 3, since that is what the arch default config flags are tuned for.
-fomit-frame-pointer -O2 -fno-tree-loop-optimize
You should also set all your C compiler lines in your Make.ARCH
to these flag values, and not use the architectural defaults.
Avoiding SSE3 for ease of installation
I will be producing a new stable with both optimization and config support for the
SSE3 versions of architectures this summer. For right now, the best idea is
to fool config into thinking your machine is an earlier version of the architecture
that had only SSE2 (unless you are using the 3.7.1x, which has SSE3 support for
the P4e only). Then, you can use the default compilers, flags, and architectural
defaults. For pretty much everyone, this will produce a faster library than
anything you build with SSE3 support. Note that you want gcc 3.x, since newer
versions of the gcc still run slower, as explained here.
If you are on a 64-bit architecture,
use gcc, not icc, as icc didn't have 64-bit support when I added 64-bit support.
This will all be fixed soon.
if (testv3[0] != 3.0 || testv3[1] != 7.0)