Improving flag selection using mmflagsearch

If you are on an architecture or using a gcc for which configure does not suggest flags, or if you believe the present set is out of date, you can quickly search through a host of compiler flags to find the best set for a given gemm kernel using the specialized routine mmflagsearch.c. To do this, you need a working install, typically built with your best guess at good flags. Then, in your BLDdir/tune/blas/gemm directory, issue make xmmflagsearch.

The idea behind this search is that it takes an ATLAS GEMM kernel description file (output from one of the ATLAS searches), and then tries a series of flags given in another file, and returns to you the best combination found. The important flags are:

-p [s,d,c,z]
: set type/precision prefix
-f <flagfile>
: file containing all flags to try
-m <mmfile>
: mmsearch output file describing kernel to time

The mmfile describes the matmul kernel that you wish to use to find the best flags. If this argument is omitted, the search automatically reads res/<pre>gMMRES.sum, which is the best kernel found during the prior install using the scalar ANSI C generator. If bad flags have caused that install search to generate a weird file, you can copy the file to a new name, hand edit it to have the features you like, and pass the copy with -m.

In the flagfile, any line beginning with '#' is ignored. This file has a special format that is easier to follow once you understand the method of the search. The first line gives any flags that should always appear (examples include things like -fPIC, -m64, -mcpu=XXX, etc.).

The search first finds the appropriate optimization level and fundamental flag combination by trying all combinations of these flags. Once these baseline flags are determined, all remaining flags are tried one after the other using a greedy linear search. With this in mind, the format of this file is:

Required flags for all cases (eg. -fPIC -m64 -msse3 -mfpmath=sse)
<N>      Number of optimization level lines
<lvlflagset1>
....
<lvlflagsetN>
<F>      Number of fundamental flag lines
<fundflagset1>
....
<fundflagsetF>
# Now list any number of modifier flag lines
flag set 1
flag set 2
...
flag set X
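
The layout above can be read with a few lines of code. The following is a minimal Python sketch of one reading of the format, not the parser mmflagsearch.c actually uses; the function name parse_flagfile is hypothetical, and skipping blank lines and allowing descriptive text after each count are assumptions.

```python
def parse_flagfile(text):
    """Split a flag file into (required, levels, fundamentals, modifiers)."""
    # Lines beginning with '#' are ignored; skipping blank lines is an assumption.
    lines = [ln.strip() for ln in text.splitlines() if not ln.startswith("#")]
    lines = [ln for ln in lines if ln]
    if not lines:
        raise ValueError("flag file has no required-flags line")

    def read_counted_block(pos):
        # A count line starts with an integer; any trailing description is dropped.
        count = int(lines[pos].split()[0])
        block = lines[pos + 1 : pos + 1 + count]
        if len(block) != count:
            raise ValueError("flag file ended before %d lines were read" % count)
        return block, pos + 1 + count

    required = lines[0]
    levels, pos = read_counted_block(1)
    fundamentals, pos = read_counted_block(pos)
    modifiers = lines[pos:]  # everything left is a modifier flag line
    return required, levels, fundamentals, modifiers


example = """-fPIC -m64
2      Number of optimization level lines
-O2
-O3
1      Number of fundamental flag lines
-fschedule-insns
# Now list any number of modifier flag lines
-funroll-loops
-fprefetch-loop-arrays
"""
required, levels, fundamentals, modifiers = parse_flagfile(example)
```

Here required holds the always-used flags, levels and fundamentals hold the two counted blocks, and modifiers holds every remaining line.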

In other words, the search first tries all $N \times (F+1)$ combinations of the optimization levels and fundamental flags, and chooses the best-performing set. It then tries adding every provided modifier flag line to the best combination found. The best-performing flag list is reported at the end.
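
The two phases can be sketched in Python as follows. This illustrates the strategy described here and is not the code in mmflagsearch.c. The names search_flags and time_kernel are hypothetical, with time_kernel standing in for compiling and timing the kernel and returning a score where higher is better. The sketch assumes that the (F+1) factor means each level is tried alone and with each single fundamental flag line, and that a modifier is kept only when it improves the score.

```python
def search_flags(required, levels, fundamentals, modifiers, time_kernel):
    """Return (best_flags, best_score) for a caller-supplied timing function."""
    if not levels:
        raise ValueError("at least one optimization level is needed")

    # Phase 1: each level alone and with each fundamental line, N * (F + 1) timings.
    best_flags, best_score = None, float("-inf")
    for level in levels:
        for fund in [""] + list(fundamentals):
            flags = " ".join(part for part in (required, level, fund) if part)
            score = time_kernel(flags)
            if score > best_score:
                best_flags, best_score = flags, score

    # Phase 2: greedy linear pass; a modifier is kept only if it helps (assumption).
    for mod in modifiers:
        candidate = best_flags + " " + mod
        score = time_kernel(candidate)
        if score > best_score:
            best_flags, best_score = candidate, score
    return best_flags, best_score
```

With N level lines, F fundamental lines and X modifier lines, this sketch performs N*(F+1) + X timings, which is why the modifier list can be long without making the search impractical.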

To create such a flag file, one usually scours the compiler documentation for all performance-oriented flags. For gcc, you can make mmflagsearch give you a template that includes all non-architecture-specific optimization flags (as found in the documentation for gcc 4.2) by running ./xmmflagsearch -f gcc. This will create a file called gccflags.txt in the current directory, which presently has a format like:

REPLACE THIS LINE WT ARCH-DEP FLAGS ALWAYS USED (eg, -fPIC -m64 -msse3)
4
-O2
-O1
-O3
-Os
6
-fschedule-insns 
-fno-schedule-insns 
-fschedule-insns2
-fno-schedule-insns2
-fexpensive-optimizations
-fno-expensive-optimizations
# Flags to probe once optimization level is selected
...
whole boatload of flags
...

A similar file will be produced with some clang flags if you substitute clang for gcc above, though I was not able to find any central list of flags that I trusted, so the ones produced are probably insufficient and may not work on all systems.

Now let's see an example of this working on my ARM embedded machine. The first thing I do is replace the first line with my mandatory flags:

-mfpu=vfpv3 -mcpu=cortex-a8

I then add two architecture-specific flags to the auto-generated general flag list (you might want to try many more; this is just an example), which in this case are:
-mtune=cortex-a8
-mno-thumb
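
If you prefer to script these two hand edits, a minimal Python sketch might look like the following. The helper name customize_template is hypothetical, and appending the extra flags at the end of the file, so that they are read as modifier flag lines, is an assumption about where they belong.

```python
def customize_template(template_text, required_flags, extra_modifiers):
    """Replace the first line with required flags and append extra flag lines."""
    lines = template_text.splitlines()
    if not lines:
        raise ValueError("template is empty")
    lines[0] = required_flags      # the first line holds the always-used flags
    lines.extend(extra_modifiers)  # assumed to be read as modifier flag lines
    return "\n".join(lines) + "\n"


# Example use with the file and flags from the text above:
#   with open("gccflags.txt") as f:
#       text = f.read()
#   text = customize_template(text, "-mfpu=vfpv3 -mcpu=cortex-a8",
#                             ["-mtune=cortex-a8", "-mno-thumb"])
#   with open("gccflags.txt", "w") as f:
#       f.write(text)
```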

An extract of this search is shown in the figure below.

Figure: Result of ./xmmflagsearch -p d -f gccflags.txt on ARM
FINDING BEST FLAGS USING MA...
...e-insns2
-fprefetch-loop-arrays'

R. Clint Whaley 2016-07-28