NEMOLite2D Distribution
-----------------------

Authors: H. Liu, National Oceanography Centre, Liverpool
         A. Porter, STFC Daresbury Laboratory
         J. Appleyard, NVIDIA Inc.

This README is for the distribution of the reference and optimised
versions of the NEMOLite2D benchmark created as part of the GOcean
Project. The benchmark is based upon the 2D, free-surface part of the
NEMO ocean model and was originally created by Hedong Liu of the
National Oceanography Centre, Liverpool, UK.

* BEFORE COMPILING AND RUNNING PLEASE READ THIS DOCUMENT, *
* ESPECIALLY 'Technical details' REGARDING COMPILATION.   *

Introduction and Purpose
------------------------

This distribution contains various versions of the nemolite2d code.
First, there is the original, serial version of the code in
nemolite2d_serial/nemolite2d_orig. This was subsequently re-structured
to follow the 'PSyKAl' separation of concerns where PSyKAl stands for
Parallel System, Kernel and Algorithm layers. In this PSyKAl version
(in nemolite2d_serial/nemolite2d_vanilla_single_invoke) all
computation is performed in point-wise kernels (the bottom layer). The
algorithm (top) layer specifies operations on the whole solution
arrays. The middle layer (routines prefixed by 'invoke_') glues the
Algorithm and Kernel layers together. Ultimately this middle layer
will be generated by the PSyclone system but that's another story.

What we are ultimately interested in is a) whether there is a
performance cost in using the PSyKAl approach compared with the
original code, and if so, why and how much, and b) what optimisations
improve performance, whether these differ across architectures (and
possibly problem sizes), and whether these optimisations can be
replicated in the PSyclone generation system. We have therefore
performed a series of optimisation steps on the nemolite2d code
and each version is in a directory under nemolite2d_serial.
Following this, we experimented with optimising for parallel
performance using both OpenMP and OpenACC. The corresponding
sequences of versions of the code are in nemolite2d_omp and
nemolite2d_acc, respectively.

Getting Going
-------------

If you're impatient to build and run something then this section is
for you. To build a version:

 1. Change to the relevant directory, e.g. nemolite2d_serial/nemolite2d_orig
 2. Create a suitable Makefile.include file in that directory (examples
    you can copy or link are in the arch directory).
 3. Run make. This should build the GOcean and dl_timer libraries and
    then the nemolite2d application itself.
 4. The problem size, number of time-steps etc. are read from the
    file 'namelist' so edit this as required.
 5. You can now run the executable produced in step 3.
 6. Examine the timing report that is printed to stdout. We're interested
    in the mean time per time-step, reported as 'Average/repeat'.

Contents
--------

The directory containing this README should also contain the following
files and directories:

api_v1.0          - contains the GOcean infrastructure library sources
dl_timer          - contains the dl_timer library
arch              - contains various examples of Makefile.include
LICENSE           - the license under which this code is distributed
nemolite2d_serial - all serial versions of NEMOLite2D (including the
                    original, unchanged form)
nemolite2d_acc    - the various OpenACC versions of NEMOLite2D
nemolite2d_omp    - the various OpenMP versions of NEMOLite2D

Please see the individual README files under the various nemolite2d_*
directories for more details about each set of optimised sources.

Problem sizes
-------------

Note that the problem size defined in the 'namelist' file is for the
*whole* of the model domain. The simulated domain sits within this,
surrounded by a shell of grid points that is one point deep. This
shell defines the boundary conditions. The default size provided is
258, which gives a simulated domain of 256 (258 less one boundary
point at each end). We would like results from the following "base"
problem sizes: 258, 514, 1026. Example namelists for these are given
in the nemolite2d directory, named
'namelist.<internal_domain_size>'. Note that the number of time-steps
is reduced for the larger domain sizes in order to keep run-times
down.
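
The relationship between the size given in the namelist and the size
of the simulated domain can be sketched in a few lines of Python. This
is only an illustration of the arithmetic described above: the function
and constant names are made up for this example and are not part of the
NEMOLite2D or GOcean sources.

```python
# Depth of the boundary shell surrounding the simulated domain
# (one grid point, as described above).
SHELL_DEPTH = 1


def simulated_size(namelist_size):
    """Return the extent of the simulated domain for a namelist size."""
    interior = namelist_size - 2 * SHELL_DEPTH
    if interior <= 0:
        raise ValueError("namelist size too small to hold a simulated domain")
    return interior


# The three "base" problem sizes and the simulated extents they give.
for size in (258, 514, 1026):
    print(size, "->", simulated_size(size))
```

Going the other way, a simulated domain of extent N needs a namelist
size of N + 2.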

The following (small) optional sizes are also of interest,
primarily for CPUs, as the cache performance can improve for
smaller problems:
66, 130

The following (large) optional sizes are also of interest,
primarily for GPUs, to ensure there is enough work to keep the GPU
busy:
2050, 4098

Whilst the suggested problem sizes give simulated regions whose
extents are powers of two, relative performance may differ for
non-power-of-two sizes and you might want to look into these if you
have time.

Timing and timing calls
-----------------------

The only timing we actually require is the time per step, which is
obtained from the timer around the time-stepping loop. In the provided
code this is reported to stdout as the 'Average/repeat' time. Other
timers might be useful to see how the different parts of the code
scale but should not get in the way of optimisations. For example,
if you fuse two loops that each have their own timer, then the
obvious thing to do is to replace the two timers with a single,
appropriately named timer around the fused loop.
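
The loop-fusion example above can be illustrated with a small,
language-neutral sketch in Python. Note that this does NOT use the
dl_timer library (whose interface is not described in this README):
the 'timer' helper, the timer names and the loop bodies are all
invented purely for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds for each named timer. This is a
# stand-in for a real timing library, not the dl_timer API.
timings = defaultdict(float)


@contextmanager
def timer(name):
    """Add the time spent inside the 'with' block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start


def separate_loops(a):
    # Before fusion: two loops, each wrapped in its own timer.
    with timer("scale"):
        b = [2.0 * x for x in a]
    with timer("shift"):
        c = [x + 1.0 for x in b]
    return c


def fused_loop(a):
    # After fusion: a single timer, named for the combined work,
    # replaces the two original timers.
    with timer("scale_and_shift"):
        c = [2.0 * x + 1.0 for x in a]
    return c
```

Both functions return the same values. Only the granularity of the
timing information changes, and the timer around the time-stepping
loop (the one we actually need) is unaffected.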

Jitter (noise) in the timing results can be a problem, particularly
for the smaller test cases. We therefore recommend that the number of
time-steps in the namelist file is set to a value that produces
run-times for which jitter is not significant. For reproducible
results you must ensure that the process is pinned to a particular
core. We recommend repeating the run a number of times (~5-10) and
taking the time of the quickest run as being 'the answer'.
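
As a minimal sketch of the 'quickest run' rule, the Python below picks
the smallest time per step from a set of repeated runs. The function
name is made up for this example and the numbers are placeholders, not
measured results. In practice you would fill the list with the
'Average/repeat' value reported by each run.

```python
def best_time(times_per_step):
    """Return the quickest time per step from a set of repeated runs."""
    if not times_per_step:
        raise ValueError("need the result of at least one run")
    return min(times_per_step)


# Placeholder values for illustration only (seconds per time-step).
runs = [0.0123, 0.0119, 0.0131, 0.0120, 0.0122]
print("quickest run:", best_time(runs), "s per time-step")
```

The minimum is used rather than the mean because noise from the rest
of the system can only make a run slower, never faster.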

Note that both the GOcean and dl_timer libraries contain support for
OpenMP and therefore will report things such as the number of OpenMP
threads being used, even if the actual version of NEMOLite2D you are
running is not OpenMP parallel.

Technical details (compilation etc.)
------------------------------------

The Makefile for each version of nemolite2d included in this
distribution expects a Makefile.include configuration file to be in
the same directory as the Makefile. Examples of this file for various
compilers are provided in the arch directory. Create
'Makefile.include' by either copying or linking to one of the supplied
arch/Makefile.include.<compiler_name> files. You will almost certainly
have to edit the settings in the file in order to match your local
set-up. Once this is done, simply running 'make' should produce an
executable.
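
If you would rather script this set-up, the steps above could be
sketched in Python as follows. This is an illustration under stated
assumptions and is not a tool shipped with this distribution: the
function names are invented, and the 'gnu' suffix in the usage comment
is a hypothetical compiler name, so substitute whichever
Makefile.include.<compiler_name> file you actually have in arch.

```python
import subprocess
from pathlib import Path


def link_makefile_include(version_dir, arch_file):
    """Create version_dir/Makefile.include as a link to arch_file."""
    arch_file = Path(arch_file).resolve()
    if not arch_file.is_file():
        raise FileNotFoundError(arch_file)
    target = Path(version_dir) / "Makefile.include"
    # Refuse to overwrite an existing file, since it may hold local edits.
    if target.exists() or target.is_symlink():
        raise FileExistsError(target)
    target.symlink_to(arch_file)
    return target


def build(version_dir):
    """Run 'make' in version_dir, raising an error if the build fails."""
    subprocess.run(["make"], cwd=version_dir, check=True)


# Example usage ('gnu' is a hypothetical compiler name):
#   link_makefile_include("nemolite2d_serial/nemolite2d_orig",
#                         "arch/Makefile.include.gnu")
#   build("nemolite2d_serial/nemolite2d_orig")
```

Linking rather than copying means that edits are shared between all
versions that link to the same file. Copy the file instead if you need
different settings for different versions.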

The majority of the versions of nemolite2d depend upon the GOcean
Infrastructure library which is in the api_v1.0 directory.  It should
be built automatically by the make system when the nemolite2d code is
built. Similarly, the dl_timer library is used to time program execution
and should also be built automatically as required.

Contact
-------

If you have any questions about this distribution then please feel free
to contact either Andrew Porter or Rupert Ford who can be reached at
firstname dot lastname at stfc.ac.uk.
