User Guide

Below, we provide detailed instructions for how to start doing network simulations using TraceR.

Quickstart

This is a basic mpirun command to launch a TraceR simulation in the optimistic mode:

mpirun -np <p> ../traceR --sync=3  -- <network_config> <tracer_config>

Some useful options to use with TraceR:

--sync

ROSS’s PDES type. 1 - sequential, 2 - conservative, 3 - optimistic

--nkp

number of groups used for clustering LPs; recommended value for lower rollbacks: (total #LPs)/(#MPI processes)

--extramem

number of messages in ROSS’s extra message buffer (each message is ~500 bytes, 100K should work for most cases)

--max-opt-lookahead

leash on optimistic execution in nanoseconds (1 microsecond is a good value)

--timer-frequency

frequency with which PE0 should print current virtual time

Setting up a Simulation

Creating a TraceR configuration file

This is the format for the TraceR config file:

<global map file>
<num jobs>
<Trace path for job0> <map file for job0> <number of ranks in job0> <iterations (use 1 if running in normal mode)>
<Trace path for job1> <map file for job1> <number of ranks in job1> <iterations (use 1 if running in normal mode)>
...
<Trace path for jobN> <map file for jobN> <number of ranks in jobN> <iterations (use 1 if running in normal mode)>

If you do not intend to create global or per-job map files, you can use NA instead of them.

Sample TraceR config files can be found in examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF)

See Creating the job placement file below for how to generate global or per-job map files.

Creating the network (CODES) configuration file

Sample network configuration files can be found in examples/conf

Additional documentation on the format of the CODES config file can be found in the CODES wiki at https://xgitlab.cels.anl.gov/codes/codes/wikis/home

A brief summary of the format follows.

LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is.

MODELNET_GRP

repetition

number of routers that have nodes connecting to them.

server

number of MPI processes/cores per router

modelnet_*

number of NICs. For torus, this value has to be 1; for dragonfly, it should be router radix divided by 4; for the fat-tree, it should be router radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should also be specified (as 1). For express mesh, modelnet_express_mesh_router should also be specified as 1.

Similarly, the fat-tree config file requires specifying fattree_switch which can be 2 or 3, depending on the number of levels in the fat-tree. Note that the total number of cores specified in the CODES config file can be greater than the number of MPI processes being simulated (specified in the tracer config file).

Common parameters (PARAMS)

packet_size/chunk_size (both should have the same value)

size of the packets created by NIC for transmission on the network. Smaller the packet size, longer the time for which simulation will run (in real time). Larger the packet size, the less accurate the predictions are expected to be (in virtual time). Packet sizes of 512 bytes to 4096 bytes are commonly used.

modelnet_order

torus/dragonfly/fattree/slimfly/express_mesh

modelnet_scheduler

fcfs: packetize messages one by one.

round-robin: packetize message in a round robin manner.

message_size

PDES parameter (keep constant at 512)

router_delay

delay at each router for packet transmission (in nanoseconds)

soft_delay

delay caused by software stack such as that of MPI (in nanoseconds)

link_bandwidth

bandwidth of each link in the system (in GB/s)

cn_bandwidth

bandwidth of connection between NIC and router (in GB/s)

buffer_size/vc_size

size of channels used to store transient packets at routers (in bytes). Typical value is 64*packet_size.

routing

how are packets being routed. Options depend on the network.

torus: static/adaptive

dragonfly: minimal/nonminimal/adaptive

fat-tree: adaptive/static

Network specific parameters (PARAMS)

Torus:

n_dims

number of dimensions in the torus

dim_length

length of each dimension

Dragonfly:

num_routers

number of routers within a group.

global_bandwidth

bandwidth of the links that connect groups.

Fat-tree:

ft_type

always choose 1

num_levels

number of levels in the fat-tree (2 or 3)

switch_radix

radix of the switch being used

switch_count

number of switches at leaf level.

Creating the job placement file

Ranking basics

TraceR requires two sets of mapping files (with some what redundant information). Both types of files provide information about mapping of global rank to jobs and their local rank. Global rank of a server/core is the logical rank that server LPs get inside CODES. It increases linearly for servers/cores connected from one switch to another. Due to the way default server to node mapping works within CODES, if more than one node is connected to a switch, servers/cores are distributed in a cyclic manner.

Consider this example config file:

MODELNET_GRP
{
    repetitions="8";
    server="4";
    modelnet_dragonfly="4";
    modelnet_dragonfly_router="1";
}

Servers residing in nodes connected to the first router get global ranks 0-3, nodes connected to the second router get global ranks 4-7, and so on.

Now, consider another case:

MODELNET_GRP
{
    repetitions="8";
    server="8";
    modelnet_dragonfly="4";
    modelnet_dragonfly_router="1";
}

Servers residing in nodes connected to the first router get global ranks 0-7, nodes connected to the second router get global ranks 8-15, and so on. However, there are 8 servers but only 4 nodes, so each node hosts 2 servers. The servers are distributed in a cyclic manner within a router, i.e. in router 0, server 0 is on node 0, 1 is on node 1, 2 is on node 2, 3 is on node 3, 4 is on node 0, 5 is on node 1, 6 is on node 2, and 7 is on node 3. Similar cyclic distribution is done within every switch.

Map file requirements

Map files are divided into two sets: global map files and individual job files. The global file specifies how the global ranks are mapped to individual jobs and ranks within those jobs. It is a binary file structured as sets of 3 integers: <global rank> <local rank> <job_id>. A typical write routine looks like this:

for(...)
    fwrite(&global_rank, sizeof(int), 1, binout);
    fwrite(&local_rank, sizeof(int), 1, binout);
    fwrite(&jobid, sizeof(int), 1, binout);
endfor

For each job, individual job map files are needed. A map file for a job is also a binary file filled with a series of global ranks. The global ranks are ordered by using the local ranks as the key. So, if the series of integers is loaded into an array called local_to_global, local_to_global[i] will contain the global rank of local rank i.

Job mappers

In the utils subfolder of the TraceR repository, there are several job mappers written in C that can be used to generate job map files with various layouts. Eventually these will likely be rewritten as a Python script. A brief summary of the generators provided follows.

def_lin_mapping.C

Generates a linear mapping which is also the default mapping when no mapping is specified. If nodes per router is more than 1, then this mapping will spread the ranks in a round-robin fashion among the nodes.

node_mapping.C

Generates a mapping that always places servers with contiguous global ranks on a node. That is, if there are 2 servers per node, ranks 0-1 are on node 0, ranks 2-3 are on node 1, and so on.

multi_job.C

Router based various schemes for mapping.

many_job.C

Nodes based various schemes for mapping.

Commands for execution

./def_lin_mapping <global_map_file> <space separated #ranks in each job>

./node_mapping <global_map_file> <total_ranks in the job> <nodes per router> <cores per node> [optional <nodes with router to skip after>]

The output from these commands will be a global map file, and job{0,1..} files in binary format.

Example:

./def_lin_mapping global.bin 32 32 64

The above command generates global.bin with 128 ranks, where the first 32 are mapped to job0, the next 32 to job1, and last 64 to job2. It also generates job0, job1, and job2 that maps ranks from these jobs to global ranks.

Generating Traces

Score-P

Installation of Score-P

  1. Download from http://www.vi-hps.org/projects/score-p/

  2. tar -xvzf scorep-5.0.tar.gz

  3. cd scorep-5.0

  4. CC=mpicc CFLAGS=”-O2” CXX=mpicxx CXXFLAGS=”-O2” FC=mpif77 ./configure –without-gui –prefix=<SCOREP_INSTALL>

  5. make

  6. make install

Generating OTF2 traces with an MPI program using Score-P

Detailed instructions are available at https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf.

  1. Add $SCOREP_INSTALL/bin to your PATH for convenience. Example:

    export SCOREP_INSTALL=$HOME/workspace/scoreP/scorep-5.0/install
    export PATH=$SCOREP_INSTALL/bin:$PATH
    
  2. Add the following compile time flags to the application:

    -I$SCOREP_INSTALL/include -I$SCOREP_INSTALL/include/scorep -DSCOREP_USER_ENABLE
    
  3. Add #include <scorep/SCOREP_User.h> to all files where you plan to add any of the following Score-P calls (optional step):

    SCOREP_RECORDING_OFF(); - stop recording
    SCOREP_RECORDING_ON(); - start recording
    

Marking special regions: SCOREP_USER_REGION_BY_NAME_BEGIN(regionname, SCOREP_USER_REGION_TYPE_COMMON) and SCOREP_USER_REGION_BY_NAME_END(regionname).

Region names beginning with TRACER_WallTime_ are special: using TRACER_WallTime_<any_name> prints current time during simulation with tag <any_name>.

An example using these features is given below:

#include <scorep/SCOREP_User.h>
...
int main(int argc, char **argv, char **envp)
{
  MPI_Init(&argc,&argv);
  SCOREP_RECORDING_OFF(); //turn recording off for initialization/regions not of interest
  ...
  SCOREP_RECORDING_ON();

  //use verbatim to facilitate looping over the traces in simulation when simulating multiple jobs
  SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_Loop", SCOREP_USER_REGION_TYPE_COMMON); 
  // at least add this BEGIN timer call - called from only one rank
  // you can add more calls later with region names TRACER_WallTime_<any string of your choice>

  if(myRank == 0)
    SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop", SCOREP_USER_REGION_TYPE_COMMON);

  // Application main work LOOP
  for ( int itscf = 0; itscf < nitscf_; itscf++ )
  {
    // time call to mark start of loop iteration
    SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop_Iter", SCOREP_USER_REGION_TYPE_COMMON);
    ...
    SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop_Iter");
  }

  // time call to mark END of work - called from only one rank
  if(myRank == 0)
    SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop");

  // use verbatim - mark end of trace loop
  SCOREP_USER_REGION_BY_NAME_END("TRACER_Loop"); 
  SCOREP_RECORDING_OFF();//turn off recording again
  ...
}
  1. For the link step, prefix the linker line with the following:

    LD = scorep --user --nocompiler --noopenmp --nopomp --nocuda --noopenacc --noopencl --nomemory <your_linker>
    
  2. For running, set:

    export SCOREP_ENABLE_TRACING=1
    export SCOREP_ENABLE_PROFILING=0
    export SCOREP_MPI_ENABLE_GROUPS=ENV,P2P,COLL,XNONBLOCK
    

If Score-P prints a warning about flushing traces during the run, you may avoid them using:

export SCOREP_TOTAL_MEMORY=256M
export SCOREP_EXPERIMENT_DIRECTORY=/p/lscratchd/<username>/...

Note

For larger simulations, performance can get slow. There is a patch for Score-P 5.0 that adds an option to reduce the number of MPI Probes. After applying the patch, it can be enabled like the other Score-P options with export SCOREP_REDUCE_PROBE_TEST=1.

  1. Run the binary and traces should be generated in a folder named scorep-*.

BigSim

Installation of BigSim

Compile BigSim/Charm++ for emulation (see the BigSim manual for more detail). Use any one of the following commands:

  • To use UDP as BigSim/Charm++’s communication layer:

    ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing
    ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing
    

    Or explicitly provide the compiler optimization level:

    ./build bgampi net-linux-x86_64 bigemulator -O2
    
  • To use MPI as BigSim/Charm++’s communication layer:

    ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing
    

Note

This build is used to compile MPI applications so that traces can be generated. Hence, the communication layer used by BigSim/Charm++ is not important. During simulation, the communication will be replayed using the network simulator from CODES. However, the computation time captured here can be important if it is not being explicitly replaced at simulation time using configuration options. So using appropriate compiler flags is important.

Generating AMPI traces with an MPI program using BigSim

  1. Compile your MPI application using BigSim/Charm++.

Example commands:

$CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c
$CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx
  1. Emulation to generate traces. When the binary generated is run, BigSim/Charm++ runs the program on the allocated cores as if it were running as usual. Users should provide a few additional arguments to specify the number of MPI processes in the prototype systems.

If using UDP as the BigSim/Charm++’s communication layer:

./charmrun +p<number of real processes> ++nodelist <machine file> ./pgm <program arguments> +vp<number of processes expected on the future system> +x<x dim> +y<y dim> +z<z dim> +bglog

If using MPI as the BigSim/Charm++’s communication layer:

mpirun -n<number of real processes> ./pgm <program arguments> +vp<number of processes expected on the future system> +x<x dim> +y<y dim> +z<z dim> +bglog

Number of real processes is typically equal to the number cores the emulation is being run on.

machine file is the list of systems the emulation should be run on (similar to machine file for MPI; refer to Charm++ website for more details).

vp is the number of MPI ranks that are to be emulated. For simple tests, it can be the same as the number of real processes, in which case one MPI rank is run on each real process (as it happens when a regular program is run). When the number of vp (virtual processes) is higher, BigSim launches user level threads to execute multiple MPI ranks within a process.

+x +y +z defines a 3D grid of the virtual processes. The product of these three dimensions must match the number of vp’s. These arguments do not have any effect on the emulation, but exist due to historical reasons.

+bglog instructs bigsim to write the logs to files.

  1. When this run is finished, you should see many files named bgTrace* in the directory. The total number of such files equals the number of real processes plus one. Their names are bgTrace, bgTrace0, bgTrace1, and so on. Create a new folder and move all bgTrace files to that folder.