Four Steps to Simulations

Creating a network simulation can be broken down into four steps:

1. Prototype the system design

An overview of setup using network parameters was given in the Network Models guide.

2. Workload selection

Two types of workloads can be used in a simulation: synthetic workloads and HPC application traces.

Synthetic Workloads

Synthetic workloads follow specific communication patterns with a constant injection rate. They are often used to stress the network topology and identify best- and worst-case performance. Examples include uniform random, all to all, nearest neighbor, bisection pairing, and permutation traffic (for example, bit permutation). These workloads don’t require simulation of MPI operations. They can also be used to generate background traffic that simulates the interference an application trace experiences on a production HPC system where a significant fraction of the network nodes is occupied.

Uniform Random: A network node is equally likely to send to any other network node (traffic distributed throughout the network).

All to All: Each network node communicates with all other network nodes.

Nearest Neighbor: A network node communicates with nearby network nodes (the ones that are a minimal number of hops away).

Permutation Traffic: Each source node sends all of its traffic to a single destination determined by a permutation matrix.

Bisection Pairing: Node 0 communicates with Node ‘n’, Node 1 with ‘n-1’, and so on.
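The patterns above can be sketched as destination-selection functions. This is a minimal illustration and not TraceR code: the function names are hypothetical, node IDs are assumed to run from 0 to num_nodes - 1, the nearest-neighbor function assumes a 1D ring (real neighbors depend on the topology), and the permutation is given as a plain list, which is one compact way to store a permutation matrix. The constant injection rate is modeled as one message per node every `interval` time units.

```python
import random


def uniform_random_dest(src, num_nodes, rng):
    """Pick any node other than src with equal probability."""
    dest = rng.randrange(num_nodes - 1)
    # Skip over src so a node never sends to itself.
    return dest if dest < src else dest + 1


def all_to_all_dests(src, num_nodes):
    """Every other node is a destination."""
    return [d for d in range(num_nodes) if d != src]


def nearest_neighbor_dests(src, num_nodes):
    """Assumption: nodes form a 1D ring, so neighbors are one hop left and right."""
    return [(src - 1) % num_nodes, (src + 1) % num_nodes]


def permutation_dest(src, perm):
    """perm[src] is the single fixed destination of src."""
    return perm[src]


def bisection_pair_dest(src, num_nodes):
    """Node 0 pairs with the last node, node 1 with the second to last, and so on."""
    return (num_nodes - 1) - src


def generate_messages(num_nodes, dest_fn, interval, steps):
    """Constant injection rate: each node injects one message every `interval`."""
    return [
        (step * interval, src, dest_fn(src))
        for step in range(steps)
        for src in range(num_nodes)
    ]


num_nodes = 8
# Example permutation (an assumption): shift every node ID by half the node count.
shift_perm = [(i + num_nodes // 2) % num_nodes for i in range(num_nodes)]
messages = generate_messages(
    num_nodes, lambda s: bisection_pair_dest(s, num_nodes), interval=0.5, steps=2
)
```

Here `messages` holds (time, source, destination) tuples for two injection rounds of bisection pairing; swapping the `dest_fn` argument selects a different pattern.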

HPC Application Traces

Application traces are captured by running an MPI program. They are useful for predicting the network performance of production HPC applications. Traces can be large for long-running or communication-intensive applications, but they have the potential to capture the interplay between computation and communication. These workloads require accurate simulation of MPI operations, and the simulation results can be complex to analyze.

3. Workload creation

A workload can be created by capturing application traces from a running MPI program. Options for capturing a trace include DUMPI, Score-P (which writes OTF2 traces), and BigSim.

Information in a Typical Trace

A typical trace of an MPI program (captured, for example, in DUMPI, OTF2, or BigSim format) records the operations that occur at different times, together with the critical information for each operation. The table below gives an example of a typical trace.

| Time stamp, t (rounded off) | Operation type | Operation data (only critical information is highlighted) |
| t = 10 | MPI_Bcast | root, size of bcast, communicator |
| t = 10.5 | MPI_Irecv | source, tag, communicator, req ID |
| t = 10.51 | user_computation | optional region name - “boundary updates” |
| t = 12.51 | MPI_Isend | dest, tag, communicator, req ID |
| t = 12.53 | user_computation | optional region name - “core updates” |
| t = 22.53 | MPI_Waitall | req IDs |
| t = 25 | MPI_Barrier | communicator |
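As a rough illustration, the example trace can be held in memory as a list of records. The `TraceRecord` class and its field names are hypothetical and do not correspond to the DUMPI, OTF2, or BigSim formats. The sketch also assumes that each operation lasts until the next time stamp, which lets the duration of every operation except the last be derived from the time stamps alone.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    time: float   # time stamp, rounded as in the table
    op: str       # operation type
    data: tuple   # names of the critical fields recorded for the operation


trace = [
    TraceRecord(10.0, "MPI_Bcast", ("root", "size of bcast", "communicator")),
    TraceRecord(10.5, "MPI_Irecv", ("source", "tag", "communicator", "req ID")),
    TraceRecord(10.51, "user_computation", ("region name: boundary updates",)),
    TraceRecord(12.51, "MPI_Isend", ("dest", "tag", "communicator", "req ID")),
    TraceRecord(12.53, "user_computation", ("region name: core updates",)),
    TraceRecord(22.53, "MPI_Waitall", ("req IDs",)),
    TraceRecord(25.0, "MPI_Barrier", ("communicator",)),
]

# Assumption: an operation runs until the next record starts.
# Rounding to two decimals removes floating-point noise from the subtraction.
durations = [
    round(later.time - earlier.time, 2)
    for earlier, later in zip(trace, trace[1:])
]
```

The duration of the final MPI_Barrier cannot be recovered this way, because no later time stamp follows it.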

Effect of Replaying Traces

As shown in the table below, replaying a trace can produce results that differ from the original run, because a different configuration can make operations take more or less time. In the first and second-to-last table entries, the MPI_Bcast and MPI_Waitall operations are faster in the replayed trace, so subsequent operations happen earlier than when the trace was captured. The final MPI_Barrier, by contrast, takes longer in the replay (1.7 instead of 1).

| Original time stamps | Original duration | New time stamps | New duration | Operation type |
| 10 | 0.5 | 10 | 0.2 | MPI_Bcast |
| 10.5 | 0.01 | 10.2 | 0.01 | MPI_Irecv |
| 10.51 | 2 | 10.21 | 2 | user_computation |
| 12.51 | 0.02 | 12.21 | 0.02 | MPI_Isend |
| 12.53 | 10 | 12.23 | 10 | user_computation |
| 22.53 | 2.47 | 22.23 | 0.03 | MPI_Waitall |
| 25 | 1 | 22.26 | 1.7 | MPI_Barrier |
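The shift in time stamps can be reproduced with a small calculation. This sketch assumes that the operations of one process run back to back, so each operation starts when the previous one ends. It is only an arithmetic illustration of the table; in a real replay the durations come out of the simulated network and the behavior of other processes. The function name `replay_timestamps` is hypothetical.

```python
def replay_timestamps(start_time, durations):
    """Return the start time of each operation when they run back to back."""
    times = []
    t = start_time
    for duration in durations:
        times.append(t)
        # Round to two decimals to match the precision used in the table.
        t = round(t + duration, 2)
    return times


ops = [
    "MPI_Bcast", "MPI_Irecv", "user_computation", "MPI_Isend",
    "user_computation", "MPI_Waitall", "MPI_Barrier",
]
original_durations = [0.5, 0.01, 2, 0.02, 10, 2.47, 1]
new_durations = [0.2, 0.01, 2, 0.02, 10, 0.03, 1.7]

original_times = replay_timestamps(10, original_durations)
new_times = replay_timestamps(10, new_durations)

for op, old, new in zip(ops, original_times, new_times):
    print(f"{op:18s} original t = {old:<6} replayed t = {new}")
```

Only the MPI_Bcast, MPI_Waitall, and MPI_Barrier durations differ between the two runs, yet every time stamp after the MPI_Bcast moves.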

In addition to the effect of the network configuration, the trace format itself can change the outcome of a simulation.

As an example, DUMPI traces store all the information passed to MPI calls. The simulation then decides which request to fulfill, which allows accurate resolution for the target system. However, if the ordering of operations can significantly change the control flow of the program, the simulation is not entirely correct, because the recorded trace reflects only the control flow of the original run.

On the other hand, OTF2 traces store only the information that is used (e.g. which request was satisfied). This accurately mimics the control flow of the trace run, but does not accurately represent execution for the target system.

These differences are artifacts of leveraging existing tools not originally intended for Parallel Discrete Event Simulation (PDES).
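The contrast between the two formats can be illustrated with a toy example. It assumes an operation that waits for any one of several outstanding requests (MPI_Waitany is used here as an assumed example; the text above does not name a specific call). The function names, request IDs, and completion times are all hypothetical, and the code does not reflect how TraceR or the trace libraries are implemented.

```python
def resolve_with_all_arguments(request_ids, simulated_completion):
    """DUMPI-style: the trace lists every request passed to the call,
    so the simulation picks the one that completes first on the target system."""
    return min(request_ids, key=lambda req: simulated_completion[req])


def resolve_with_recorded_outcome(recorded_request):
    """OTF2-style: the trace only says which request was satisfied in the
    original run, so the replay follows that outcome."""
    return recorded_request


# Hypothetical data: two outstanding requests.
request_ids = [101, 102]
# Completion times on the simulated target system (made-up values).
simulated_completion = {101: 14.0, 102: 12.5}
# In the original run, request 101 happened to complete first.
recorded_request = 101

dumpi_style_choice = resolve_with_all_arguments(request_ids, simulated_completion)
otf2_style_choice = resolve_with_recorded_outcome(recorded_request)
```

With these made-up numbers the two approaches disagree: the first follows the simulated target system and selects request 102, while the second follows the original run and selects request 101.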

4. Execution

The user guide Quickstart section shows the arguments taken by TraceR and some of the options available to control execution of a simulation.