Network Models¶
This guide will give an overview of some of the network models supported by TraceR, as presented in the HOTI 25 tutorial (slides 22-39). For a more detailed guide, see the CODES wiki pages on network models at https://github.com/codes-org/codes/wiki/codes-networks. Any commands/examples in this section are referring to files included in the CODES git repository (not TraceR).
Overview¶
Multiple network models are supported, including dragonfly, fat
tree, express mesh, hyperX, torus, slim fly, and LogP. An abstraction
layer, model-net, sits on top of network models that breaks
messages into packets and offers FIFO, round robin, and priority
queues. To try different networks, simply switch the network configuration
files used when running TraceR. Storage models, MPI simulation, and
workload replay layers are independent of the underlying network
model used.
Simplenet¶
The Simplenet model uses a latency/bandwidth model where messages are sent directly from the source to the destination. It uses infinite queueing, and is easy to setup - a startup delay and link bandwidth are used for configuration. This model is mostly for debugging and testing purposes and can be used as a starting point when replaying MPI traces. It can be used as a baseline network model with no contention and no routing.
Configuring¶
Consider this Simplenet configuration file that can be found in codes/tests/conf/modelnet-test.conf:
LPGROUPS
{
MODELNET_GRP
{
repetitions="16";
server="1";
modelnet_simplenet="1";
}
}
PARAMS
{
packet_size="512";
message_size="384";
modelnet_order=( "simplenet" );
# scheduler options
modelnet_scheduler="fcfs";
net_startup_ns="1.5";
# bandwidth is in MiB/s
net_bw_mbps="20000";
}
The MODELNET_GRP section is used for mapping entities to ROSS MPI processes.
Messages are broken into packets by the model-net layer,
with a size that can be set by the packet_size param.
The message_size parameter is a ROSS specific parameter
that is used to set the event size.
net_startup_ns sets the startup delay in nanoseconds.
net_bw_mbps sets the link bandwidth in MB/s between nodes.
There is one link between each pair of nodes.
Running¶
The model shown above can be run in CODES with:
./tests/modelnet-test --sync=1 -- tests/conf/modelnet-test.conf
The command runs a simple test in which a simulated MPI rank sends a message to the next rank, which replies back. This continues until a certain number of messages is reached.
Dragonfly¶
The dragonfly network model has a hierarchy with a set of groups connected with all-to-all links. Within a group there can be several routers connected with local links, and routers can have links to routers in other groups for intergroup connections. Routers will also have compute nodes connected to to them. The CODES wiki explains dragonfly networks in much greater detail, and the slides from the HOTI 25 tutorial have images showing examples of possible dragonfly networks.
Dragonfly networks support minimal, adaptive, non-minimal, and progressive adaptive routing. They use packet based simulation with credit based flow control, and use multiple virtual channels for deadlock prevention.
Configuring¶
Consider this example configuration that can be found with the CODES source, codes/src/network-workloads/dragonfly-custom:
LPGROUPS
{
MODELNET_GRP
{
repetitions="2400";
# name of this lp changes according to the model
nw-lp="4";
# these lp names will be the same for dragonfly custom model
modelnet_dragonfly_custom="4";
modelnet_dragonfly_custom_router="1";
}
}
nw-lp is a simulated MPI process. For simulating multiple MPI
processes per node, set this to the number of processes times the
number of network nodes.
modelnet_dragonfly_custom is a simulated dragonfly network node.
modelnet_dragonfly_custom_router is a simulated dragonfly network router.
Self messages are messages sent to the same network node. The overhead for sending self messages can be configured.
Continuing in the same configuration file, look at the PARAM section:
PARAMS
{
# packet size in the network
packet_size="4096";
modelnet_order=( "dragonfly_custom","dragonfly_custom_router" );
# scheduler options
modelnet_scheduler="fcfs";
# chunk size in the network (when chunk size = packet size, packets will not be
# divided into chunks)
chunk_size="4096";
# number of routers within each group
# this is dictated by the dragonfly configuration files
num_router_rows="6";
# number of router columns
num_router_cols="16";
# number of groups in the network
num_groups="25";
# buffer size in bytes for local virtual channels
local_vc_size="8192";
# buffer size in bytes for global virtual channels
global_vc_size="16384";
# buffer size in bytes for compute node virtual channels
cn_vc_size="8192";
# bandwidth in GiB/s for local channels
local_bandwidth="5.25";
# bandwidth in GiB/s for global channels
global_bandwidth="4.69";
# bandwidth in GiB/s for compute node-router channels
cn_bandwidth="16.0";
# ROSS message size
message_size="592";
# number of compute nodes connected to router, dictated by dragonfly configuration
# file
num_cns_per_router="4";
# number of global channels per router
num_global_channels="4";
# network config file for intra-group connections
intra-group-connections="../src/network-workloads/conf/dragonfly-custom/intra-9K-custom";
# network config file for inter-group connections
inter-group-connections="../src/network-workloads/conf/dragonfly-custom/inter-9K-custom";
# routing protocol to be used
routing="prog-adaptive";
}
num_router_rows and num_router_cols control the router arrangement within a group
and should match the input network configuration.
local_vc_size, global_vc_size, and cn_vc_size are used to configure the buffer
size of virtual channels.
num_cns_per_router is used to set the number of compute nodes per router.
intra-group-connections and inter-group-connections are set to network configuration
files that can be custom generated (see scripts/gen-cray-topo/README.txt).
Running¶
To run a dragonfly network simulation, try the following:
Download the traces:
wget https://portal.nersc.gov/project/CAL/doe-miniapps-mpi-traces/AMG/df_AMG_n1728_dumpi.tar.global_vc_size
Run the simulation:
./src/network-workloads/model-net-mpi-replay --sync=1 --disable_compute=1 --workload_type="dumpi" --workload_file=df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- --num_net_traces=1728 -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-edison.conf
Fat Tree¶
The Fat Tree network model can simulate two and three level fat tree networks. The width of the tree (number of pods) can also be configured. Two forms of routing are supported; static which uses destination-based look-up tables, and adaptive which selects the least congested output port. The simulation is packet-based with credit-based flow control.
Tapering can be used in a fat tree network configuration to connect more nodes to leaf switches, which reduces the bandwidth, switches, and links at a higher level.
To get higher bandwidth, nodes can connect to multiple ports (multi-rail) in one or more plane (multi-plane). These configurations can also be tapered to reduce switches and links at higher levels.
The model supports configurations for multiple rails, multiple plane, and tapering.
Configuring¶
Consider the first part of this configuration file:
LPGROUPS
{
MODELNET_GRP
{
repetitions="198";
nw-lp="144";
modelnet_fattree="18";
fattree_switch="3";
}
}
nw-lp is a simulated MPI process.
modelnet_fattree is a simulated fat tree network node.
fattree_switch sets the number of simulated fat tree network
switches. In the above example it is set to 3 (one in each level
of the network).
Now, consider the next section in the configuration file:
PARAMS
{
packet_size="4096";
message_size="624";
chunk_size="4096";
modelnet_scheduler="fcfs"
modelnet_order=( "fattree" );
ft_type="0";
num_levels="3";
switch_count="198";
switch_radix="36";
vc_size="65536";
cn_vc_size="65536";
link_bandwidth="12.5";
cn_bandwidth="12.5";
routing="static";
routing_folder="/Fat-Tree/summit";
dot_file="summit-3564"
dump_topo="0";
}
The switch arrangement set with ft_type, num_levels, and switch_count
should match the input network configuration.
switch_radix can be configured.
Static routing requires precomputed destination routing tables, for details see https://xgitlab.cels.anl.gov/codes/codes/wikis/codes-fattree#enabling-static-routing.
Slim Fly¶
The Slim Fly network model has a topology of interconnected router groups build with MMS graphs. The maximum network diameter is always 2. It uses a packet-based simulation with credit-based flow control. The forms of routing supported are minimal with 2 virtual channels, non-minimal with 4 virtual channels, and adaptive with 4 virtual channels.
Configuring¶
Consider the params section for a slim fly network configuration file:
PARAMS
{
packet_size="4096";
chunk_size="4096";
message_size="592";
modelnet_order=( "slimfly" );
modelnet_scheduler="fcfs";
num_routers="13";
num_terminals="9";
global_channels="13";
local_channels="6";
generator_set_x=("1","10","9","12","3","4");
generator_set_x_prime=("6","8","2","7","5","11");
local_vc_size="25600";
global_vc_size="25600";
cn_vc_size="25600";
local_bandwidth="12.5";
global_bandwidth="12.5";
cn_bandwidth="12.5";
routing="minimal";
num_vcs="4";
}
num_routers, num_terminals, global_channels, and local_channels can
be used to confiure the router arrangement within a group.
Generator sets are a set of indices used to calculate connections between routers
in the same subgraph. They must be precomputed. The params generator_set_x and
generator_set_x_prime are set based on the precomputed indices.
Torus¶
A torus network is based on a n-dimensional k-ary network topology. The number of torus dimensions and length of each dimension can be configured. The network model supports dimension order routing.
Express Mesh and HyperX¶
The express mesh topology is low-diameter densely connected grids. The model alllows for specifying the connection gap. A gap of 1 is a HyperX network. A bubble escape virtual channel is used for deadlock prevention.
Interpreting Simulation Output¶
Using the example run on the dragonfly network given above, we ge tthe following output:
Total GVT Computations 0
Total All Reduce Calls 0
Average Reduction/GVT nan
Total bytes sent 13584368 recvd 13584368
max runtime 449332.124035 ns avg runtime 443706.882419
max comm time 449332.124035 avg comm time 443706.882419
max send time 5142770.436275 avg send time 2779472.247926
max recv time 4149449.596308 avg recv time 2335071.940672
max wait time 432820.362362 avg wait time 430457.043452
_P-IO: writing output to dragonfly-simple-33405-1499374633/
_P-IO: data files:
dragonfly-simple-33488-1499374633/dragonfly-router-traffic
dragonfly-simple-33488-1499374633/dragonfly-router-net_stats
dragonfly-simple-33488-1499374633/dragonfly-msg-stats
dragonfly-simple-33488-1499374633/model-net-category-all
dragonfly-simple-33488-1499374633/model-net-category-test
dragonfly-simple-33488-1499374633/mpi-replay-stats
Average number of hops traversed 1.709869 average chunk latency 0.925252 us maximum chunk latency 9.312357 us avg message size 812.563110 bytes finished messages 16820 finished chunks 65012
ADAPTIVE ROUTING STATS 65012 chunks routed minimally 0 chunks routed non-minimally completed packets 65012
Total packets generated 39722 finished 39722
As shown in the sample output, average and maximum times are reported for all application runs with statistics on time spent in overall execution, communication, wait operations, amount of data transferred, and so on.
Enabling lp-io-dir generates detailed network statistics files. The network statistics (hops traversed, latency, routing, etc) are reported for the entire network.
Detailed statistics for each MPI rank, network node, router, and port are generated using the lp-io-dir option.
--lp-io-dir=my-dir can be used to enable statistics generation (each lp
writes its statistics to a summary file).
Statistics Reported by LP-IO¶
Dragonfly-msg-stats has the number of hops, packet latency, packets
sent/received, and link saturation time reported for each network node.
Dragonfly-router-stats has the link saturation time for each router port.
Dragonfly-router-traffic has the traffic sent for each router port.
Fat tree and slim fly networks have similar statistics files.
Mpi-replay-stats (generated for any network model) has the bytes
sent/received per MPI process, the time spent in communication per
MPI process, and the number of sends and receives per MPI process.