Parallel Execution¶
Sitar supports shared-memory parallel simulation via OpenMP. The two-phase execution model maps naturally onto a parallel loop: all modules are independent within each phase (no module reads what another writes in the same phase), so every module in a phase can run on a separate thread. A single barrier between phases is all that is needed for correctness.
This page covers how to enable parallel execution, how to measure speedup, and how to customize the mapping of modules to threads.
How Parallelism Works in Sitar¶
The default simulation loop (in sitar_default_main.cpp) runs as follows in parallel mode:
for each (cycle, phase):
    #pragma omp for       -- each thread runs a subset of modules
    #pragma omp barrier   -- all threads synchronize before next phase
The flattenHierarchy function collects all modules into a flat list. OpenMP distributes that list across threads using a static schedule (contiguous blocks by default). The barrier after each phase enforces the read/write discipline: no module begins the next phase until every module has completed the current one.
Because modules are independent within a phase by construction (the two-phase rule prohibits same-phase read-write conflicts), no locks or shared state are needed inside the loop.
A Simple Example¶
The following model has four modules connected in a clique. Each module burns approximately 1 ms of CPU time per phase using a busy-wait loop, and sends a token to a randomly chosen neighbour every COMM_INTERVAL cycles.
The communication structure:
flowchart LR
subgraph TOP
subgraph sys["sys (System)"]
A["a (Node)"]
B["b (Node)"]
C["c (Node)"]
D["d (Node)"]
A <-->|"ab / ba"| B
A <-->|"ac / ca"| C
A <-->|"ad / da"| D
B <-->|"bc / cb"| C
B <-->|"bd / db"| D
C <-->|"cd / dc"| D
end
end
Compiling and Running¶
Compile without OpenMP for a serial baseline:
Then compile with OpenMP and compare:
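As a hedged sketch (the file names `clique.cpp` and the build layout here are illustrative, not taken from the Sitar distribution), the two builds differ only in the -fopenmp flag, which activates the pragmas in the default main:

```sh
# Serial baseline: without -fopenmp the OpenMP pragmas are ignored
g++ -O2 -o clique_serial clique.cpp sitar_default_main.cpp
time ./clique_serial

# Parallel build: -fopenmp enables the parallel loop and barriers
g++ -O2 -fopenmp -o clique_parallel clique.cpp sitar_default_main.cpp
time ./clique_parallel
```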
The time command reports wall-clock elapsed time. With 4 modules each doing ~2 ms of work per cycle (1 ms per phase), the serial run takes approximately 20 cycles x 4 modules x 2 ms = 160 ms. With 4 threads you should see close to 4x speedup, approaching 40 ms.
Setting the Number of Threads¶
The number of threads is controlled by the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=1 # effectively serial
export OMP_NUM_THREADS=2
export OMP_NUM_THREADS=4 # one thread per module for this example
Set OMP_NUM_THREADS to the number of modules (or a divisor of it) for best load balance with the default static schedule.
Customizing Module-to-Thread Mapping¶
By default, flattenHierarchy collects every module in the hierarchy — including container modules that have no ongoing behavior — and distributes them round-robin across threads. For most models this is fine, but for large regular structures (such as an N×M mesh) it is more efficient to run only the leaf compute modules in parallel and leave structural container modules out of the list entirely.
To do this, supply a custom main.cpp at compile time:
Selecting specific modules by name¶
For small models with named submodules, build the list explicitly:
vector<sitar::module*> modules_to_run;
modules_to_run.push_back(&TOP->sys.a);
modules_to_run.push_back(&TOP->sys.b);
modules_to_run.push_back(&TOP->sys.c);
modules_to_run.push_back(&TOP->sys.d);
With OMP_NUM_THREADS=2 and schedule(static), OpenMP assigns the first half of the list to thread 0 and the second half to thread 1.
Selecting all children of a parent (for arrays)¶
For models that use submodule_array — where individual instances cannot be named explicitly in C++ — iterate over the parent module's _submodules map instead:
void buildModuleList(vector<sitar::module*>* list, sitar::module* parent)
{
for (auto it = parent->_submodules.begin();
it != parent->_submodules.end(); ++it)
list->push_back(it->second);
}
// In main, after instantiating TOP:
vector<sitar::module*> modules_to_run;
buildModuleList(&modules_to_run, &(TOP->system));
This adds all direct children of system (i.e., all node[i][j] instances in a mesh) without naming them individually. TOP and system are left out of the list; they run implicitly via runHierarchical in serial mode, or are simply unused if they have no ongoing behavior.
For a two-level hierarchy (e.g., the children of system are themselves arrays), call buildModuleList recursively or iterate two levels deep as needed.
Important Considerations¶
Logging in Parallel Mode¶
In parallel execution, multiple modules run concurrently. Writing to a shared output stream (such as std::cout) from multiple threads simultaneously will interleave log lines unpredictably. Sitar handles this by assigning each module its own log file in parallel mode:
string log_name = modules_to_run[i]->hierarchicalId() + "_log.txt";
logstreams[i]->open(log_name.c_str());
modules_to_run[i]->log.setOstream(logstreams[i]);
This produces one log file per module (e.g. TOP.sys.a_log.txt, TOP.sys.b_log.txt, etc.), each written exclusively by one module. The files can be inspected individually or merged and sorted by timestamp after simulation.
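Assuming each log line begins with a numeric timestamp (an assumption about your log format, not something Sitar enforces), the per-module files can be merged after the run with a numeric sort:

```sh
sort -n TOP.sys.*_log.txt > merged_log.txt
```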
Warning
Never share a single output stream across modules in parallel mode. Even if each individual << call is atomic, multi-field log lines will interleave across threads, producing unreadable output.
Random Number Generation¶
If your modules use random number generation, each module must use its own independent random number generator. Sharing a single generator across threads without locking causes data races and non-deterministic (and incorrect) results.
The recommended pattern is to declare a generator as a member of each module and seed it uniquely before simulation starts:
Then in the main, before the parallel loop, assign a unique seed to each module:
Inside the module behavior, use seed (combined with this_cycle for additional variation if needed) to initialize a generator owned by that module; avoid the global srand / rand pair, whose hidden state is shared by every thread:
Warning
Do not use a global srand call or a shared rand() in parallel simulation. Each execution thread must have its own generator state, seeded independently.
What's Next¶
Return to the Language and Examples section to learn the full Sitar modeling language, or jump directly to Advanced Examples for complete working models.