Parallel Execution¶
Sitar supports shared-memory parallel simulation via OpenMP. The two-phase execution model maps naturally onto a parallel loop: all modules are independent within each phase (no module reads what another writes in the same phase), so every module in a phase can run on a separate thread. A single barrier between phases is all that is needed for correctness.
This page covers how to enable parallel execution, how to measure speedup, and how to customize the mapping of modules to threads.
How Parallelism Works in Sitar¶
The default simulation loop (in sitar_default_main.cpp) runs as follows in parallel mode:
for each (cycle, phase):
    #pragma omp for       -- each thread runs a subset of modules
    #pragma omp barrier   -- all threads synchronize before next phase
The flattenHierarchy function collects all modules into a flat list. OpenMP distributes that list across threads using a static schedule (contiguous blocks by default). The barrier after each phase enforces the read/write discipline: no module begins the next phase until every module has completed the current one.
Because modules are independent within a phase by construction (the two-phase rule prohibits same-phase read-write conflicts), no locks or shared state are needed inside the loop.
A Simple Example¶
The following model has four modules connected in a clique. Each module burns approximately 1 ms of CPU time per phase using a busy-wait loop, and sends a token to a randomly chosen neighbour every COMM_INTERVAL cycles.
The communication structure:
flowchart LR
subgraph TOP
subgraph sys["sys (System)"]
A["a (Node)"]
B["b (Node)"]
C["c (Node)"]
D["d (Node)"]
A <-->|"ab / ba"| B
A <-->|"ac / ca"| C
A <-->|"ad / da"| D
B <-->|"bc / cb"| C
B <-->|"bd / db"| D
C <-->|"cd / dc"| D
end
end
Compiling and Running¶
Compile without OpenMP for a serial baseline:
Then compile with OpenMP and compare:
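As a hedged sketch (the file names `clique.cpp` and the build layout here are illustrative, not taken from the Sitar distribution), the two builds differ only in the -fopenmp flag, which activates the pragmas in the default main:

```sh
# Serial baseline: without -fopenmp the OpenMP pragmas are ignored
g++ -O2 -o clique_serial clique.cpp sitar_default_main.cpp
time ./clique_serial

# Parallel build: -fopenmp enables the parallel loop and barriers
g++ -O2 -fopenmp -o clique_parallel clique.cpp sitar_default_main.cpp
time ./clique_parallel
```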
The time command reports wall-clock elapsed time. With 4 modules each doing ~2 ms of work per cycle (1 ms per phase), the serial run takes approximately 20 cycles x 4 modules x 2 ms = 160 ms. With 4 threads you should see close to 4x speedup, approaching 40 ms.
Setting the Number of Threads¶
The number of threads is controlled by the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=1 # effectively serial
export OMP_NUM_THREADS=2
export OMP_NUM_THREADS=4 # one thread per module for this example
Set OMP_NUM_THREADS to the number of modules (or a divisor of it) for best load balance with the default static schedule.
Customizing Module-to-Thread Mapping¶
By default, flattenHierarchy collects every module in the hierarchy — including container modules that have no ongoing behavior — and distributes them round-robin across threads. For most models this is fine, but for large regular structures (such as an N×M mesh) it is more efficient to run only the leaf compute modules in parallel and leave structural container modules out of the list entirely.
To do this, supply a custom main.cpp at compile time:
Selecting specific modules by name¶
For small models with named submodules, build the list explicitly:
vector<sitar::module*> modules_to_run;
modules_to_run.push_back(&TOP->sys.a);
modules_to_run.push_back(&TOP->sys.b);
modules_to_run.push_back(&TOP->sys.c);
modules_to_run.push_back(&TOP->sys.d);
With OMP_NUM_THREADS=2 and schedule(static), OpenMP assigns the first half of the list to thread 0 and the second half to thread 1.
Selecting all children of a parent (for arrays)¶
For models that use submodule_array — where individual instances cannot be named explicitly in C++ — iterate over the parent module's _submodules map instead:
void buildModuleList(vector<sitar::module*>* list, sitar::module* parent)
{
for (auto it = parent->_submodules.begin();
it != parent->_submodules.end(); ++it)
list->push_back(it->second);
}
// In main, after instantiating TOP:
vector<sitar::module*> modules_to_run;
buildModuleList(&modules_to_run, &(TOP->system));
This adds all direct children of system (i.e., all node[i][j] instances in a mesh) without naming them individually. TOP and system are left out of the list; they run implicitly via runHierarchical in serial mode, or are simply unused if they have no ongoing behavior.
For a two-level hierarchy (e.g., the children of system are themselves arrays), call buildModuleList recursively or iterate two levels deep as needed.
Important Considerations¶
Logging in Parallel Mode¶
In parallel execution, multiple modules run concurrently. Writing to a shared output stream (such as std::cout) from multiple threads simultaneously will interleave log lines unpredictably. Sitar handles this by assigning each module its own log file in parallel mode:
string log_name = modules_to_run[i]->hierarchicalId() + "_log.txt";
logstreams[i]->open(log_name.c_str());
modules_to_run[i]->log.setOstream(logstreams[i]);
This produces one log file per module (e.g. TOP.sys.a_log.txt, TOP.sys.b_log.txt, etc.), each written exclusively by one module. The files can be inspected individually or merged and sorted by timestamp after simulation.
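Assuming each log line begins with a numeric timestamp (an assumption about your log format, not something Sitar enforces), the per-module files can be merged after the run with a numeric sort:

```sh
sort -n TOP.sys.*_log.txt > merged_log.txt
```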
Warning
Never share a single output stream across modules in parallel mode. Even if each individual << call is atomic, multi-field log lines will interleave across threads, producing unreadable output.
Random Number Generation¶
If your modules use random number generation, each module must use its own independent random number generator. Sharing a single generator across threads without locking causes data races and non-deterministic (and incorrect) results.
The recommended pattern is to declare a generator as a member of each module and seed it uniquely before simulation starts:
Then in the main, before the parallel loop, assign a unique seed to each module:
Inside the module behavior, use seed (combined with this_cycle for additional variation if needed) to initialize a generator owned by that module; avoid the global srand / rand pair, whose hidden state is shared by every thread:
Warning
Do not use a global srand call or a shared rand() in parallel simulation. Each execution thread must have its own generator state, seeded independently.
What's Next¶
Return to the Language and Examples section to learn the full Sitar modeling language, or jump directly to Advanced Examples for complete working models.