Pipelined Processor

This example models a 4-stage pipelined processor that executes 2 hardware threads scheduled in round-robin.

Objectives

The purpose of this example is to demonstrate how to build a synchronous pipeline in Sitar, where data transfer between stages happens on the same clock edge with zero latency between adjacent stages. This cannot be achieved by modeling each stage as a separate module connected by nets, because every net incurs Sitar's minimum one-cycle communication latency, which would stretch the pipeline out in time.

Instead, a tightly coupled synchronous system such as this is modeled using a single module containing a parallel block with one procedure per stage. Stages communicate via shared C++ variables (not nets) owned by the parent module. Branches of a parallel block execute until convergence within a phase, and synchronization on simple shared variables is safe and deterministic as long as they reside within a single module.

What this example demonstrates:

  • Modeling a multi-stage pipeline in Sitar as a parallel block with procedures
  • Inter-stage communication via shared C++ variables guarded by valid bits
  • Zero-overhead thread interleaving (barrel-processor pattern)
  • Procedures accessing parent-module state via pointers, configured in the parent's init
  • A simple per-stage stop criterion based on an instruction-retired counter
  • Per-cycle tabular logging driven by a dedicated branch of the parallel block

This is a multi-file example. Full source:


Architecture

flowchart LR
    F["FETCH"] -->|"stage_inputs[1]"| D["DECODE"] -->|"stage_inputs[2]"| E["EXECUTE"] -->|"stage_inputs[3]"| W["WRITEBACK"]
  • 2 threads (thread 0 and thread 1) alternate each cycle in round-robin
  • 4 pipeline stages, each an instance of a single Stage procedure running in a parallel block
  • Information transfer between stages is via shared PipelineReg structs (fields: valid, thread_id, pc). Adjacent stages share a register: stage i's stage_output and stage i+1's stage_input point to the same shared variable
  • The upstream stage writes the register and sets valid=true; the downstream stage reads it and later sets valid=false (when it forwards the instruction along). This valid bit is the inter-stage handshake
  • Pointers to all shared state are bundled into a ThreadData struct; each stage owns one ThreadData value, whose pointers the parent wires up in its init
  • Each stage acquires an instruction into its stage_input, waits DELAY cycles to model processing, and then commits to the next stage's input register. The last stage (Writeback) has no downstream and simply retires the instruction
  • This models an elastic pipeline: each stage's DELAY is an independent parameter. With every DELAY=1 the pipeline runs at full throughput: a new instruction enters every cycle and, after the fill-up phase, one retires every cycle. With a larger delay on some stage, the valid-bit handshake stalls the stages upstream of it and throughput drops accordingly.

Dummy Stages

This example illustrates how to model the structure and timing of the pipeline. The actual functionality of each stage is not modeled. As a placeholder, each stage simply carries the thread_id and pc of the instruction, and the parent module logs one line per cycle showing each stage's current occupancy.


Shared data structures

The PipelineTypes.h header defines the two structs shared between the parent module and every Stage procedure instance:

#ifndef PIPELINE_TYPES_H
#define PIPELINE_TYPES_H

// A per-stage pipeline register. Holds the instruction metadata
// (thread_id, pc) that one stage hands off to the next, plus a
// valid bit that acts as the inter-stage handshake.
struct PipelineReg {
    bool valid;
    int  thread_id;
    int  pc;
};

// ThreadData bundles all per-stage pointers into parent-owned state.
// Each Stage procedure instance holds its own ThreadData, wired up
// by the parent module (Pipelined_Processor) in its init block.
struct ThreadData {
    int          num_threads;    // snapshot of parent's NUM_THREADS (for Fetch round-robin)
    int*         pc;             // -> parent's pc[num_threads] array
    int*         active_thread;  // -> parent's active_thread
    PipelineReg* stage_input;    // this stage's working register (currently processed instr)
    PipelineReg* stage_output;   // next stage's input register; nullptr on the last stage
};

#endif

PipelineReg is one pipeline register (the data that flows between two adjacent stages). ThreadData is the full set of pointers each stage needs: the shared thread state (num_threads, pc[], active_thread) and this stage's own input/output registers. Each Stage procedure instance owns one ThreadData value; the parent wires up its fields in the parent's init.
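The register sharing described above can be sketched in plain C++, outside Sitar. The helper name wire_chain and the use of std::vector are assumptions made for illustration; the example itself wires the four stages by hand in the parent's init. The sketch only shows that stage i's stage_output and stage i+1's stage_input end up pointing at the same PipelineReg.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as PipelineTypes.h, repeated so the sketch compiles on its own.
struct PipelineReg { bool valid; int thread_id; int pc; };
struct ThreadData {
    int          num_threads;
    int*         pc;
    int*         active_thread;
    PipelineReg* stage_input;
    PipelineReg* stage_output;
};

// Hypothetical helper (not part of the example): builds one ThreadData per
// register so that adjacent stages alias the same PipelineReg.
std::vector<ThreadData> wire_chain(std::vector<PipelineReg>& regs,
                                   std::vector<int>& pc,
                                   int& active_thread) {
    assert(!regs.empty());
    std::vector<ThreadData> td(regs.size());
    for (std::size_t i = 0; i < regs.size(); ++i) {
        td[i].num_threads   = static_cast<int>(pc.size());
        td[i].pc            = pc.data();
        td[i].active_thread = &active_thread;
        td[i].stage_input   = &regs[i];
        // The last stage has no downstream register.
        td[i].stage_output  = (i + 1 < regs.size()) ? &regs[i + 1] : nullptr;
    }
    return td;
}
```

Because the pointers alias, a write through one stage's stage_output is immediately visible through the next stage's stage_input, which is what lets the valid bit act as a handshake.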


Top

The Top module simply instantiates one processor, supplying the single template parameter NUM_THREADS:

module Top
    // NUM_THREADS=2 => two interleaved hardware threads.
    submodule proc : Pipelined_Processor<2>
end module

Pipelined_Processor module

The processor owns the shared state (per-thread PCs, active thread, all four pipeline registers) and instantiates four Stage procedure instances named fetch, decode, execute, writeback. Its init block sets the id, name, and ThreadData pointers on each stage, chains the registers to form the pipeline, and installs the stop criterion on writeback.

Init block and construction order

When sitar translate converts a module description to a C++ class, the content inside the init block gets placed inside its constructor. If a parent module instantiates a child submodule, the child's init gets executed before the parent's init, in accordance with the order in which C++ constructors get executed. Thus in a child module's init, default or initial values can be assigned to its variables, which can later be finalized/updated by the parent module's init, as illustrated by the following example.
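The underlying C++ rule can be shown with a minimal sketch. The class names ChildStage and ParentProcessor are hypothetical stand-ins, not the classes that sitar translate actually generates; the sketch only relies on the standard guarantee that member objects are constructed before the body of the enclosing class's constructor runs.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the class generated from a child procedure.
struct ChildStage {
    int         id;
    std::string name;
    int         stop_when_total_executed;

    // Plays the role of the child's init: assign safe defaults.
    ChildStage() : id(0), name(""), stop_when_total_executed(-1) {}
};

// Hypothetical stand-in for the class generated from the parent module.
struct ParentProcessor {
    ChildStage fetch;      // members are constructed first, in declaration order
    ChildStage writeback;

    // Plays the role of the parent's init: runs after both members exist.
    ParentProcessor() {
        // The child's default is already in place when this body starts.
        assert(writeback.stop_when_total_executed == -1);
        fetch.id = 0;      fetch.name = "Fetch";
        writeback.id = 3;  writeback.name = "Writeback";
        writeback.stop_when_total_executed = 10;   // parent overrides the default
    }
};
```

Constructing a ParentProcessor leaves fetch with the default threshold of -1 and writeback with 10, mirroring how the parent's init finalizes values that the child's init only defaulted.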

module Pipelined_Processor

    // Number of hardware threads. Fetch round-robins through them.
    parameter int NUM_THREADS = 2

    // NUM_STAGES is fixed at 4 (Fetch/Decode/Execute/Writeback).

    include 
    $
    #include "PipelineTypes.h"
    #include <iomanip>
    #include <sstream>
    $

    // Shared state owned by the processor and pointed to by every stage
    // via ThreadData.
    decl 
    $
    static const int NUM_STAGES = 4;

    // Per-thread program counter. Incremented by Fetch each time that
    // thread is selected.
    int pc[NUM_THREADS];

    // Index of the thread Fetch will grab next. Advances round-robin.
    int active_thread;

    // One pipeline register per stage. stage_inputs[i] is where stage i
    // holds the instruction it is currently working on. Adjacent stages
    // share a register: stage i's "stage_output" points at the same
    // memory as stage (i+1)'s "stage_input".
    //
    // The valid bit is the inter-stage handshake:
    //   - An upstream stage sets valid=true when it commits a new instr.
    //   - The downstream stage sets valid=false when it commits the
    //     instr onward (or retires it), which frees the slot for the
    //     upstream stage's next commit.
    PipelineReg stage_inputs[NUM_STAGES];

    // Human-readable stage names, used only in the log output.
    std::string stage_names[NUM_STAGES];
    $

    // The four stages are procedure instances. Each is parameterized by
    // its processing delay; all four use DELAY=1 here, but any value
    // >= 0 works (e.g. procedure execute : Stage<3>). The handshake
    // absorbs the change automatically.
    // Note: at least one of the stages must have a non-zero delay.
    procedure fetch     : Stage<1>
    procedure decode    : Stage<1>
    procedure execute   : Stage<1>
    procedure writeback : Stage<1>

    // Parent's init runs AFTER every child constructor (C++ member
    // construction order). That lets us override the defaults each
    // Stage procedure set for itself (id=0, name="", nullptr pointers,
    // stop_when_total_executed=-1).
    init 
    $
    // --- initialize shared state ------------------------------------
    for (int i = 0; i < NUM_THREADS; i++) pc[i] = 0;
    active_thread = 0;
    for (int i = 0; i < NUM_STAGES; i++) {
        stage_inputs[i].valid     = false;
        stage_inputs[i].thread_id = 0;
        stage_inputs[i].pc        = 0;
    }
    stage_names[0] = "Fetch";
    stage_names[1] = "Decode";
    stage_names[2] = "Execute";
    stage_names[3] = "Writeback";

    // --- wire every stage to the shared state -----------------------
    // All stages see the same pc[] array, active_thread, and
    // num_threads. Only Fetch actually uses these three, but passing
    // them uniformly keeps the wiring symmetric.
    fetch.td.num_threads     = NUM_THREADS;
    fetch.td.pc              = pc;
    fetch.td.active_thread   = &active_thread;
    decode.td.num_threads    = NUM_THREADS;
    decode.td.pc             = pc;
    decode.td.active_thread  = &active_thread;
    execute.td.num_threads   = NUM_THREADS;
    execute.td.pc            = pc;
    execute.td.active_thread = &active_thread;
    writeback.td.num_threads   = NUM_THREADS;
    writeback.td.pc            = pc;
    writeback.td.active_thread = &active_thread;

    // --- per-stage identity + register chain ------------------------
    // Each stage's stage_input is stage_inputs[id]; its stage_output is
    // stage_inputs[id+1], except the last stage (writeback) which has
    // no output register (nullptr) and retires instructions outright.
    fetch.id = 0;     fetch.name     = stage_names[0];
    fetch.td.stage_input   = &stage_inputs[0];
    fetch.td.stage_output  = &stage_inputs[1];

    decode.id = 1;    decode.name    = stage_names[1];
    decode.td.stage_input  = &stage_inputs[1];
    decode.td.stage_output = &stage_inputs[2];

    execute.id = 2;   execute.name   = stage_names[2];
    execute.td.stage_input  = &stage_inputs[2];
    execute.td.stage_output = &stage_inputs[3];

    writeback.id = 3; writeback.name = stage_names[3];
    writeback.td.stage_input  = &stage_inputs[3];
    writeback.td.stage_output = nullptr;

    // --- stop criterion ---------------------------------------------
    // Any stage can trigger `stop simulation` based on its own retired-
    // instruction count. Setting the threshold on writeback makes the
    // simulation stop once N instructions have fully traversed the
    // pipeline. Leaving the value at the procedure's default (-1)
    // disables the check for all other stages.
    writeback.stop_when_total_executed = 10;
    $

    // The entire pipeline is one parallel block: four stage branches
    // plus one logging branch. Within each phase, branches execute in
    // written order and the kernel re-runs them until all have
    // converged (hit a wait). This is what lets, e.g., fetch commit in
    // iteration 3 of a phase once decode invalidated its input in
    // iteration 2 -- i.e. one instr moves through all stages per
    // cycle.
    behavior
        [
            run fetch;
        ||
            run decode;
        ||
            run execute;
        ||
            run writeback;
        ||
            // Logging: one line per cycle, printed in PHASE 1 rather
            // than phase 0. Phase 1 sees the stable end-of-phase-0
            // state, after all convergence iterations have settled, so
            // every stage_input carries its currently-working
            // instruction. If we logged in phase 0, the snapshot would
            // catch the pipeline mid-handshake (some stages would show
            // `(---)` for one cycle as their input is momentarily
            // invalidated before the upstream commit completes in a
            // later iteration of the same phase).
            do
                wait until (this_phase == 1);
                $
                log << endl;
                for (int i = 0; i < NUM_STAGES; i++) {
                    std::ostringstream ss;
                    if (stage_inputs[i].valid)
                        ss << "(t=" << stage_inputs[i].thread_id
                           << ",pc=" << stage_inputs[i].pc << ")";
                    else
                        ss << "(---)";
                    log << "| " << std::setw(10) << std::left << stage_names[i]
                        << std::setw(11) << std::left << ss.str() << " ";
                }
                log << "|";
                $;
                wait;   // advance one phase -> next cycle's phase 0
            while (1) end do;
        ];
    end behavior
end module

Wiring pattern. stage_inputs[i] is stage i's working register. The currently processed instruction always sits there. Adjacent stages share a register:

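| Stage     | id | stage_input                    | stage_output      |
|-----------|----|--------------------------------|-------------------|
| Fetch     | 0  | &stage_inputs[0] (self-filled) | &stage_inputs[1]  |
| Decode    | 1  | &stage_inputs[1]               | &stage_inputs[2]  |
| Execute   | 2  | &stage_inputs[2]               | &stage_inputs[3]  |
| Writeback | 3  | &stage_inputs[3]               | nullptr (retires) |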

Behavior. The behavior block is a single parallel block with five branches: one run <stage>; per stage, plus a logging branch. The logging branch runs in phase 1, after all phase-0 stage activity has converged, so every stage_input.valid reflects the stable end-of-cycle pipeline state rather than an intermediate convergence state.


Stage procedure

A single Stage procedure is used by all four pipeline stages. Its body is a do-while loop with three steps: acquire, process and commit, plus bookkeeping.

// A single reusable procedure for every pipeline stage. Stage behavior
// is uniform except that the Fetch stage (id==0) has no upstream, so
// it self-fills its stage_input each iteration instead of waiting for
// a producer. The Writeback stage (id==3) has no downstream
// (stage_output==nullptr), so it retires instructions outright instead
// of committing to the next stage.
procedure Stage

    // How many cycles this stage takes to "process" one instruction.
    // Hardwired at instantiation (e.g. procedure execute : Stage<3>).
    parameter int DELAY = 1

    include $
    #include "PipelineTypes.h"
    $

    decl $
    // Identity of this stage within its parent processor. Set by the
    // parent's init after construction. id==0 means Fetch (self-fill),
    // other ids are regular stages.
    int         id;
    std::string name;

    // All pointers to parent-owned state: per-thread PCs, the
    // round-robin thread selector, and this stage's input/output
    // pipeline registers. See PipelineTypes.h.
    ThreadData  td;

    // Running count of instructions this stage has committed/retired.
    int         total_instr_executed;

    // If non-negative, the stage triggers `stop simulation` once its
    // counter reaches this value. Default -1 (disabled); the parent
    // sets a positive value on the last stage only.
    int         stop_when_total_executed;
    $
    init $
    id                       = 0;
    name                     = "";
    total_instr_executed     = 0;
    stop_when_total_executed = -1;
    td.num_threads           = 1;
    td.pc                    = nullptr;
    td.active_thread         = nullptr;
    td.stage_input           = nullptr;
    td.stage_output          = nullptr;
    $

    behavior
        do
            // -----------------------------------------------------------
            // Step 1: acquire an instruction in `stage_input`.
            //
            // Non-fetch stages wait for their upstream stage to set
            // stage_input->valid=true. The Fetch stage has no upstream,
            // so instead it waits for stage_input to be FREE (valid=0),
            // then writes the next thread's (thread_id, pc) into it and
            // advances the round-robin selector and that thread's PC.
            // -----------------------------------------------------------
            if (id == 0) then
                wait until (this_phase == 0 and $!td.stage_input->valid$);
                $
                int t = *td.active_thread;
                td.stage_input->thread_id = t;
                td.stage_input->pc        = td.pc[t];
                td.stage_input->valid     = true;
                td.pc[t]++;
                *td.active_thread = (t + 1) % td.num_threads;
                $;
            else
                wait until (this_phase == 0 and $td.stage_input->valid$);
            end if;

            // -----------------------------------------------------------
            // Step 2: model the stage's processing time.
            //
            // During these DELAY cycles stage_input stays valid, so the
            // logging branch in the parent sees this stage's currently
            // working instruction.
            // -----------------------------------------------------------
            wait(DELAY, 0);

            // -----------------------------------------------------------
            // Step 3: commit to the next stage, or retire (last stage).
            //
            // Normal stages wait for the downstream input slot to be
            // free (stage_output->valid==0), then copy the payload over
            // and flip both valid bits: stage_output becomes valid,
            // stage_input becomes free.
            //
            // The last stage has stage_output==nullptr; it just clears
            // its own stage_input valid bit.
            // -----------------------------------------------------------
            if ($td.stage_output != nullptr$) then
                wait until (this_phase == 0 and $!td.stage_output->valid$);
                $
                td.stage_output->thread_id = td.stage_input->thread_id;
                td.stage_output->pc        = td.stage_input->pc;
                td.stage_output->valid     = true;
                td.stage_input->valid      = false;
                $;
            else
                // Last stage: no downstream to forward to. Retire in
                // place by clearing the valid bit on our own input.
                wait until (this_phase == 0);
                $ td.stage_input->valid = false; $;
            end if;

            // -----------------------------------------------------------
            // Step 4: bookkeeping + stop check.
            // -----------------------------------------------------------
            $ total_instr_executed++; $;

            if ($stop_when_total_executed >= 0 && total_instr_executed >= stop_when_total_executed$) then
                $
                log << endl << name
                    << ": simulation stopped upon reaching stopping criteria, num executed="
                    << total_instr_executed;
                $;
                stop simulation;
            end if;
        while (1) end do;
    end behavior
end procedure

Acquire (step 1). Non-fetch stages wait in phase 0 for their upstream to deliver a valid input. The Fetch stage (id==0) has no upstream, so instead it waits for its input slot to be free, then self-fills it from pc[active_thread] and advances the round-robin thread selector.
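As a rough illustration of the round-robin self-fill, the following plain C++ sketch replays only the selector arithmetic from the Fetch branch. The function name fetch_sequence is hypothetical, and the sketch ignores stalls: it lists the (thread_id, pc) pairs in the order Fetch would issue them.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: returns the first `count` (thread_id, pc) pairs that
// Fetch issues, using the same update rule as the Stage procedure.
std::vector<std::pair<int, int>> fetch_sequence(int num_threads, int count) {
    assert(num_threads > 0 && count >= 0);
    std::vector<int> pc(num_threads, 0);   // per-thread program counters
    int active_thread = 0;                 // round-robin selector
    std::vector<std::pair<int, int>> issued;
    for (int i = 0; i < count; ++i) {
        int t = active_thread;
        issued.emplace_back(t, pc[t]);     // fill the register with (t, pc[t])
        pc[t]++;                           // advance that thread's PC
        active_thread = (t + 1) % num_threads;
    }
    return issued;
}
```

With num_threads = 2 the sequence begins (0,0), (1,0), (0,1), (1,1), (0,2): the two threads alternate and each PC advances independently.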

Process (step 2). wait(DELAY, 0) models the stage's processing time. Because stage_input.valid stays true throughout, the logging branch in the parent sees this stage's currently working instruction during every cycle of its processing window.

Commit (step 3). Normal stages wait for the downstream input slot to be free, then copy stage_input into stage_output and flip both valid bits atomically (within a single code block): stage_output.valid = true, stage_input.valid = false. Writeback has no downstream, so it simply clears its own stage_input.valid to retire the instruction.
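A single commit attempt can be sketched as an ordinary C++ function. The name try_commit is hypothetical; in the Sitar procedure the blocked case is not a return value but a wait until that holds the stage in place until the downstream slot frees up.

```cpp
#include <cassert>

// Same layout as PipelineTypes.h, repeated so the sketch compiles on its own.
struct PipelineReg { bool valid; int thread_id; int pc; };

// Hypothetical helper: one commit attempt by a stage holding `in`.
// Returns false when the downstream slot is occupied (the stage must stall).
bool try_commit(PipelineReg& in, PipelineReg* out) {
    assert(in.valid);                // only a stage holding an instruction commits
    if (out == nullptr) {            // last stage: retire in place
        in.valid = false;
        return true;
    }
    if (out->valid) return false;    // downstream still busy: nothing changes
    out->thread_id = in.thread_id;   // copy the payload
    out->pc        = in.pc;
    out->valid     = true;           // hand the instruction over...
    in.valid       = false;          // ...and free our own slot
    return true;
}
```

Both valid bits change inside one uninterrupted block, so no other branch can observe the instruction in two registers at once or in neither.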

Stop criterion (step 4). Each stage keeps a total_instr_executed counter. If stop_when_total_executed has been set to a non-negative value (the parent does this on writeback only), the stage logs a stop message and calls stop simulation once the threshold is reached. Other stages leave the default -1 and never trigger the stop.

Intra-cycle flow through all stages

A parallel block's branches are re-run round-robin until all have converged within a phase. Consider a steady-state cycle: writeback retires its instruction (invalidating stage_inputs[3]), which unblocks execute's commit in a later iteration of the same phase; that invalidates stage_inputs[2], unblocking decode; and finally fetch commits. Net effect: one instruction moves through every stage boundary per cycle, with no extra latency.
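The commit wave can be imitated in plain C++ as a loop that sweeps the stages in written order until a full sweep makes no progress. This is only a sketch under stated assumptions: commit_wave is a hypothetical name, it is not the Sitar kernel, every stage holding an instruction is assumed to have finished its DELAY, and Fetch's self-fill is left out. The sweep count it reports belongs to this model and need not match the kernel's iteration numbering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as PipelineTypes.h, repeated so the sketch compiles on its own.
struct PipelineReg { bool valid; int thread_id; int pc; };

// Hypothetical model of one phase-0 commit wave. regs[i] is stage i's input
// register. Returns the number of sweeps in which some stage made progress.
int commit_wave(std::vector<PipelineReg>& regs) {
    assert(!regs.empty());
    const std::size_t n = regs.size();
    std::vector<bool> pending(n);              // stage still has to commit
    for (std::size_t i = 0; i < n; ++i) pending[i] = regs[i].valid;

    int sweeps = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < n; ++i) {  // branches in written order
            if (!pending[i]) continue;
            if (i + 1 == n) {
                regs[i].valid = false;         // last stage retires in place
            } else if (!regs[i + 1].valid) {
                regs[i + 1] = regs[i];         // copy payload, valid stays true
                regs[i].valid = false;         // free our own slot
            } else {
                continue;                      // downstream busy: still blocked
            }
            pending[i] = false;
            progress = true;
        }
        if (progress) ++sweeps;
    }
    return sweeps;
}
```

Starting from a full four-stage pipeline, the model needs four productive sweeps: the last stage retires first, and each later sweep unblocks the stage just upstream. Afterwards every instruction has moved one stage to the right and stage 0's register is free for the next fetch.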


Building and running

From the pipelined_processor/ directory:

bash compile.sh
./sitar_sim 40

compile.sh runs sitar translate PipelinedProcessor.sitar and then sitar compile -d Output/ -d ./. The extra -d ./ adds the current directory to the include path so that PipelineTypes.h is found by the generated code.


Expected output

Model size (size of TOP in Bytes):3488
Running simulation...
Maximum simulation time = 40 cycles

(0,1)TOP.proc   :| Fetch     (t=0,pc=0)  | Decode    (---)       | Execute   (---)       | Writeback (---)       |
(1,1)TOP.proc   :| Fetch     (t=1,pc=0)  | Decode    (t=0,pc=0)  | Execute   (---)       | Writeback (---)       |
(2,1)TOP.proc   :| Fetch     (t=0,pc=1)  | Decode    (t=1,pc=0)  | Execute   (t=0,pc=0)  | Writeback (---)       |
(3,1)TOP.proc   :| Fetch     (t=1,pc=1)  | Decode    (t=0,pc=1)  | Execute   (t=1,pc=0)  | Writeback (t=0,pc=0)  |
(4,1)TOP.proc   :| Fetch     (t=0,pc=2)  | Decode    (t=1,pc=1)  | Execute   (t=0,pc=1)  | Writeback (t=1,pc=0)  |
(5,1)TOP.proc   :| Fetch     (t=1,pc=2)  | Decode    (t=0,pc=2)  | Execute   (t=1,pc=1)  | Writeback (t=0,pc=1)  |
(6,1)TOP.proc   :| Fetch     (t=0,pc=3)  | Decode    (t=1,pc=2)  | Execute   (t=0,pc=2)  | Writeback (t=1,pc=1)  |
(7,1)TOP.proc   :| Fetch     (t=1,pc=3)  | Decode    (t=0,pc=3)  | Execute   (t=1,pc=2)  | Writeback (t=0,pc=2)  |
(8,1)TOP.proc   :| Fetch     (t=0,pc=4)  | Decode    (t=1,pc=3)  | Execute   (t=0,pc=3)  | Writeback (t=1,pc=2)  |
(9,1)TOP.proc   :| Fetch     (t=1,pc=4)  | Decode    (t=0,pc=4)  | Execute   (t=1,pc=3)  | Writeback (t=0,pc=3)  |
(10,1)TOP.proc  :| Fetch     (t=0,pc=5)  | Decode    (t=1,pc=4)  | Execute   (t=0,pc=4)  | Writeback (t=1,pc=3)  |
(11,1)TOP.proc  :| Fetch     (t=1,pc=5)  | Decode    (t=0,pc=5)  | Execute   (t=1,pc=4)  | Writeback (t=0,pc=4)  |
(12,1)TOP.proc  :| Fetch     (t=0,pc=6)  | Decode    (t=1,pc=5)  | Execute   (t=0,pc=5)  | Writeback (t=1,pc=4)  |
(13,0)TOP.proc.writeback:Writeback: simulation stopped upon reaching stopping criteria, num executed=10
Simulation stopped at time (13,0)
  • Cycles 0–2: fill-up. Each cycle a new instruction enters Fetch and everything downstream shifts right by one.
  • Cycle 3 onward: steady state. All four stages are active; one instruction retires every cycle. The two threads alternate (round-robin active_thread), and each thread's PC advances independently.
  • Cycle 13: stop. The 10th instruction, (t=1,pc=4), occupies Writeback during cycle 12 (visible in the (12,1) log line). Writeback retires it at (13,0) once its one-cycle DELAY has elapsed, the stop check fires immediately, and stop simulation halts the run at (13,0).

The (N,1) prefix on the log lines comes from the logging branch running in phase 1, which gives the stable end-of-cycle snapshot. The final stop message has the prefix (13,0)TOP.proc.writeback: because it is emitted from the Writeback stage's own logger in phase 0.


Adapting this pattern

Varying pipeline depth or stage delays

To lengthen the Execute stage, change its instantiation: procedure execute : Stage<3>. The handshake absorbs the change automatically: Fetch and Decode stall cleanly whenever Execute is busy, and throughput drops accordingly. To change the number of threads, change the NUM_THREADS template argument in Top. To add a stage, extend NUM_STAGES and the stage_names[] array, declare another Stage procedure instance, wire its id, name, and ThreadData pointers in init, and add a run branch for it to the parallel block.
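The effect of unequal delays on throughput can be estimated with a small cycle-level model in plain C++. This is a sketch under stated assumptions rather than a substitute for running the Sitar model: retired_after is a hypothetical name, only valid bits are tracked (no payload), every delay is restricted to at least 1 to keep the loop simple, and each cycle is modeled as repeated sweeps over the stages until nothing changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cycle-level model of the valid-bit handshake (not Sitar).
// delays[i] is stage i's DELAY. Returns how many instructions the last
// stage has retired once cycles 0..last_cycle have been simulated.
int retired_after(const std::vector<int>& delays, int last_cycle) {
    assert(!delays.empty());
    for (int d : delays) assert(d >= 1);   // sketch restriction: every delay >= 1
    const std::size_t n = delays.size();
    std::vector<bool> valid(n, false);     // valid bit of each stage's input register
    std::vector<bool> busy(n, false);      // stage has acquired an instruction
    std::vector<int>  ready_at(n, 0);      // cycle at which its processing is done
    int retired = 0;

    for (int t = 0; t <= last_cycle; ++t) {
        bool progress = true;
        while (progress) {                 // sweep until no stage can move
            progress = false;
            for (std::size_t i = 0; i < n; ++i) {
                if (!busy[i]) {                                  // step 1: acquire
                    if (i == 0 && !valid[0]) valid[0] = true;    // Fetch self-fills
                    if (valid[i]) {
                        busy[i] = true;
                        ready_at[i] = t + delays[i];             // step 2: processing time
                        progress = true;
                    }
                }
                if (busy[i] && t >= ready_at[i]) {               // step 3: commit or retire
                    if (i + 1 == n) {
                        valid[i] = false; busy[i] = false;
                        ++retired; progress = true;
                    } else if (!valid[i + 1]) {
                        valid[i + 1] = true; valid[i] = false;
                        busy[i] = false; progress = true;
                    }
                }
            }
        }
    }
    return retired;
}
```

In this model, retired_after({1, 1, 1, 1}, 13) evaluates to 10, in line with the run shown under Expected output, while retired_after({1, 1, 3, 1}, 13) evaluates to 3: with a three-cycle Execute stage, one instruction retires every three cycles once the pipeline has filled.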

Real stage functionality

The stages here carry only thread_id and pc. To model real behavior, extend PipelineReg with additional fields (opcode, operands, result), have Fetch populate them (e.g. from an instruction memory submodule), and have Execute/Writeback act on them. The pipeline skeleton stays the same.