Project Nord · 140M SNN Language Model · Built from Scratch

Inside the Spiking Engine

Standard AI models perform dense, expensive matrix multiplications for every single token. Biological brains do not. They use sparse, 1-bit binary spikes, firing only 3-9% of the time. This page traces one word through Nord's v4 computational pipeline, step by step, with each animation tied back to the current source code and clearly labeled when it uses a small readability-focused toy view.

Project Nord solves five problems that every prior SNN language model either failed at or avoided entirely, all in one novel, from-scratch architecture. No distillation, no teacher transformer. Let's trace the journey.

Step 1 · Time Expansion

The Multi-Scale Temporal Encoder

When you type "Hello", a standard transformer turns it into a single 496-dimensional float vector and stops there. A spiking network can't process a frozen static frame – it requires a stream of input current flowing across time.

Nord's TemporalSpikeEncoder solves this by expanding that one static vector into 10 discrete timesteps of injected current, split across two functional scales:

nord_v4_models/nord_core_v4.py · TemporalSpikeEncoder.forward()
# Token embedding + learned temporal projection
x = temporal_proj(embed(token_ids))     # (B×S, D=496)

# Fast path: T=8 steps, each with its own learned gate per dimension
fast_gates = torch.sigmoid(fast_basis)   # (8, 496) – learned per-step gate
fast = fast_gates * x * drive_scale      # drive_scale ≈ 25.0

# Slow path: T_slow=2 steps, weaker amplitude
slow_gates = torch.sigmoid(slow_basis)   # (2, 496) – different learned gate
slow = slow_gates * x * slow_scale       # slow_scale ≈ 8.0

# Concatenate → (10, B×S, 496) injected current tensor
current_in = torch.cat([fast, slow], dim=0)
Toy view The chart below shows one representative scalar slice out of a 496-dimensional current tensor. The real encoder applies learned per-dimension gates across all 496 channels at every timestep.
The static word embedding (bottom bar) is gated and scaled into 10 temporal current streams. Teal bars (T1-T8, fast) follow the larger v4 drive scale of 25.0. Amber bars (T9-T10, slow) follow the smaller slow scale of 8.0. Each bar's height shows injected current for that timestep after learned gating.
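The gating arithmetic above can be reproduced in a few lines. This is a minimal NumPy sketch, not the real PyTorch module: the toy dimension and the random bases standing in for the learned fast_basis / slow_basis are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                 # toy dimension (real model: D = 496)
x = rng.normal(size=D)                # one token's projected embedding

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random bases stand in for the learned fast_basis / slow_basis parameters
fast_gates = sigmoid(rng.normal(size=(8, D)))   # (T_fast=8, D), each in (0, 1)
slow_gates = sigmoid(rng.normal(size=(2, D)))   # (T_slow=2, D)

fast = fast_gates * x * 25.0          # drive_scale ≈ 25.0
slow = slow_gates * x * 8.0           # slow_scale ≈ 8.0
current_in = np.concatenate([fast, slow], axis=0)   # (10, D) current stream
```

The fast path dominates in amplitude (scale 25.0 vs 8.0) while both paths follow the sign pattern of the underlying embedding.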
Step 2 · Binary Conversion

The AssociativeLIF Neuron – Integrate, Fire, Reset

We now have 10 timesteps of continuous current. The AssociativeLIF (Leaky Integrate-and-Fire) block converts this stream into 1-bit binary spikes, achieving the 91% sparsity that makes Nord energy-efficient.

Each timestep runs three coupled equations: the synaptic current (i_syn) integrates incoming drive with exponential decay (tau_syn = 0.5), the membrane voltage (v_mem) integrates that current with leak (tau_mem = 0.85), and if voltage crosses v_threshold = 0.12 a spike fires. The neuron then enters a hard refractory period of 2 timesteps where voltage is clamped to v_reset = -0.1, preventing any re-firing.

After each spike, Cascade amplification (Step 3) injects a sub-threshold ripple back into i_syn to keep gradient flow alive.

Both v_threshold and tau_mem are learnable, clamped to safe ranges ([0.05, 0.5] and [0.8, 0.98] respectively) to prevent instability. An Auxiliary Spike Regulator monitors each layer's firing rate, targeting 3% with an asymmetric penalty: under-firing is penalized harder than over-firing, and any layer dropping below 1% triggers a 10× anti-death correction. This homeostatic loop keeps sparsity at a stable 91%.

nord_v4_models/nord_core_v4.py · AssociativeLIF.forward() - per-timestep loop
for t in range(T_total):           # T_total = 10
    # 1. Synaptic integration (low-pass filter on input)
    i_syn = beta_syn * i_syn + current_in[t]     # beta_syn ≈ 0.50

    # 2. Membrane voltage: leak or clamp during refractory
    if refractory:
        v_mem = v_reset   # = -0.1, hard clamp
    else:
        v_mem = beta_mem * v_mem + (1-beta_mem) * i_syn   # beta_mem ≈ 0.85

    # 3. Fire if above threshold (ATan surrogate gradient for backprop)
    spike = spike_fn(v_mem, threshold)           # → 0 or 1

    # 4. Soft reset + cascade ripple injection
    v_mem = v_mem - spike * threshold            # soft reset (≈ 0)
    if spike.sum() > 0:
        i_syn += cascade_amplify(spike)          # neighbor ripple (Step 3)
    refrac_counter = set_refractory(spike, 2)  # 2-step lock
Oscilloscope trace of a single LIF neuron. Bottom (Green) = filtered synaptic current i_syn. Middle (Teal) = membrane voltage v_mem. Notice the sub-threshold cascade injection (amber): it raises v_mem slightly but doesn't trigger a spike, keeping gradients alive. When the main burst pushes v_mem over the threshold 0.12, Top (Coral) spikes fire, and the neuron enters a 2-step refractory clamp at −0.1.
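To see the three equations interact, here is a single-neuron, plain-Python sketch of the forward dynamics only: no surrogate gradient, no cascade ripple, and a made-up input current. The constants mirror those quoted above.

```python
def lif_trace(current_in, beta_syn=0.5, beta_mem=0.85,
              threshold=0.12, v_reset=-0.1, refrac_steps=2):
    """Forward LIF dynamics only: no surrogate gradient, no cascade ripple."""
    i_syn, v_mem, refrac = 0.0, 0.0, 0
    spikes, voltages = [], []
    for c in current_in:
        i_syn = beta_syn * i_syn + c                    # synaptic low-pass
        if refrac > 0:
            v_mem = v_reset                             # hard refractory clamp
            refrac -= 1
        else:
            v_mem = beta_mem * v_mem + (1 - beta_mem) * i_syn
        spike = 1 if v_mem > threshold else 0
        v_mem -= spike * threshold                      # soft reset
        if spike:
            refrac = refrac_steps                       # 2-step lock
        spikes.append(spike)
        voltages.append(v_mem)
    return spikes, voltages

# Three steps of strong drive, then silence
spikes, volts = lif_trace([1.0, 1.0, 1.0] + [0.0] * 7)
```

With this drive the neuron fires at t=0, sits clamped at −0.1 for two steps, recovers, and fires once more as the residual synaptic current pushes it back over threshold.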
Step 3 · Preventing Death

The Associative Cascade – Keeping Gradients Alive

Here's the central paradox of deep spiking networks: efficiency demands silence, but silence kills learning. At 91% sparsity, backpropagation gradients multiply through seas of zeros and vanish before reaching early layers. Nord's Associative Cascade solves this.

The D=496 neurons are partitioned into 64 topological clusters arranged in a ring. When neurons in a cluster fire, they send a soft ripple of sub-threshold current through a learnable 64×64 weight matrix. In the real code that matrix is passed through a sigmoid, so the cascade is soft and non-negative everywhere. Nearby clusters start with stronger couplings, and this view emphasizes those strongest local links so the structure stays readable.

Crucially, the neighbor weights and per-cluster gains are learned during training, so the network decides which cluster topologies are most useful. (Notice that the cascade also reverberates back into the originating cluster! While the neuron that just fired is protected by a hard refractory period, its silent neighbors in the same cluster receive the sub-threshold ripple, priming them for subsequent timesteps.)

nord_v4_models/nord_core_v4.py · AssociativeLIF._cascade_amplify()
# Group D=496 neurons into nc=64 clusters
cluster_fire = scatter_add(spikes, cluster_ids) / (D // nc)  # (B, 64)

# Learned soft neighbor weight matrix – sigmoid keeps weights in (0,1)
W = torch.sigmoid(neighbor_weights)             # (64, 64)

# Each cluster receives weighted sum of neighbors' spike rates
neighbor_signal = (W @ cluster_fire.T).T          # (B, 64)

# Per-cluster learnable gain (not one global scalar)
neighbor_signal = neighbor_signal * cluster_gain   # (B, 64)

# Gather back to neurons by cluster membership → added to i_syn
idx = cluster_ids.unsqueeze(0).expand(B, -1)      # (B, D=496)
return neighbor_signal.gather(1, idx)             # per-neuron sub-threshold ripple
🖱 Hover to trigger spikes
Interactive: How does a 496-dimensional tensor use a flat ring? Through Scatter-Gather. The Outer Ring represents the 496 individual neurons in a layer. The Inner Ring represents the 64 topological clusters. Hover to trigger an outer neuron (coral). Watch it send its spike down to its cluster (scatter_add). The cluster sends lateral ripples along the inner ring (W matrix). This view emphasizes the strongest local couplings for readability. In the real model the learned sigmoid-bounded matrix stays soft and non-negative across all cluster pairs. The returning signal (gather) raises constituent neurons into a teal sub-threshold state.
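The scatter → ripple → gather round trip can be sketched with NumPy. Toy sizes (16 neurons, 4 clusters) and a hand-built ring-biased matrix replace the learned neighbor_weights and cluster_gain; only the data flow matches the excerpt above.

```python
import numpy as np

D, NC = 16, 4                            # toy: 16 neurons, 4 clusters (real: 496, 64)
cluster_ids = np.arange(D) // (D // NC)  # neuron → cluster membership
spikes = np.zeros(D)
spikes[0] = 1.0                          # one neuron in cluster 0 fires

# scatter_add: per-cluster mean firing rate
cluster_fire = np.zeros(NC)
np.add.at(cluster_fire, cluster_ids, spikes)
cluster_fire /= D // NC                  # (NC,) → [0.25, 0, 0, 0]

# Ring-biased soft matrix stands in for sigmoid(neighbor_weights):
# stronger self / right-neighbor coupling, but non-negative everywhere
raw = np.eye(NC) + np.roll(np.eye(NC), 1, axis=1)
W = 1.0 / (1.0 + np.exp(-raw))           # all entries in (0, 1)

neighbor_signal = W @ cluster_fire       # each cluster hears its neighbors
neighbor_signal *= 0.5                   # stand-in for per-cluster gain

# gather: broadcast each cluster's ripple back to its member neurons
ripple = neighbor_signal[cluster_ids]    # (D,) sub-threshold current
```

Every neuron receives a strictly positive ripple (the sigmoid keeps the matrix soft everywhere), with the firing cluster and its ring neighbor receiving the most.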
Step 4 · Attention Replacement

Sparse Spiking Synaptic Resonance

Standard Transformers use Self-Attention: an expensive O(S²) operation that computes a similarity score between every pair of tokens using continuous float vectors. Nord replaces this entirely with Spiking Synaptic Resonance.

Query and Key projections are each passed through their own LIF neurons, producing binary spike patterns across T=10 timesteps. Rather than naively flattening time and head dimensions, v4.2 uses learned temporal mixing weights: each timestep gets a softmax-normalized importance score, Q/K are formed as weighted sums over time, and RoPE is applied before the resonance matrix is scored. The resonance score between positions is then the dot product of these temporally mixed representations.

To save memory, only the Top-K = min(64, S) resonance scores per query position are kept. All others are masked to −inf before the softmax, which zeroes their attention weights – at a typical sequence length of 512 this makes attention 87.5% sparse.

nord_v4_models/nord_core_v4.py · SpikingSynapticResonance.forward()
# Project to Q, K, V – then spike Q and K independently
q_spikes, _ = lif_q(W_q(x))  # (T=10, B×S, D) binary spikes
k_spikes, _ = lif_k(W_k(x))  # (T=10, B×S, D) binary spikes

# FIX E: Learned temporal mixing (not naive T*Dh flattening)
tw_q = softmax(temporal_mix_q)    # (T,) learned per-timestep weight
tw_k = softmax(temporal_mix_k)    # (T,) learned per-timestep weight
q = (q_spikes * tw_q).sum(dim=0)  # weighted sum over time → (B×S, D)
k = (k_spikes * tw_k).sum(dim=0)  # preserves spike timing semantics
# reshape to heads (B, H, S, Dh) + RoPE, then score
resonance = q @ k.transpose(-2, -1)  # (B, H, S, S)

# Causal mask (no peeking at future tokens)
resonance.masked_fill_(future_mask, float("-inf"))

# Top-K Sparsity: keep only top-64 per query row
sparse_res = torch.full_like(resonance, float("-inf"))
sparse_res.scatter_(-1, top_k_indices, top_k_values)  # mask the rest to -inf
attn = softmax(sparse_res)                            # masked entries → weight 0
Toy view The canvas below uses 10 tokens and keeps top-3 links per query so the pattern stays legible. The real v4 code mixes Q and K across time, applies RoPE, and then keeps K = min(64, S).
Left: binary Q-spike patterns (teal = fired). Top: binary K-spike patterns. Center: resonance heatmap – brighter = more co-firing (stronger connection). The patterns dynamically regenerate to simulate continuous processing. The scanner (coral row) processes one query position at a time and applies Top-K masking: in this toy view only the top-3 strongest connections per row survive so the plot stays readable. The real v4 code keeps K = min(64, S) after temporal mixing and RoPE. RoPE itself is part of the code path but is not visualized in this small canvas.
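The temporal-mixing and Top-K masking path can be sketched end to end in NumPy. Everything here is a toy: random binary spikes replace the LIF outputs, RoPE and the head reshape are omitted, and K = 3 over 6 positions stands in for min(64, S).

```python
import numpy as np

rng = np.random.default_rng(1)
T, S, Dh, K = 10, 6, 4, 3                # toy sizes; real model: K = min(64, S)

q_spikes = rng.integers(0, 2, size=(T, S, Dh)).astype(float)   # binary Q spikes
k_spikes = rng.integers(0, 2, size=(T, S, Dh)).astype(float)   # binary K spikes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Learned temporal mixing → weighted sum over time (RoPE, heads omitted)
tw_q = softmax(rng.normal(size=T), axis=0)
tw_k = softmax(rng.normal(size=T), axis=0)
q = (q_spikes * tw_q[:, None, None]).sum(axis=0)   # (S, Dh)
k = (k_spikes * tw_k[:, None, None]).sum(axis=0)   # (S, Dh)

resonance = q @ k.T                                # (S, S) co-firing scores
resonance[np.triu_indices(S, k=1)] = -np.inf       # causal mask

# Top-K: per query row, mask everything below the K-th largest score
for row in resonance:                              # rows are views → edits stick
    finite = row[np.isfinite(row)]
    if finite.size > K:
        row[row < np.sort(finite)[-K]] = -np.inf
attn = softmax(resonance, axis=-1)                 # masked entries become 0
```

Each row of `attn` sums to 1, respects causality, and has at most K nonzero entries.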
Step 5 · Expert Routing

Spike-Driven Mixture of Experts

In the association zone (blocks 3–4), each token is routed to 2 of 4 specialized experts based on its spike-rate pattern. Unlike standard MoE that routes on dense token embeddings, Nord's router computes per-cluster firing rates from the binary spike output, groups them by expert, and selects the top-2 highest-scoring experts per token.

Each expert is a full up → LIF → down → LIF sub-network. Dispatch is vectorized: the code loops over 4 experts (not 2048 tokens), extracting each expert's assigned token batch in one operation. A load-balancing loss (0.01 × n_experts × Σ(freq × prob)) prevents expert collapse – ensuring all 4 experts stay utilized.

nord_v4_models/nord_core_v4.py · SpikeDrivenMoE.forward()
# Route LIF: spike the input to get binary patterns
route_spikes, _ = route_lif(x)          # (T, N, D)
# Per-cluster firing rates → grouped by expert
cluster_rates = scatter_add(route_spikes.mean(dim=0), cluster_ids) / cluster_size
expert_scores = cluster_rates.reshape(N, 4, 16).mean(-1)

# Top-2 expert selection + softmax weights
top_scores, top_idx = topk(expert_scores, k=2)
weights = softmax(top_scores)           # (N, 2)

# Vectorized dispatch: loop over 4 experts, not N tokens
for e in range(4):
    hit = (top_idx == e)                          # (N, 2): which slot chose e
    active = hit.any(dim=-1)
    w_e = (weights * hit).sum(dim=-1)             # per-token weight for expert e
    output[:, active] += expert_e(x[:, active]) * w_e[active]
Tokens arrive (top) and the spike-rate router assigns each to 2 of 4 experts. Each expert column has its own LIF neurons (internal spike patterns). The load-balance meter (bottom) shows expert utilization – all 4 should stay roughly equal.
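The routing arithmetic can be checked with a NumPy toy. Random scores stand in for the spike-rate-derived expert_scores, and the "expert" is just x → 2x so the weighted dispatch can be verified exactly; the 0.01 coefficient follows the load-balance formula quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, E = 8, 4                                  # toy: 8 tokens, 4 experts

# Stand-in for the cluster-rate scores (real code: reshape(N, 4, 16).mean(-1))
expert_scores = rng.normal(size=(N, E))

# Top-2 selection + softmax weights per token
top_idx = np.argsort(expert_scores, axis=-1)[:, -2:]            # (N, 2)
top_scores = np.take_along_axis(expert_scores, top_idx, axis=-1)
w = np.exp(top_scores)
w /= w.sum(axis=-1, keepdims=True)                              # (N, 2)

# Load-balancing loss: 0.01 * n_experts * sum(freq * prob)
probs = np.exp(expert_scores)
probs /= probs.sum(axis=-1, keepdims=True)
freq = np.bincount(top_idx.ravel(), minlength=E) / (2 * N)      # dispatch share
lb_loss = 0.01 * E * (freq * probs.mean(axis=0)).sum()

# Vectorized dispatch: loop over 4 experts, not N tokens.
# Toy "expert" is x → 2x, so the weighted sum must reproduce 2x exactly.
x = rng.normal(size=(N, 3))
out = np.zeros_like(x)
for e in range(E):
    hit = (top_idx == e)                     # (N, 2): which slot chose e
    active = hit.any(axis=-1)
    gain = (w * hit).sum(axis=-1)[active]    # per-token weight for expert e
    out[active] += (2.0 * x[active]) * gain[:, None]
```

Because each token's two expert weights sum to 1, the combined output equals 2x exactly, confirming no token is dropped or double-counted.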
Step 6 · Persistent Memory

Memory Cortex - Gated Temporal Memory

Between the association blocks and the executive zone, Nord maintains a 128-slot persistent memory bank. Unlike attention (which is reset per forward pass), the Memory Cortex's LIF neurons use tau_mem = 0.99 – a near-zero leak that retains information across many tokens.

The write path is persistent and direct: input is projected into memory space and integrated by memory_lif. Separately, a gate path projects the same input through gate_lif, averages its activity across time, and converts that into a read gate with a learned gate_threshold = 0.3. Four attention read heads then read across the full memory membrane trace over all timesteps, and the gated readout is mixed back into the main signal through the learnable memory_mix parameter.

nord_v4_models/nord_core_v4.py · MemoryCortex.forward()
# Write: project input → persistent memory LIF (τ=0.99)
mem_spikes, v_mem = memory_lif(to_memory(x))   # (T, N, 128)

# Gate: build a read gate from gate spikes, separate from writes
gate_sig = gate_spikes.mean(dim=0)
gate = sigmoid((gate_sig - gate_threshold) * 10)

# Read: 4-head temporal attention over ALL timesteps
attn = softmax(read_query @ read_key(v_mem))   # (T, N, 4)
mem_read = (v_mem * attn).sum(dim=0) * gate    # gated readout

# Mix back into main signal
x = x + memory_mix * from_memory(mem_read)     # mix ≈ 0.1
Read gate In v4 the gate scales the memory readout after temporal attention. It does not block the persistent write path into memory_lif.
The 128 memory slots (grid) show persistent state. The amber write bar on the left drives memory updates independently of the gate. The green bar on the right is the read gate, which scales how much of the temporally attended readout is mixed back into the main stream. The four colored rows below the grid show temporal read attention across the 10 internal timesteps. The persistent LIF (τ=0.99) barely decays compared with the normal τ=0.85 reference trace.
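The difference between the persistent memory LIF and an ordinary one is just the decay constant. A NumPy sketch with a single write pulse and a made-up gate activity level (the real gate comes from gate_lif spikes) makes the retention gap concrete.

```python
import numpy as np

T = 10
drive = np.zeros(T)
drive[0] = 1.0                              # single write pulse at t=0

def membrane_trace(beta):
    v, trace = 0.0, []
    for c in drive:
        v = beta * v + (1 - beta) * c       # leaky integration
        trace.append(v)
    return np.array(trace)

persistent = membrane_trace(0.99)           # memory LIF: near-zero leak
normal     = membrane_trace(0.85)           # ordinary LIF, for comparison

# Read gate: sigmoid((mean gate_lif activity − gate_threshold) × 10)
gate_activity = 0.5                         # made-up mean gate spike rate
gate = 1.0 / (1.0 + np.exp(-(gate_activity - 0.3) * 10.0))

# Gated readout, mixed back with memory_mix ≈ 0.1
readout = gate * persistent.mean()
x_out = 1.0 + 0.1 * readout                 # x + memory_mix * from_memory(read)
```

After nine silent steps the τ = 0.99 trace still holds over 90% of its peak, while the τ = 0.85 trace has lost more than two thirds.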
Step 7 · Code Appendix

Reward-Modulated STDP - Defined in Code, Inactive in v4 Path

Inactive in current v4 path The old v3 chat path applied online STDP updates. In nord_v4_models the STDPEngine still exists, but the current NordModel.forward() and chat_v4.py path do not call those updates.

Nord v3 could update its own weights during inference using Spike-Timing Dependent Plasticity (STDP). The current v4 codebase still contains that bounded STDP engine, now restricted to the executive zone with tighter weight bounds (w_min=-0.15, w_max=0.5) and a max_update_norm=0.01 clamp. This section documents the mechanism that remains in code, not an active stage in the latest forward path.

If re-enabled, standard STDP would still be blind to whether the network is actually improving. Nord addresses that with a Reward Signal: the final weight update is multiplied by 2 × sigmoid(loss_EMA − current_loss) − 1, which approaches +1 when predictions are improving (LTP preserved) and −1 when they're worsening (LTD reversal).

nord_v4_models/nord_core_v4.py · STDPEngine (defined, currently not invoked)
for t in range(T):
    # Decaying pre- and post-synaptic eligibility traces
    trace_pre  = decay_plus  * trace_pre  + pre_spikes[t]   # tau_plus  = 20
    trace_post = decay_minus * trace_post + post_spikes[t]  # tau_minus = 20

    # LTP: post fires → reinforce connections from active pre neurons
    dW += a_plus  * outer(post_spikes[t], trace_pre)         # a_plus  = 0.005
    # LTD: pre fires  → weaken connections to recently-active post
    dW -= a_minus * outer(trace_post,     pre_spikes[t])     # a_minus = 0.005

# Reward modulation: aligns local Hebbian with global LM loss
reward = sigmoid(loss_EMA - current_loss)     # (0,1): above 0.5 ⇒ improving
dW_final = dW * (2.0 * reward - 1.0)          # maps to (−1, +1)
# Magnitude bound + zone check (FIX F: executive zone only)
if dW_final.norm() > max_update_norm:         # max_update_norm = 0.01
    dW_final = dW_final * (max_update_norm / dW_final.norm())
layer.weight = (layer.weight + dW_final).clamp(-0.15, 0.5)  # w_min, w_max
Four synchronized traces. Pre-neuron spikes (coral) launch decaying eligibility traces (Track 2). Post-neuron spikes (teal) launch their own traces (Track 3). The bottom track maps the final reward-modulated ΔW updates: green bars indicate synaptic strengthening (LTP), while red bars indicate weakening (LTD). This is an explanatory appendix view only. The current v4 inference path does not apply STDP updates.
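Since the engine is dormant in the v4 path, here is a self-contained NumPy sketch of what it would compute: random toy spike trains, the trace/outer-product updates from the excerpt, reward modulation with made-up loss values (loss_EMA = 0.9 vs current = 0.7, i.e. improving), and the norm clamp.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_pre, n_post = 10, 3, 2
pre  = rng.integers(0, 2, size=(T, n_pre)).astype(float)    # toy spike trains
post = rng.integers(0, 2, size=(T, n_post)).astype(float)

decay = np.exp(-1.0 / 20.0)                  # tau_plus = tau_minus = 20
a_plus = a_minus = 0.005
trace_pre, trace_post = np.zeros(n_pre), np.zeros(n_post)
dW = np.zeros((n_post, n_pre))
for t in range(T):
    trace_pre  = decay * trace_pre  + pre[t]                # eligibility traces
    trace_post = decay * trace_post + post[t]
    dW += a_plus  * np.outer(post[t], trace_pre)            # LTP
    dW -= a_minus * np.outer(trace_post, pre[t])            # LTD

# Reward modulation with made-up losses: EMA 0.9 vs current 0.7 → improving
reward = 1.0 / (1.0 + np.exp(-(0.9 - 0.7)))
dW_final = dW * (2.0 * reward - 1.0)         # positive: LTP/LTD direction kept

# Magnitude bound: max_update_norm = 0.01
norm = np.linalg.norm(dW_final)
if norm > 0.01:
    dW_final *= 0.01 / norm
```

With the loss improving, the modulation factor is positive, so LTP stays LTP; had current_loss exceeded the EMA, the same factor would flip every update's sign.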
Step 8 · Protecting State

LeakyClamp – Keeping the Sub-Threshold Ghost Alive

In virtually every modern neural network, ReLU is the non-linearity: for any positive input, pass it through; for any negative input, return exactly zero. For a traditional classification network this is fine. For a spiking network it is fatal.

A LIF neuron's natural resting voltage is v_reset = -0.1 – intrinsically negative. ReLU annihilates this: the entire sub-threshold state, which carries rich information about how close a neuron is to firing, becomes identically zero. Gradients cannot flow through zero; neurons can never recover from silence.

In the sensory and association zones, Nord uses LeakyClamp. For positive values it's identity; for negative values it applies a learned per-channel leak slope (≈0.1) – gently compressing negative signals down to a learnable floor (initialized at −0.1). The sub-threshold ghost is preserved.

The executive zone is the exception: it uses force_nonneg=True, which applies standard F.relu(x) instead. This intentionally kills negative spike propagation before the readout, ensuring only clean positive activations reach the output head.

nord_v4_models/nord_core_v4.py · LeakyClamp.forward()
# FIX M: Executive zone forces non-negative output
if self.force_nonneg:
    return F.relu(x)                    # executive: no negatives allowed

# Sensory / Association zones: preserve sub-threshold state
neg_part = (self.leak * x).clamp(min=self.floor)  # leak ∈ (0,1)
return torch.where(x >= 0, x, neg_part)       # floor ≈ -0.1

# leaky_clamp(-0.1) → 0.1*(-0.1) = -0.01 (preserved!)
# executive(-0.1) → relu(-0.1) = 0.0 (clean output)
Three non-linearity regimes in v4.2: Standard ReLU (left) kills everything below zero - dead neurons, no recovery. LeakyClamp (center) preserves negatives with gentle slope (0.1×) and learnable floor at −0.1 - used in sensory and association zones. Executive zone (right) intentionally uses force_nonneg=True (F.relu) to prevent negative spike contamination before readout - same shape as ReLU, but by design, not by accident.
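All three regimes fit in one small function. This NumPy sketch fixes leak = 0.1 and floor = −0.1 as scalars; in the real module both are learnable per channel.

```python
import numpy as np

def leaky_clamp(x, leak=0.1, floor=-0.1, force_nonneg=False):
    """Sketch of LeakyClamp: identity above 0, leaky slope + floor below."""
    if force_nonneg:
        return np.maximum(x, 0.0)                 # executive zone: plain ReLU
    neg = np.maximum(leak * x, floor)             # gentle slope, clamped floor
    return np.where(x >= 0, x, neg)

x = np.array([-2.0, -0.1, 0.0, 0.5])
y = leaky_clamp(x)                           # → [-0.1, -0.01, 0.0, 0.5]
y_exec = leaky_clamp(x, force_nonneg=True)   # → [0.0, 0.0, 0.0, 0.5]
```

Note how −2.0 is compressed onto the floor (−0.1) while the resting voltage −0.1 survives as −0.01, exactly the worked values in the excerpt's comments.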
Step 9 · The Final Bottleneck

EMA Temporal Readout - From Spikes to Floats

After 6 zonal blocks plus the Memory Cortex, the network holds 10 timesteps of scattered 1-bit binary spikes. How do you accurately predict which of 128,000 English words comes next from scattered ones and zeros? You don't.

The final readout_lif runs the spike stream through one last LIF layer and keeps its continuous 16-bit membrane potential tensor v_membrane (10 × B·S × 496). Rather than doing a simple mean over timesteps (which discards temporal ordering), Nord applies an Exponential Moving Average, giving the most recent timestep the highest weight. The learnable decay α ≈ 0.8 means the contribution of timestep t scales as (1−α)·α^(9−t).

This hybrid readout (smoothed membrane potential + mean spike rate) bypasses the fundamental 1-bit bottleneck that makes SNN language modeling "impossible."

nord_v4_models/nord_core_v4.py · NordModel.forward() - hybrid readout
# Readout LIF: spikes AND membrane potential are returned
readout_spikes, v_membrane = readout_lif(x_flat)  # (T=10, B×S, D=496)

# EMA Temporal Smoothing Readout
alpha = sigmoid(readout_ema_raw)   # learnable α ≈ 0.80
ema = zeros(B*S, D)
for t in range(T_total):         # t=0..9
    ema = alpha * ema + (1 - alpha) * v_membrane[t]
# Weight of v[t] in final ema = (1-α)·α^(9−t)
# t=9 → 0.200 (most recent, highest)   t=0 → 0.027 (oldest, lowest)

# Hybrid: smoothed membrane + spike rate → LM head → 128k logits
readout = ema.reshape(B,S,D) + readout_spikes.mean(dim=0).reshape(B,S,D)
logits  = lm_head(layernorm(readout))  # (B, S, 128256)
10 membrane potential "frames" (rows, T0→T9) each carrying a 496-dimensional float signal across the horizontal columns. EMA weight (shown to the left of each row) grows exponentially: T9 contributes 7× more than T0. The bottom rows show the real hybrid readout: EMA(v_mem), the mean spike contribution, and the final EMA(v_mem) + mean(readout_spikes) vector that is fed into the 128k-vocab language head.
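The loop and the closed-form weights are equivalent, which a few NumPy lines can confirm. Toy D = 5 and a fixed α = 0.8 stand in for the learnable sigmoid(readout_ema_raw).

```python
import numpy as np

T, D, alpha = 10, 5, 0.8                     # toy D; real α = sigmoid(readout_ema_raw)
rng = np.random.default_rng(4)
v_membrane = rng.normal(size=(T, D))         # stand-in membrane trace

# Loop form, as in the readout excerpt
ema = np.zeros(D)
for t in range(T):
    ema = alpha * ema + (1 - alpha) * v_membrane[t]

# Closed form: weight of v[t] is (1 − α)·α^(9 − t)
weights = (1 - alpha) * alpha ** (T - 1 - np.arange(T))
closed = (weights[:, None] * v_membrane).sum(axis=0)
```

The most recent step carries weight 0.2 and the oldest 0.2·0.8⁹ ≈ 0.027, a ratio of roughly 7.5× – the "T9 contributes ~7× more than T0" figure in the caption.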
Global Map

The Complete Forward Pass

Everything assembled: one token enters at the top as an integer index. It exits at the bottom as next-token logits over 128,256 vocabulary entries. Every stage has a dedicated visualization above. This final view shows them all connected, with flowing particles representing the data tensor as it transforms through Nord's architecture.

Notice how the tensor shape changes at each critical stage: [1] (token) → [496] (embed) → [10×496] (temporal current) → [10×496] (input spikes) → [10×496] (sensory) → [10×496] (association / MoE) → [10×496] (memory mix-back) → [10×496] (executive) → [10×496] (readout membrane + spikes) → [496] (hybrid readout) → [128k] (logits).

Top-to-Bottom vertical pipeline. Watch the dense token embedding expand into 10 temporal lanes (teal = fast, amber = slow). At the LIF stages, the continuous data becomes 91% sparse across 3 specialized zones (Sensory, Association, Executive), with the Memory Cortex sitting between association and executive. Finally, it reconverges through the hybrid readout and expands into 128k next-token logits at the bottom.