The Multi-Scale Temporal Encoder
When you type "Hello", a standard transformer turns it into a single 496-dimensional float vector and stops there. A spiking network can't process a frozen static frame – it requires a stream of input current flowing across time.
Nord's TemporalSpikeEncoder solves this by expanding that one static vector into 10 discrete timesteps of injected current, split across two functional scales:
- T_fast (8 steps, scale = 25.0): High-amplitude, rapidly-varying drive. Each step uses a different learned sigmoid gate over the 496 dims, making it volatile – ideal for capturing local morpheme-level details.
- T_slow (2 steps, scale = 8.0): Gently suppressed, stable drive. Lower amplitude anchors the network to a broader "context summary" without overpowering the fast path.
# Token embedding + learned temporal projection
x = temporal_proj(embed(token_ids)) # (B×S, D=496)
# Fast path: T=8 steps, each with its own learned gate per dimension
fast_gates = torch.sigmoid(fast_basis) # (8, 496) – learned per-step gate
fast = fast_gates * x * drive_scale # drive_scale ≈ 25.0
# Slow path: T_slow=2 steps, weaker amplitude
slow_gates = torch.sigmoid(slow_basis) # (2, 496) – different learned gate
slow = slow_gates * x * slow_scale # slow_scale ≈ 8.0
# Concatenate → (10, B×S, 496) injected current tensor
current_in = torch.cat([fast, slow], dim=0)
The AssociativeLIF Neuron – Integrate, Fire, Reset
We now have 10 timesteps of continuous current. The AssociativeLIF (Leaky Integrate-and-Fire) block converts this stream into 1-bit binary spikes, achieving the 91% sparsity that makes Nord energy-efficient.
Each timestep runs three coupled equations: the synaptic current (i_syn) integrates incoming drive with exponential decay (tau_syn = 0.5), the membrane voltage (v_mem) integrates that current with leak (tau_mem = 0.85), and if voltage crosses v_threshold = 0.12 a spike fires. The neuron then enters a hard refractory period of 2 timesteps during which voltage is clamped to v_reset = -0.1, preventing any re-firing.
After each spike, Cascade amplification (Step 3) injects a sub-threshold ripple back into i_syn to keep gradient flow alive.
Both v_threshold and tau_mem are learnable, clamped to safe ranges ([0.05, 0.5] and [0.8, 0.98] respectively) to prevent instability.
An Auxiliary Spike Regulator monitors each layer's firing rate, targeting 3% with an asymmetric penalty: under-firing is penalized 3× harder than over-firing, and any layer dropping below 1% triggers a 10× anti-death correction. This homeostatic loop keeps sparsity at a stable 91%.
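The regulator's asymmetric penalty can be sketched as a scalar loss per layer. This is a minimal sketch with hypothetical names (`spike_regulator_loss` and its parameters are illustrative; the real module's API is not shown in this section):

```python
def spike_regulator_loss(firing_rate: float,
                         target: float = 0.03,
                         under_weight: float = 3.0,
                         death_floor: float = 0.01,
                         death_scale: float = 10.0) -> float:
    """Homeostatic penalty on one layer's mean firing rate.

    Under-firing is penalized `under_weight` (3x) harder than over-firing,
    and a layer below `death_floor` (1%) gets an extra `death_scale` (10x)
    anti-death correction.
    """
    err = firing_rate - target
    loss = (under_weight if err < 0 else 1.0) * err * err
    if firing_rate < death_floor:
        loss *= death_scale
    return loss
```

Under these assumptions, a layer firing at 2% is penalized three times as hard as one firing at 4%, and a nearly silent layer at 0.5% is pushed back an order of magnitude harder still.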
for t in range(T_total):  # T_total = 10
    # 1. Synaptic integration (low-pass filter on input)
    i_syn = beta_syn * i_syn + current_in[t]  # beta_syn ≈ 0.50
    # 2. Membrane voltage: leak normally, hard clamp during refractory
    if refractory:
        v_mem = v_reset  # = -0.1, hard clamp
    else:
        v_mem = beta_mem * v_mem + (1 - beta_mem) * i_syn  # beta_mem ≈ 0.85
    # 3. Fire if above threshold (ATan surrogate gradient for backprop)
    spike = spike_fn(v_mem, threshold)  # → 0 or 1
    # 4. Soft reset + cascade ripple injection
    v_mem = v_mem - spike * threshold  # soft reset (≈ 0)
    if spike.sum() > 0:
        i_syn += cascade_amplify(spike)  # neighbor ripple (Step 3)
    refrac_counter = set_refractory(spike, 2)  # 2-step lock
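The spike_fn comment points at an ATan surrogate gradient, a standard trick in SNN training. As a scalar sketch (the `alpha` width constant is an assumption, not a value from the source), the forward pass is a hard Heaviside step while the backward pass substitutes a smooth arctan-derivative bump:

```python
import math

def spike_forward(v_mem: float, threshold: float) -> float:
    """Forward pass: hard 0/1 spike (a non-differentiable Heaviside step)."""
    return 1.0 if v_mem >= threshold else 0.0

def atan_surrogate_grad(v_mem: float, threshold: float, alpha: float = 2.0) -> float:
    """Backward pass: smooth pseudo-derivative used in place of the step's
    zero-almost-everywhere gradient. Peaks at the threshold, decays away."""
    x = v_mem - threshold
    return alpha / (2.0 * (1.0 + (math.pi / 2.0 * alpha * x) ** 2))
```

At v_mem = threshold the surrogate equals alpha/2 and fades toward zero far from threshold, so neurons close to firing dominate the gradient signal.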
In the accompanying visualization, the bottom trace is the synaptic current i_syn and the middle (teal) trace is the membrane voltage v_mem. Notice the sub-threshold cascade injection (amber): it raises v_mem slightly but doesn't trigger a spike, keeping gradients alive. When the main burst pushes v_mem over the 0.12 threshold, spikes (coral, top) fire and the neuron enters the 2-step refractory clamp at −0.1.
The Associative Cascade – Keeping Gradients Alive
Here's the central paradox of deep spiking networks: efficiency demands silence, but silence kills learning. At 91% sparsity, backpropagation gradients multiply through seas of zeros and vanish before reaching early layers. Nord's Associative Cascade solves this.
The D=496 neurons are partitioned into 64 topological clusters arranged in a ring. When neurons in a cluster fire, they send a soft ripple of sub-threshold current through a learnable 64×64 weight matrix. In the real code that matrix is passed through a sigmoid, so the cascade is soft and non-negative everywhere. Nearby clusters start with stronger couplings, and the visualization emphasizes those strongest local links so the structure stays readable.
Crucially, the neighbor weights and per-cluster gains are learned during training, so the network decides which cluster topologies are most useful. (Notice that the cascade also reverberates back into the originating cluster! While the neuron that just fired is protected by a hard refractory period, its silent neighbors in the same cluster receive the sub-threshold ripple, priming them for subsequent timesteps.)
# Group D=496 neurons into nc=64 clusters
cluster_fire = scatter_add(spikes, cluster_ids) / (D // nc) # (B, 64)
# Learned soft neighbor weight matrix – sigmoid keeps weights in (0,1)
W = torch.sigmoid(neighbor_weights) # (64, 64)
# Each cluster receives weighted sum of neighbors' spike rates
neighbor_signal = (W @ cluster_fire.T).T # (B, 64)
# Per-cluster learnable gain (not one global scalar)
neighbor_signal = neighbor_signal * cluster_gain # (B, 64)
# Scatter back to neurons by cluster membership → added to i_syn
return neighbor_signal.gather(1, cluster_ids.expand(B, -1))  # (B, D=496)
In the visualization, cluster firing rates are pooled (scatter_add) and each cluster sends lateral ripples along the inner ring (the W matrix). The view emphasizes the strongest local couplings for readability; in the real model the learned sigmoid-bounded matrix stays soft and non-negative across all cluster pairs. The returning signal (gather) raises constituent neurons into a teal sub-threshold state.
Sparse Spiking Synaptic Resonance
Standard Transformers use Self-Attention: an expensive O(S²) operation that computes a similarity score between every pair of tokens using continuous float vectors. Nord replaces this entirely with Spiking Synaptic Resonance.
Query and Key projections are each passed through their own LIF neurons, producing binary spike patterns across T=10 timesteps. Rather than naively flattening time and head dimensions, v4.2 uses learned temporal mixing weights: each timestep gets a softmax-normalized importance score, Q/K are formed as weighted sums over time, and RoPE is applied before the resonance matrix is scored. The resonance score between positions is then the dot product of these temporally mixed representations.
To save memory, only the Top-K = min(64, S) resonance scores per query position are kept. All others are zeroed out before softmax, making attention 87.5% sparse for typical sequence lengths.
# Project to Q, K, V – then spike Q and K independently
q_spikes, _ = lif_q(W_q(x)) # (T=10, B×S, D) binary spikes
k_spikes, _ = lif_k(W_k(x)) # (T=10, B×S, D) binary spikes
# FIX E: Learned temporal mixing (not naive T*Dh flattening)
tw_q = softmax(temporal_mix_q) # (T,) learned per-timestep weight
tw_k = softmax(temporal_mix_k) # (T,) learned per-timestep weight
q = (q_spikes * tw_q.view(-1, 1, 1)).sum(dim=0) # weighted sum over time
k = (k_spikes * tw_k.view(-1, 1, 1)).sum(dim=0) # preserves spike timing semantics
# reshape to heads (B, H, S, Dh) and apply RoPE before scoring
resonance = q @ k.transpose(-2, -1) # (B, H, S, S)
# Causal mask (no peeking at future tokens)
resonance.masked_fill_(future_mask, float("-inf"))
# Top-K Sparsity: keep only the top-K = min(64, S) scores per query row
sparse_res = torch.full_like(resonance, float("-inf"))
sparse_res.scatter_(-1, top_k_indices, top_k_values) # the rest stays -inf → 0 after softmax
attn = softmax(sparse_res)
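The quoted 87.5% follows directly from the Top-K cap; a quick arithmetic helper (hypothetical name, and ignoring the additional zeros contributed by the causal mask):

```python
def attention_sparsity(seq_len: int, k_cap: int = 64) -> float:
    """Fraction of each query row zeroed out by Top-K = min(k_cap, seq_len)."""
    k = min(k_cap, seq_len)
    return 1.0 - k / seq_len

# At S = 512, each row keeps 64 of 512 scores → 87.5% sparse.
# At S <= 64, K = S and Top-K removes nothing.
```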
The visualization shows the sparse resonance pattern with K = min(64, S), after temporal mixing and RoPE. RoPE itself is part of the code path but is not visualized in this small canvas.
Spike-Driven Mixture of Experts
In the association zone (blocks 3–4), each token is routed to 2 of 4 specialized experts based on its spike-rate pattern. Unlike standard MoE that routes on dense token embeddings, Nord's router computes per-cluster firing rates from the binary spike output, groups them by expert, and selects the top-2 highest-scoring experts per token.
Each expert is a full up → LIF → down → LIF sub-network. Dispatch is vectorized: the code loops over 4 experts (not 2048 tokens), extracting each expert's assigned token batch in one operation. A load-balancing loss (0.01 × n_experts × Σ(freq × prob)) prevents expert collapse, ensuring all 4 experts stay utilized.
# Route LIF: spike the input to get binary patterns
route_spikes, _ = route_lif(x) # (T, N, D)
# Per-cluster firing rates → grouped by expert
cluster_rates = scatter_add(route_spikes.mean(0), cluster_ids) / cluster_size
expert_scores = cluster_rates.reshape(N, 4, 16).mean(-1) # 64 clusters → 4 experts
# Top-2 expert selection + softmax weights
top_scores, top_idx = topk(expert_scores, k=2)
weights = softmax(top_scores) # (N, 2)
# Vectorized dispatch: loop over 4 experts, not N tokens
for e in range(4):
    active = (top_idx == e).any(dim=-1)          # tokens that selected expert e
    w_e = weights[active][top_idx[active] == e]  # each token's softmax weight for e
    output[:, active] += expert_e(x[:, active]) * w_e
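The load-balancing term quoted above (0.01 × n_experts × Σ(freq × prob)) can be sketched directly. Here, as is conventional for this style of auxiliary loss, freq is the fraction of tokens dispatched to each expert and prob the router's mean probability for it (a sketch under those assumptions, not the repo's actual function):

```python
def load_balance_loss(freq, prob, coef=0.01):
    """Auxiliary loss penalizing routing collapse: freq[e] is the fraction
    of tokens sent to expert e, prob[e] the mean router probability for e."""
    n_experts = len(freq)
    return coef * n_experts * sum(f * p for f, p in zip(freq, prob))

balanced = load_balance_loss([0.25] * 4, [0.25] * 4)           # 0.01: the minimum
collapsed = load_balance_loss([1.0, 0, 0, 0], [1.0, 0, 0, 0])  # 0.04: 4x worse
```

A fully collapsed router pays four times the balanced penalty, which is exactly the pressure that keeps all 4 experts utilized.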
Memory Cortex – Gated Temporal Memory
Between the association blocks and the executive zone, Nord maintains a 128-slot persistent memory bank. Unlike attention (which is reset per forward pass), the Memory Cortex's LIF neurons use tau_mem = 0.99, a near-zero per-step decay that retains information across many tokens.
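How much difference does the larger time constant make? A quick retention check (the 100-step horizon is illustrative, not a figure from the source):

```python
def retention(tau: float, steps: int) -> float:
    """Fraction of an initial membrane value surviving `steps` leaky updates
    (each step multiplies the retained state by tau)."""
    return tau ** steps

# After 100 timesteps: a tau=0.99 memory neuron keeps roughly 37% of its
# state, while a standard tau=0.85 neuron keeps under one millionth.
```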
The write path is persistent and direct: input is projected into memory space and integrated by memory_lif. Separately, a gate path projects the same input through gate_lif, averages its activity across time, and converts that into a read gate with a learned gate_threshold = 0.3. Four attention read heads then read across the full memory membrane trace over all timesteps, and the gated readout is mixed back into the main signal through the learnable memory_mix parameter.
# Write: project input → persistent memory LIF (τ=0.99)
mem_spikes, v_mem = memory_lif(to_memory(x)) # (T, N, 128)
# Gate: build a read gate from gate spikes, separate from writes
gate_sig = gate_spikes.mean(dim=0)
gate = sigmoid((gate_sig - gate_threshold) * 10)
# Read: 4-head temporal attention over ALL timesteps
attn = softmax(read_query @ read_key(v_mem)) # (T, N, 4)
mem_read = (v_mem * attn).sum(dim=0) * gate # gated readout
# Mix back into main signal
x = x + memory_mix * from_memory(mem_read) # mix ≈ 0.1
In the visualization, the memory_lif membrane trace (τ = 0.99) barely decays compared with the normal τ = 0.85 reference trace.
Reward-Modulated STDP – Defined in Code, Inactive in v4 Path
In nord_v4_models the STDPEngine still exists, but the current NordModel.forward() and chat_v4.py path do not call those updates.
Nord v3 could update its own weights during inference using Spike-Timing-Dependent Plasticity (STDP). The current v4 codebase still contains that bounded STDP engine, now restricted to the executive zone with tighter weight bounds (w_min = -0.15, w_max = 0.5) and a max_update_norm = 0.01 clamp. This section documents the mechanism that remains in code, not an active stage in the latest forward path.
If re-enabled, standard STDP would still be blind to whether the network is actually improving. Nord addresses that with a Reward Signal: the final weight update is multiplied by 2 × sigmoid(loss_EMA − current_loss) − 1, which approaches +1 when predictions are improving (reinforcing LTP) and −1 when they're worsening (reversing into LTD).
for t in range(T):
    # Decaying pre- and post-synaptic eligibility traces
    trace_pre = decay_plus * trace_pre + pre_spikes[t]     # tau_plus = 20
    trace_post = decay_minus * trace_post + post_spikes[t] # tau_minus = 20
    # LTP: post fires → reinforce connections from active pre neurons
    dW += a_plus * outer(post_spikes[t], trace_pre)        # a_plus = 0.005
    # LTD: pre fires → weaken connections to recently-active post
    dW -= a_minus * outer(trace_post, pre_spikes[t])       # a_minus = 0.005
# Reward modulation: aligns local Hebbian updates with global LM loss
reward = sigmoid(loss_EMA - current_loss)  # → 1.0 improving, → 0.0 worsening
dW_final = dW * (2.0 * reward - 1.0)       # maps to (−1, +1)
# Magnitude bound + zone check (FIX F: executive zone only)
if dW_final.norm() > max_update_norm:      # max_update_norm = 0.01
    dW_final = dW_final * (max_update_norm / dW_final.norm())
layer.weight += dW_final
layer.weight.clamp_(-0.15, 0.5)            # w_min=-0.15, w_max=0.5
LeakyClamp – Keeping the Sub-Threshold Ghost Alive
In virtually every modern neural network, ReLU is the non-linearity: for any positive input, pass it through; for any negative input, return exactly zero. For a traditional classification network this is fine. For a spiking network it is fatal.
A LIF neuron's natural resting voltage is v_reset = -0.1 – intrinsically negative. ReLU annihilates this: the entire sub-threshold state, which carries rich information about how close a neuron is to firing, becomes identically zero. Gradients cannot flow through zero; neurons can never recover from silence.
In the sensory and association zones, Nord uses LeakyClamp. For positive values it's identity; for negative values it applies a learned per-channel leak slope (≈0.1) – gently compressing negative signals down to a learnable floor (initialized at −0.1). The sub-threshold ghost is preserved.
The executive zone is the exception: it uses force_nonneg=True, which applies standard F.relu(x) instead. This intentionally kills negative spike propagation before the readout, ensuring only clean positive activations reach the output head.
# FIX M: Executive zone forces non-negative output
if self.force_nonneg:
    return F.relu(x)  # executive: no negatives allowed
# Sensory / Association zones: preserve sub-threshold state
neg_part = (self.leak * x).clamp(min=self.floor)  # leak ∈ (0,1)
return torch.where(x >= 0, x, neg_part)  # floor ≈ -0.1
# leaky_clamp(-0.1) → 0.1*(-0.1) = -0.01 (preserved!)
# executive(-0.1)   → relu(-0.1)  =  0.0 (clean output)
In the visualization, the executive zone applies force_nonneg=True (F.relu) to prevent negative spike contamination before readout – the same shape as ReLU, but by design, not by accident.
EMA Temporal Readout – From Spikes to Floats
After 6 zonal blocks plus the Memory Cortex, the network holds 10 timesteps of scattered 1-bit binary spikes. How do you accurately predict which of 128,000 English words comes next from scattered ones and zeros? You don't.
The final readout_lif passes the spike stream through one last LIF step to accumulate its continuous 16-bit membrane potential tensor v_membrane (10 × S × 496). Rather than doing a simple mean over timesteps (which discards temporal ordering), Nord applies an Exponential Moving Average, giving the most recent timestep the highest weight. The learnable decay α ≈ 0.8 means the contribution of timestep t scales as (1−α)·α^(9−t).
This hybrid readout (smoothed membrane potential + mean spike rate) bypasses the fundamental 1-bit bottleneck that makes SNN language modeling "impossible."
# Readout LIF: spikes AND membrane potential are returned
readout_spikes, v_membrane = readout_lif(x_flat) # (T=10, B×S, D=496)
# EMA Temporal Smoothing Readout
alpha = sigmoid(readout_ema_raw) # learnable α ≈ 0.80
ema = zeros(B*S, D)
for t in range(T_total):  # t = 0..9
    ema = alpha * ema + (1 - alpha) * v_membrane[t]
# Weight of v[t] in final ema = (1-α)·α^(9−t)
# t=9 → 0.200 (most recent, highest) t=0 → 0.027 (oldest, lowest)
# Hybrid: smoothed membrane + spike rate → LM head → 128k logits
readout = ema.reshape(B,S,D) + readout_spikes.mean(dim=0).reshape(B,S,D)
logits = lm_head(layernorm(readout)) # (B, S, 128256)
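The per-timestep weights quoted in the comments above can be verified in a couple of lines (taking α = 0.8 exactly, versus the learnable ≈ 0.80 in the model):

```python
def ema_weight(alpha: float, t: int, T: int = 10) -> float:
    """Weight of v_membrane[t] in the final EMA after T updates from zero."""
    return (1.0 - alpha) * alpha ** (T - 1 - t)

recent = ema_weight(0.8, 9)  # 0.200  (most recent, highest)
oldest = ema_weight(0.8, 0)  # ~0.027 (oldest, lowest)
# The ten weights sum to 1 - alpha**10 ≈ 0.893; the missing mass is the
# discount applied to the all-zeros initial EMA state.
```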
The visualization traces the smoothed EMA(v_mem), the mean spike contribution, and the final EMA(v_mem) + mean(readout_spikes) vector that is fed into the 128k-vocab language head.
The Complete Forward Pass
Everything assembled: one token enters at the top as an integer index. It exits at the bottom as next-token logits over 128,256 vocabulary entries. Every stage has a dedicated visualization above. This final view shows them all connected, with flowing particles representing the data tensor as it transforms through Nord's architecture.
Notice how the tensor shape changes at each critical stage: [1] (token) → [496] (embed) → [10×496] (temporal current) → [10×496] (input spikes) → [10×496] (sensory) → [10×496] (association / MoE) → [10×496] (memory mix-back) → [10×496] (executive) → [10×496] (readout membrane + spikes) → [496] (hybrid readout) → [128k] (logits).
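The chain can be traced with placeholder arrays, shapes only (numpy stand-ins with no real weights; an illustrative sketch, not the model's code):

```python
import numpy as np

D, T, V = 496, 10, 128256  # model width, timesteps, vocab size

embed = np.zeros(D)                          # [496]    token embedding
current = np.tile(embed, (T, 1))             # [10x496] injected temporal current
spikes = (current > 0.12).astype(np.int8)    # [10x496] binary spike stream
# ... sensory / association / memory / executive blocks all keep [10x496] ...
readout = spikes.mean(axis=0) + np.zeros(D)  # [496]    hybrid readout
logits = np.zeros(V)                         # [128256] next-token logits

for name, arr, shape in [("embed", embed, (D,)), ("current", current, (T, D)),
                         ("readout", readout, (D,)), ("logits", logits, (V,))]:
    assert arr.shape == shape, name
```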