Architecture comparison

Self-Attention vs RNN vs CNN

Compare how sequence order, receptive field, parallelism, and context flow differ across the three families.

Three ways to move information through a sequence — Image 1: RNNs pass information step by step, CNNs expand from local windows, and self-attention connects tokens directly.

Background

Natural language, time-series, and other sequential data are not just lists of independent items. The meaning of a word like it in a sentence depends on surrounding words.

Without a way to share information across positions, a model cannot resolve pronouns, capture long-term dependencies, or detect patterns that unfold over time.

RNNs, CNNs, and self-attention all answer the same question: how should information move across a sequence?

Idea

RNNs use sequential memory. Information moves step by step, and each hidden state carries a running summary of what has been seen so far.

CNNs use local window expansion. Convolutional filters first read nearby tokens, and deeper layers expand the receptive field.

Self-attention uses direct global interaction. Each token compares itself with every other token and decides which positions matter most.

Formula

\[h_t=f\left(W_xx_t+W_hh_{t-1}+b\right)\]

RNNs update a hidden state from the current input and the previous hidden state.

\[y_t=\sum_{i=-k}^{k}W_i\,x_{t+i}\]

CNNs combine tokens inside a local sliding window.

\[\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Self-attention compares queries and keys, then combines values using learned weights.

\[\alpha_{ij}=\text{softmax}(q_i\cdot k_j),\quad z_i=\sum_j\alpha_{ij}v_j\]

For one token, attention weights decide how much each other token contributes to its context vector.

Symbols

x_t: input vector at position t.
h_t: hidden state at position t in an RNN.
W_x and W_h: learnable weight matrices for input and hidden state.
y_t: output of a CNN at position t.
W_i: convolution filter weight for offset i.
Q, K, V: query, key, and value matrices.
q_i, k_j, v_j: query for token i and key/value for token j.
d_k: dimension of the key vectors.
alpha_ij: attention weight from token i to token j.
z_i: output context vector at position i.

Example

Consider the sentence: The animal did not cross the street because it was tired. The word it refers to the animal, not the street.

An RNN carries information forward through hidden states. By the time the model reaches it, earlier words influence the hidden state indirectly, but their influence can fade as the sequence grows.

A CNN with a small kernel first sees local neighbours such as because it was. To connect it to animal, the model needs stacked layers or larger receptive fields.

Self-attention lets the token it directly assign high weight to animal and lower weight to unrelated words. This direct link makes long-distance dependencies easier to capture.

Workflow

RNN: read tokens one by one, update the hidden state, and use the accumulated state to contextualize the target token.
CNN: slide a kernel over local windows, compute local combinations in parallel, and stack layers to expand the receptive field.
Self-attention: compute query, key, and value vectors for each token, compare all token pairs, softmax the scores, and combine value vectors into context vectors.
For self-attention, positional information must be added so the model knows which token came first.

Pros & Cons

Method	Trade-off
RNN	Natural for ordered and streaming data, but difficult to parallelize and weaker on very long dependencies.
CNN	Efficient and good at local patterns, but distant dependencies require depth, dilation, or larger kernels.
Self-attention	Directly links all token pairs and parallelizes well, but has quadratic cost and needs positional encoding.

Takeaway

RNNs, CNNs, and self-attention are three complementary ways to build context in sequence data.

An RNN carries information step by step, a CNN expands from local windows through depth, and self-attention lets each token look directly at every other token.

The right choice depends on the task, the sequence length, and computational constraints.