Understanding Transformers in NLP

Saurabh Harak

If you’ve been keeping an eye on the advancements in Natural Language Processing (NLP) over the past few years, you’ve undoubtedly heard about transformers. Since their introduction in 2017, transformers have dramatically reshaped the NLP landscape. Before they came onto the scene, the go-to architectures for processing sequential data were Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). While these models were pivotal in advancing NLP, they had inherent limitations — particularly in capturing long-range dependencies and efficiently processing sequences of data.

In a groundbreaking paper titled “Attention Is All You Need,” a team at Google Brain proposed the transformer architecture. This novel approach leverages attention mechanisms to process data, effectively addressing the shortcomings of RNNs and LSTMs. The introduction of transformers was a seismic shift in NLP, enabling more effective training and better performance across various applications.

Why Transformers Are Important

Transformers have become the cornerstone of numerous advancements in NLP, enabling major improvements in tasks such as:

  • Machine Translation: The transformer architecture was originally designed for translation, and transformer-based systems have since delivered major gains in translation quality and speed.
  • Text Summarization: They’ve enabled the generation of coherent, contextually accurate summaries, making large volumes of information more digestible.
  • Question-Answering Systems: By understanding context deeply, transformers have significantly enhanced question-answering models, making them capable of retrieving precise answers from massive datasets.
  • Language Modeling: With models like GPT-3, transformers have pushed the boundaries of text generation, opening new possibilities for creative content generation, conversational agents, and more.

The impact of transformers extends beyond NLP. For instance, the Vision Transformer (ViT) has demonstrated that transformers can be effectively adapted for image recognition tasks, achieving results on par with traditional convolutional neural networks (CNNs).

What Is a Transformer?

At its core, a transformer is a neural network architecture that relies solely on attention mechanisms to process sequences of data. Unlike traditional RNNs or LSTMs, which process input data sequentially, transformers handle entire sequences simultaneously. This ability to process data in parallel makes transformers much more efficient, allowing them to scale to larger datasets and model sizes.

Transformer Architecture Overview

Transformer architecture. Image source: https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)

Simplified architecture of the Transformer. Image source: The Math Behind Attention Keys, Queries, and Values Matrices

Understanding the Transformer Architecture

To fully grasp how transformers have revolutionized NLP, it’s essential to dive into their architecture. We’ll walk through each component step by step, using the following structure:

  1. Tokenization
  2. Embeddings
  3. Positional Encoding
  4. Transformer Blocks
  • Self-Attention Mechanism
  • Feedforward Neural Networks
  5. Softmax Layer

1. Tokenization

Before any processing can take place, the input text must be broken down into manageable units — a process known as tokenization. Tokens can be individual words, subwords, or even characters, depending on the model and task. Tokenization ensures that the transformer can handle the input text as a sequence of understandable units.

For example, consider the sentence:

“The quick brown fox jumps over the lazy dog.”

Tokenization might split this into individual words:

  • “The”
  • “quick”
  • “brown”
  • “fox”
  • “jumps”
  • “over”
  • “the”
  • “lazy”
  • “dog”

In some models, especially those dealing with large vocabularies or subword units, tokens might be further broken down. This process helps the model handle rare words and capture meaningful subword patterns.
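To make this concrete, here is a minimal sketch of word-level tokenization in Python. The regular expression is an illustrative stand-in; production tokenizers handle contractions, Unicode, and subwords far more carefully.

```python
import re

# A toy word-level tokenizer: keep runs of word characters as tokens and
# split punctuation off as separate tokens.
def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```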

2. Embeddings

After tokenization, each token is transformed into a numerical vector through a process called embedding. Embeddings convert discrete tokens into continuous vector spaces, allowing the model to understand the semantic meaning of the text.

One popular method for creating embeddings is Word2Vec, which maps words into vectors that capture their semantic relationships. In transformers, embeddings are learned during training and represent each token’s meaning in a high-dimensional space. These vectors help the model grasp similarities between words based on context.

For instance, the words “king” and “queen” might have embeddings that are close to each other in the vector space, reflecting their related meanings.

3. Positional Encoding

Transformers process input sequences in parallel rather than in order, which can cause them to lose the sense of word order. To address this, positional encoding is added to the embeddings to provide information about the position of each token in the sequence.

Positional encoding can be implemented using sinusoidal functions that produce unique encodings for each position. By adding these positional encodings to the token embeddings, the model retains information about the order of the tokens.

This is crucial for understanding sentences like:

  • “The cat sat on the mat.”
  • “On the mat sat the cat.”

While the words are the same, their order changes the meaning. Positional encoding ensures that the transformer can distinguish between these nuances.

4. Transformer Blocks

The core processing happens within transformer blocks, which consist of layers that alternate between attention mechanisms and feedforward neural networks. Each block refines the model’s understanding through two key operations:

a. Self-Attention Mechanism

The self-attention mechanism allows each token to focus on other tokens in the sequence to understand context. It computes a weighted sum of the embeddings of all tokens, where the weights are determined by the similarity between tokens.

For example, in the sentence:

“She opened the door because her hands were full.”

The word “she” needs to be associated with “her” to understand that they refer to the same person. Self-attention helps the model capture these relationships by assigning higher weights to relevant tokens.

Transformers use multi-head attention, meaning multiple attention mechanisms operate in parallel. Each “head” can focus on different parts of the sequence, enriching the model’s ability to understand complex relationships.

b. Feedforward Neural Networks

After the attention mechanism has captured the relationships between tokens, the sequence passes through feedforward neural networks. These layers apply non-linear transformations to each token independently, helping the model capture more abstract patterns in the data.

The feedforward networks enhance the model’s representational capacity, allowing it to learn complex mappings from input to output.

5. Softmax Layer

After passing through multiple transformer blocks, the processed data reaches the Softmax layer. This layer converts the model’s output into probabilities.

The Softmax function normalizes the output scores, ensuring they sum to 1, and assigns a probability to each possible outcome. In language generation tasks, for example, the Softmax layer helps the model predict the next word by assigning probabilities to various candidates. The word with the highest probability is selected as the prediction.

Consider predicting the next word in the phrase:

“The sky is clear and the sun is…”

The Softmax layer might assign probabilities like:

  • “shining” → 0.8
  • “setting” → 0.1
  • “bright” → 0.05
  • “hidden” → 0.05

Here, “shining” has the highest probability and would be the model’s prediction.

Core Concepts in Transformer Architecture

1. Tokenization

What Is Tokenization?

Tokenization is the crucial first step in text processing and modeling (and, through decoding, the final one as well), especially in the realm of machine learning and natural language processing (NLP). When working with textual data, computers can’t interpret raw text as humans do. Text must therefore be represented as numbers for models to process and understand it. This is where tokenization comes into play: it breaks down text into manageable pieces called tokens, each of which is assigned a numerical representation or index that can be fed into a model.

Why Is Tokenization Important?

Text is inherently complex and full of nuances. A word can have different meanings depending on its context, and languages have various syntactic structures. To overcome these challenges, tokenization ensures that the text is transformed into a format that machine learning models can work with effectively.

By breaking down text into tokens, models can:

  • Handle Vocabulary Variations: Manage synonyms, antonyms, and homonyms by understanding words in context.
  • Reduce Complexity: Simplify the input data, making it computationally feasible to process large amounts of text.
  • Capture Meaning: Preserve the semantic and syntactic structure necessary for tasks like translation, sentiment analysis, and text generation.

The Tokenization Process in Large Language Models (LLMs)

In the workflow of a large language model (LLM), tokenization is the gateway to processing and understanding text. Here’s a step-by-step overview of how it works:

Image source: https://docs.mistral.ai/guides/tokenization/

1. Encoding the Input Text
The first step is encoding the input text into tokens using a tokenizer. The tokenizer breaks down the text into smaller units such as words, subwords, or characters, depending on the granularity of the tokenization approach used.

For example, consider the sentence: “Transformers are revolutionizing NLP.” A tokenizer might split this into tokens like:

  • “Transformers”
  • “are”
  • “revolution”
  • “izing”
  • “NLP”
  • “.”

2. Passing Tokens Through the Model
Once the text has been tokenized, the tokens are sent through the model, which typically consists of an embedding layer followed by transformer blocks.

  • Embedding Layer: Converts the tokens (which are just numbers at this point) into dense vectors that capture the semantic meaning of the tokens. Similar words or concepts will have vectors closer to each other in the vector space.
    For instance, the words “happy” and “joyful” might have embeddings that are close together, reflecting their related meanings.
  • Transformer Blocks: These blocks process the embeddings, analyzing the relationships and context between different tokens using self-attention mechanisms and feedforward neural networks. This helps the model make sense of the sequence of tokens and provides context for generating accurate results.

3. Decoding the Output
After the model processes the input and generates results, the final step is decoding. This process involves taking the output tokens and converting them back into human-readable text by mapping the tokens back to their corresponding words or characters using the tokenizer’s vocabulary.
For example, if the model outputs the tokens:

  • Index 120: “They”
  • Index 45: “are”
  • Index 678: “innovating”
  • Index 99: “rapidly”
  • Index 2: “.”
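Decoding these indices through the tokenizer’s vocabulary reassembles the sentence “They are innovating rapidly.” The sketch below mirrors that mapping with a hypothetical, hand-made vocabulary; a real tokenizer stores this mapping internally.

```python
# Hypothetical ID-to-token mapping, mirroring the example above.
id_to_token = {120: "They", 45: "are", 678: "innovating", 99: "rapidly", 2: "."}

def decode(token_ids: list[int]) -> str:
    text = ""
    for token_id in token_ids:
        word = id_to_token[token_id]
        # Attach punctuation directly to the preceding word; otherwise add a space.
        sep = "" if word in {".", ",", "!", "?"} or not text else " "
        text += sep + word
    return text

print(decode([120, 45, 678, 99, 2]))  # They are innovating rapidly.
```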

Types of Tokenization

There are several approaches to tokenization, each suitable for different languages and tasks:

1. Word-Level Tokenization: Splits the text into individual words based on spaces and punctuation. This method works well for languages where words are clearly separated by spaces, such as English.
Example:

  • Input: “Machine learning is fascinating.”
  • Tokens: [“Machine”, “learning”, “is”, “fascinating”, “.”]

2. Subword Tokenization: Breaks down words into smaller units called subwords. It’s particularly useful for handling unknown words and rare terms. Methods like Byte-Pair Encoding (BPE) or WordPiece are common for subword tokenization (see the sketch after this list).
Example using BPE:

  • Input: “unbelievable”
  • Tokens: [“un”, “believ”, “able”]

3. Character-Level Tokenization: Breaks down the text into individual characters. It’s often used when dealing with languages that don’t use spaces (like Chinese) or when handling typos and misspellings.
Example:

  • Input: “Data”
  • Tokens: [“D”, “a”, “t”, “a”]
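As a practical illustration of subword tokenization, the snippet below uses the Hugging Face transformers library (assuming it is installed) to load a pretrained WordPiece tokenizer; the exact sub-tokens you get depend on the model’s learned vocabulary, so treat the outputs as examples rather than guarantees.

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (its vocabulary was learned during pretraining).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words may be split into known sub-pieces (continuation pieces are prefixed with "##").
print(tokenizer.tokenize("unbelievable"))
print(tokenizer.tokenize("Transformers are revolutionizing NLP."))
```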

The Role of Tokenization in Model Performance

Tokenization directly impacts how a model learns and generates text:

  • Contextual Understanding: A well-tokenized dataset allows the model to focus on relevant parts of the text while better understanding the context.
  • Vocabulary Size: Tokenization affects the size of the vocabulary the model must learn. A larger vocabulary from word-level tokenization requires more memory and computational power. Subword tokenization reduces the vocabulary size while still handling a wide range of texts.
  • Handling Rare Words: Subword and character-level tokenization help the model manage rare or unseen words by breaking them into known sub-components.

Poor tokenization can lead to:

  • Loss of Meaning: If tokens are not appropriately segmented, the model might miss essential contextual cues.
  • Increased Complexity: An excessively large or small vocabulary can hinder the model’s ability to generalize and learn effectively.

2. Embeddings

What Are Embeddings?

Embeddings are one of the foundational concepts in modern NLP. They provide a way for machines to understand and process human language by representing words or phrases as numerical vectors in a high-dimensional space. Instead of treating words as isolated units, embeddings map each word to a specific point in this space, where its position reflects its meaning and usage in the language. This allows words with similar meanings to be placed closer together, helping models understand the relationships between words based on context.

For instance, consider the word “apple.” It can refer to a fruit or a technology company. The context in which the word is used helps determine its intended meaning, and embeddings play a crucial role in interpreting this by placing the word appropriately in the vector space.

Purpose of Embeddings

The main goal of embeddings is to capture the semantic relationships between words. Words that are frequently used in similar contexts tend to be located near each other in the embedding space. By placing semantically related words closer together, embeddings allow machine learning models to better understand the underlying meaning of words and phrases. This leads to improved performance in various NLP tasks like translation, summarization, and question-answering.

Visualizing Embeddings: A Simplified Example

To grasp how embeddings work, imagine the high-dimensional space as a two-dimensional plane. Although in reality, this space has hundreds or even thousands of dimensions, a 2D analogy helps us visualize the concept.

Image source: The math behind Attention Keys, Queries, and Values matrices

Clusters of Meaning

In this space, clusters of words form based on their semantic similarity:

  • Fruit Cluster: Words like “strawberry,” “orange,” “banana,” and “cherry” are grouped together because they are all fruits. They might be located in the same region of the plane.
  • Technology Cluster: In another part of the plane, words like “Microsoft,” “Android,” “laptop,” and “phone” cluster together, representing the technology domain.

This clustering occurs because embeddings are designed to capture patterns in language. Words that often appear in similar contexts are located near each other, reflecting their similarity in meaning.

The Ambiguity of “Apple” and the Role of Context

A common challenge in NLP is dealing with ambiguous words — those that have multiple meanings. For example, the word “apple” could refer to the fruit or the technology company, depending on the context. Without understanding the surrounding words, it’s unclear where to place “apple” in the embedding space. This is where context becomes essential.

Problem with “Apple”

Image source: The math behind Attention Keys, Queries, and Values matrices

When the word “apple” appears on its own, it’s ambiguous. Should it be placed in the fruit cluster or the technology cluster in the embedding space? Without context, there’s no clear answer.

Solution: Using Context to Disambiguate

To resolve this, embedding models rely on the context in which a word appears to determine its meaning.

  • Fruit Context: Consider the sentence: “Please buy an apple and an orange.” Here, the presence of the word “orange” (a fruit) provides context. Based on this, “apple” is interpreted as a fruit, and its embedding shifts closer to the other fruits in the space.

  • Technology Context: Now take the sentence: “Apple unveiled a new phone.” In this case, the word “phone” provides technological context, signaling that “apple” refers to the company. The “apple” vector is thus pulled closer to other technology-related words.

Dynamic Embedding Adjustments

The embedding of a word like “apple” is dynamically adjusted based on the surrounding context. This movement in the embedding space is critical for accurate language understanding. Contextual embeddings, used in models like BERT and GPT, are designed to capture this dynamic adjustment, ensuring that a word’s meaning is interpreted correctly based on the surrounding text.

Image source: The math behind Attention Keys, Queries, and Values matrices

Gravitational Pull Analogy

One helpful way to visualize how words influence each other in the embedding space is to use the metaphor of gravitational pull. Words that are similar in meaning exert a stronger pull on each other, drawing their vectors closer together, much like how gravity pulls objects toward one another.

Strong Gravitational Pull

Words that are highly related have a strong semantic similarity, creating a strong gravitational pull between their embeddings.

  • Example: In the sentence “I bought an apple and an orange,” the words “apple” and “orange” are both fruits. Their embeddings exert a strong gravitational pull on each other, pulling them closer in the fruit cluster.

Weak Gravitational Pull

On the other hand, words that are not semantically related have less influence on each other.

  • Example: In the sentence “Please hand me an apple,” words like “please” and “hand” exert a weak gravitational pull on “apple” because they are function words with minimal semantic connection to the concept of a fruit or a company.

Influence of Multiple Contextual Words

Context in language includes all surrounding words. Each word contributes to shifting a word like “apple” towards its correct meaning in a given sentence.

Consider the sentence:

“I ate a banana, strawberry, lemon, blueberry, and an apple.”

In this sentence, multiple fruit names are mentioned, creating a strong fruit-related context. The combined gravitational pull from these words strongly shifts the embedding of “apple” towards the fruit cluster. This cumulative effect ensures that the model confidently interprets “apple” as a fruit.

Rethinking Distance and Similarity in the Embedding Space

In the embedding space, distance represents semantic similarity. Words that are closer together are more similar in meaning, while words that are far apart have little in common.

Concept of Similarity

Similarity metrics, such as cosine similarity, measure how close two vectors are in the embedding space. A high similarity score indicates that two words share similar meanings, while a low score suggests they are semantically different.

  • High Similarity: Words like “dog” and “puppy” would have a high cosine similarity because they are closely related.
  • Low Similarity: Words like “dog” and “table” would have low similarity, as they are unrelated concepts.

The gravitational pull between words is proportional to their semantic similarity, allowing models to understand complex relationships in language.
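A minimal sketch of cosine similarity over toy vectors illustrates the idea; the 3-dimensional embeddings below are invented for the example, whereas real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: values near 1.0 mean very similar directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d "embeddings", for illustration only.
dog   = np.array([0.8, 0.6, 0.1])
puppy = np.array([0.7, 0.7, 0.2])
table = np.array([-0.1, 0.2, 0.9])

print(cosine_similarity(dog, puppy))  # high: closely related meanings
print(cosine_similarity(dog, table))  # low: unrelated concepts
```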

3. Positional Encoding

Why Positional Encoding Is Necessary

One of the unique features of transformers is their ability to process input sequences in parallel rather than sequentially, as is the case with Recurrent Neural Networks (RNNs) and LSTMs. While this parallel processing makes transformers highly efficient and capable of capturing long-range dependencies, it introduces a challenge: transformers lack an inherent way to capture the order of words in a sequence.

In natural language, the order of words is crucial for conveying meaning. Simply knowing which words are present isn’t enough — their sequence dictates the sentence’s interpretation.

Example:

  • “Write a story.” vs. “Story I write.”

The first sentence is a clear command, while the second is grammatically incorrect and lacks coherent meaning. Without a mechanism to account for word order, transformers might struggle to understand and generate meaningful text.

How Positional Encoding Works

Positional encoding addresses this problem by introducing information about the position of each word in the sequence into the model. Here’s how it works:

1. Initial Word Embeddings

Each word in the vocabulary is represented by a high-dimensional vector that captures its semantic meaning. Without positional encoding, sentences with the same words but different orders would have identical embeddings, leading to ambiguity in interpretation.

2. Perturbing Embeddings Based on Position

To introduce positional information, the model modifies each word’s embedding using a unique positional encoding vector based on its position in the sequence.

Process:

  • For each word in the sequence, a positional encoding vector is added to its original embedding.
  • This new vector encodes both the semantic meaning of the word and its position in the sentence.

Visualization: Imagine arrows representing the direction and magnitude of perturbation for each word’s embedding:

Image source: The math behind Attention Keys, Queries, and Values matrices

  • The first word might be shifted to the right.
  • The second word might be shifted upward.
  • The third word could be shifted diagonally, and so on.

This systematic perturbation ensures that the embeddings of the same word become distinct depending on its position in the sentence.

3. Different Embeddings for Different Sentences

Applying positional encoding means that sentences with the same words in different orders result in distinct sets of embeddings. This enables the model to interpret each sentence accurately according to its word order.

Characteristics of Positional Encoding

  • Non-Semantic Nature: Positional encoding does not carry any semantic meaning by itself. It is simply a mathematical way to inject positional information into word embeddings.
  • Sequence Patterns: The perturbations follow a specific pattern designed to be unique for each position. This allows the model to distinguish between different positions in a sequence.

Intuition Behind Positional Encoding

Think of positional encoding as adding a unique “stamp” to each word’s embedding that indicates its position in the sequence. This stamp alters the embedding just enough to differentiate it from the same word in a different position, without significantly affecting its semantic meaning.

For example, in the sentences:

  • “The cat sat on the mat.”
  • “On the mat sat the cat.”

Positional encoding allows the model to recognize that, although the words are the same, their positions have changed, leading to different meanings.

The Impact on Self-Attention Mechanism

In the self-attention mechanism, the model computes attention scores based on the embeddings of the words. By incorporating positional encodings, these embeddings now contain positional information, enabling the attention mechanism to consider word order when computing relationships between words.

This means the model can understand that “cat sat” is different from “sat cat,” even though the same words are involved.
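For readers who want to see the sinusoidal scheme in code, here is a minimal NumPy sketch of the encoding described in “Attention Is All You Need”; the toy dimensions (6 tokens, d_model = 8) are chosen only for illustration.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Even dimensions use sine, odd dimensions use cosine, with wavelengths
    # that grow geometrically across the embedding dimensions.
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                         # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Each token embedding simply has its position's encoding added to it.
token_embeddings = np.random.randn(6, 8)    # hypothetical: 6 tokens, d_model = 8
encoded = token_embeddings + positional_encoding(6, 8)
print(encoded.shape)                        # (6, 8)
```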

4. Transformer Blocks: The Heart of the Transformer Architecture

At the core of the transformer’s power lies the transformer blocks, which consist of layers that alternate between attention mechanisms and feedforward neural networks. Each block refines the model’s understanding through two key operations:

  1. Self-Attention Mechanism
  2. Feedforward Neural Networks

In this section, we’ll delve deep into the self-attention mechanism, including the concept of multi-head attention, and explore how it dynamically adjusts word embeddings based on context. We’ll refer back to examples introduced earlier, such as the ambiguous word “apple” and the gravitational pull analogy, to illustrate these concepts.

Attention Mechanism: Focusing on What’s Important

Understanding the Embedding Space

As we’ve discussed earlier, every word in a transformer model is represented as a vector in a high-dimensional embedding space. Words with similar meanings are located near each other, forming clusters that reflect semantic relationships.

  • Fruit Cluster: Words like “strawberry,” “orange,” “banana,” and “cherry” are grouped together because they are all fruits.
  • Technology Cluster: Words like “Apple” (as a company), “Microsoft,” “laptop,” and “smartphone” cluster together due to their association with technology.

However, ambiguous words like “apple” can belong to more than one cluster, depending on the context. This brings us to the importance of context in determining a word’s meaning.

The Challenge of Ambiguous Words: The Case of “Apple”

As previously mentioned, the word “apple” can refer to a fruit or a technology company. Without context, it’s ambiguous where “apple” should be placed in the embedding space.

  • Initial Placement: “Apple” might be situated somewhere between the fruit and technology clusters.

How Context and Attention Clarify Meaning

The self-attention mechanism allows the model to focus on relevant words in the context to determine the correct meaning of ambiguous words.

Revisiting Our Examples:

Fruit Context:

“Please buy an apple and an orange.”

  • The presence of “orange,” another fruit, provides context that “apple” refers to the fruit. Through self-attention, “apple” is influenced by “orange,” pulling its embedding towards the fruit cluster.

Technology Context:

“Apple unveiled a new smartphone.”

  • Here, words like “unveiled” and “smartphone” are associated with technology. The self-attention mechanism allows “apple” to be influenced by these words, shifting its embedding towards the technology cluster.

The Bear Example:

Consider the sentence: “The bear ate the honey because it was…”

Image Source: The Math Behind Attention Keys, Queries, and Values Matrices

The ambiguity lies in the pronoun “it.” Does “it” refer to “the bear” or “the honey”? The context and the word that completes the sentence help clarify the intended meaning.

Possible Completions:

“The bear ate the honey because it was hungry.”

Here, “it” refers to “the bear.”

“The bear ate the honey because it was delicious.”

In this case, “it” refers to “the honey.”

Applying Self-Attention:

Embedding the Words:

All words in the sentence are represented in the embedding space.

Calculating Attention Scores:

The pronoun “it” calculates attention scores with other words in the sentence.

Contextual Influence:

With “hungry”:

Image Source: The Math Behind Attention Keys, Queries, and Values Matrices

  • “Hungry” is semantically closer to “bear.” The attention mechanism assigns higher weights between “it,” “bear,” and “hungry,” pulling “it” towards the “bear” cluster in the embedding space.

With “delicious”:

Image Source: The Math Behind Attention Keys, Queries, and Values Matrices

  • “Delicious” is semantically closer to “honey.” The attention mechanism increases the weights between “it,” “honey,” and “delicious,” shifting “it” towards the “honey” cluster.

Result:

  • The self-attention mechanism dynamically adjusts the embedding of “it” based on the context provided by surrounding words, resolving the ambiguity.

Gravitational Pull Analogy

Recall our gravitational pull analogy. Words that are semantically related exert a stronger “pull” on each other in the embedding space. The self-attention mechanism quantifies this pull, allowing the model to adjust word embeddings based on contextual relevance.

  • Strong Pull: Semantically similar words (e.g., “bear” and “hungry” or “honey” and “delicious”) exert a strong influence on each other.
  • Weak Pull: Less related words have minimal influence on each other.

How Self-Attention Dynamically Adjusts Embeddings

The self-attention mechanism computes attention scores that determine how much each word should consider other words in the sequence.

Mechanism Overview:

  1. Query, Key, and Value Vectors:

For each word, the model generates three vectors:

  • Query (Q): Represents the word we’re focusing on.
  • Key (K): Represents each word in the sequence.
  • Value (V): Contains the information to be passed along.

2. Calculating Attention Scores:

  • The attention score between two words is calculated by taking the dot product of their query and key vectors.
  • These scores are then scaled and passed through a Softmax function to produce attention weights.

3. Updating Word Embeddings:

  • Each word’s embedding is updated by taking a weighted sum of the value vectors of all words, weighted by the attention weights (a code sketch of this computation follows the bear example below).

Applying to the Bear Example:

When “it” is Ambiguous:

  • “It” generates attention scores with “bear,” “honey,” and the completion word (“hungry” or “delicious”).

With Completion “hungry”:

  • Higher attention weights between “it,” “bear,” and “hungry.”
  • “It” embedding shifts towards “bear,” resolving the ambiguity.

With Completion “delicious”:

  • Higher attention weights between “it,” “honey,” and “delicious.”
  • “It” embedding shifts towards “honey.”
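Before moving on to multi-head attention, here is a minimal NumPy sketch of the query/key/value computation described above. The embeddings and weight matrices are random stand-ins for what a trained model would have learned, so the printed weights are illustrative only.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # query, key, and value vectors
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # scaled dot-product similarities
    weights = softmax(scores)                       # attention weights; each row sums to 1
    return weights @ V, weights                     # weighted sum of value vectors

rng = np.random.default_rng(0)
# Hypothetical embeddings for the 9 tokens of
# ["The", "bear", "ate", "the", "honey", "because", "it", "was", "hungry"].
X = rng.normal(size=(9, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))

updated, weights = self_attention(X, Wq, Wk, Wv)
print(weights[6].round(2))   # how strongly "it" (the 7th token) attends to every token
```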

Multi-Head Attention: Enhancing the Model’s Understanding

Within the self-attention mechanism, transformers use multi-head attention to allow the model to focus on different aspects of the relationships between words — all within the attention block.

What Is Multi-Head Attention?

Multiple Perspectives:

  • Instead of performing a single attention function, the model runs multiple attention mechanisms, or “heads,” in parallel within the same attention block.

Independent Attention Heads:

  • Each head has its own set of query, key, and value weight matrices.
  • This allows each head to capture different types of relationships and features from the input data.

Why Is Multi-Head Attention Important?

Capturing Diverse Relationships:

  • Different heads can focus on different linguistic aspects:
  • One head might capture syntactic dependencies.
  • Another might focus on semantic meanings or long-range dependencies.

Improved Representation:

  • By combining the outputs from multiple heads, the model constructs a richer, more nuanced representation of the input sequence.

Applying Multi-Head Attention to the Bear Example

To further illustrate how multi-head attention works within the attention block, let’s dive deeper into the bear example:

Sentence with Ambiguity:

“The bear ate the honey because it was…”

The pronoun “it” is ambiguous — it could refer to either “the bear” or “the honey.” The word that completes the sentence provides the necessary context to resolve this ambiguity.

Possible Completions:

1. “The bear ate the honey because it was hungry.”

  • Here, “it” refers to “the bear.”

2. “The bear ate the honey because it was delicious.”

  • In this case, “it” refers to “the honey.”

How Multi-Head Attention Resolves the Ambiguity

In the transformer model, multi-head attention allows different attention heads to focus on various aspects of the sentence, enabling the model to disambiguate the pronoun “it” based on context.

Step-by-Step Explanation:

  1. Tokenization and Embedding:

Tokenization: The sentence is broken down into tokens:

  • [“The”, “bear”, “ate”, “the”, “honey”, “because”, “it”, “was”, “hungry/delicious”]

Embedding: Each token is converted into an embedding vector that captures its semantic meaning.

2. Adding Positional Encoding:

  • Positional encodings are added to the embeddings to retain information about word order, crucial for understanding the sentence’s structure.

3. Self-Attention Mechanism with Multi-Head Attention:

Multiple Attention Heads:

  • Let’s assume we have three attention heads for simplicity.
  • Each head processes the embeddings differently, focusing on different relationships.

Head 1: Semantic Similarity Focus

Objective: Capture semantic relationships between words.

Process:

  • The pronoun “it” generates query vectors, while other words generate key and value vectors.
  • Compute attention scores between “it” and all other words.

Case with “hungry”:

  • “Hungry” is semantically associated with living beings like “bear.”
  • High attention weights are assigned between “it”, “bear”, and “hungry.”
  • The embedding of “it” is adjusted toward “bear.”

Case with “delicious”:

  • “Delicious” is associated with edible items like “honey.”
  • High attention weights are assigned between “it”, “honey”, and “delicious.”
  • The embedding of “it” shifts toward “honey.”

Head 2: Syntactic Structure Focus

Objective: Analyze grammatical relationships and sentence structure.

Process:

  • Focuses on the roles of words (subject, object, verb).
  • Understands that “bear” is the subject performing the action “ate.”

Effect:

  • Helps determine whether “it” is more likely to refer to the subject (“bear”) or the object (“honey”) based on syntax.

Head 3: Positional Proximity Focus

Objective: Consider the positions of words relative to “it.”

Process:

  • Words closer to “it” may have higher attention weights.
  • The word immediately following “it” (i.e., “was”) and the word after that (“hungry” or “delicious”) significantly influence the interpretation.

Effect:

  • Helps the model understand that the adjective following “was” is directly linked to “it.”

Combining the Outputs

  • The outputs from all three attention heads are concatenated.
  • A final linear transformation is applied to integrate the information from all heads (a minimal sketch of this step follows below).
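A minimal sketch of this combine-and-project step, with random matrices standing in for learned parameters: each head attends over the sequence in its own lower-dimensional subspace, and the head outputs are concatenated and passed through a final linear layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X: np.ndarray, num_heads: int, rng) -> np.ndarray:
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own projection matrices (random stand-ins here).
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)            # each head's view of the sequence
    Wo = rng.normal(size=(d_model, d_model))        # final output projection
    return np.concatenate(head_outputs, axis=-1) @ Wo

rng = np.random.default_rng(1)
X = rng.normal(size=(9, 16))                        # hypothetical: 9 tokens, d_model = 16
print(multi_head_attention(X, num_heads=4, rng=rng).shape)   # (9, 16)
```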

Feedforward Neural Network

  • The combined embeddings pass through a feedforward neural network.
  • Non-linear transformations capture complex patterns and interactions.

Final Interpretation

With “hungry” Completion:

  • High attention weights from Head 1 and Head 2 between “it,” “bear,” and “hungry.”
  • The model concludes that “it” refers to “the bear.”

With “delicious” Completion:

  • High attention weights from Head 1 and Head 2 between “it,” “honey,” and “delicious.”
  • The model determines that “it” refers to “the honey.”

Visualizing the Attention Weights

Attention Maps can illustrate how attention weights are distributed between words.

The Power of Multi-Head Attention

Parallel Processing:

  • Each attention head analyzes the sentence from a different perspective simultaneously.

Rich Contextual Understanding:

  • By integrating various types of information (semantic, syntactic, positional), the model builds a comprehensive understanding.

Disambiguation:

  • Multi-head attention effectively resolves ambiguities by considering multiple factors influencing word meaning.

Why Multi-Head Attention is Effective in This Example

Captures Multiple Relationships:

  • Head 1 focuses on semantic meaning, linking adjectives to the nouns they describe.
  • Head 2 understands grammatical roles, determining likely antecedents for pronouns.
  • Head 3 uses positional information to reinforce connections based on word proximity.

Enhances Model Robustness:

  • Even if one head misinterprets a relationship, other heads can compensate, leading to correct overall understanding.

Improves Generalization:

  • By learning diverse patterns, the model can apply its understanding to new, unseen sentences with similar structures.

4.2 Feedforward Neural Networks: Deepening the Model’s Understanding

After the self-attention mechanism has dynamically adjusted the embeddings based on context, the transformer’s next step is to further process this information to capture even more complex patterns. This is where the Feedforward Neural Network (FFN) comes into play within each transformer block.

What Is the Feedforward Neural Network in a Transformer?

Image Source: The Math Behind Attention Keys, Queries, and Values Matrices

Think of the FFN as the “refinement layer” of the transformer block. It’s like the fourth floor in a building, situated right after the attention mechanism. While the attention mechanism allows the model to focus on relevant parts of the input sequence, the FFN processes the context-rich embeddings independently to enhance feature representation.

Here’s a comforting thought: if you’ve wrapped your head around the attention mechanism, the FFN is relatively straightforward. It’s essentially a two-layer fully connected neural network that applies non-linear transformations to each token’s embedding individually.

Where Does It Fit In?

In each transformer block, the FFN comes immediately after the multi-head attention layer. The sequence is as follows:

  1. Multi-Head Attention: Captures relationships between tokens.
  2. Add & Norm: Adds a residual connection and applies layer normalization.
  3. Feedforward Neural Network: Processes each token independently to capture complex patterns.
  4. Add & Norm: Another residual connection and layer normalization.

Breaking Down the Structure of the FFN

So, how does the FFN work? Let’s break it down into its two layers:

First Layer

  • Input: Receives the context-rich embeddings from the attention layer, typically of dimension $d_{model}$ (e.g., 512).
  • Fully Connected: Every neuron is connected to every neuron in the preceding layer.
  • Expansion: The layer expands the dimensionality from $d_{model}$ to a higher dimension $d_{ff}$ (e.g., 2048).
  • Activation Function: Applies a non-linear activation function, commonly the Rectified Linear Unit (ReLU). This introduces non-linearity, allowing the network to learn complex patterns.

Second Layer

  • Fully Connected: Again, fully connected to the previous layer.
  • Reduction: Reduces the dimensionality back to $d_{model}$ (e.g., from 2048 back down to 512).
  • No Activation Function: Applies a linear transformation without an activation function.

Why Is the Feedforward Neural Network Important?

Adding Complexity

The FFN introduces additional weights and biases into the model, increasing its capacity to learn intricate patterns in the data. By expanding the dimensionality, it provides more “room” for the network to capture subtle nuances in language.

Non-Linear Transformations

The activation function in the first layer introduces non-linearity, which is crucial for modeling the complex, non-linear relationships inherent in human language.

  • Role of Activation Functions: Without them, the FFN would be limited to learning linear mappings, severely restricting its expressiveness.
  • Impact: Non-linear activation functions like ReLU allow the network to capture intricate patterns that linear transformations cannot.

Enhancing Feature Representation

After the attention mechanism enriches the embeddings with contextual information, the FFN refines these embeddings, allowing the model to extract higher-level features necessary for downstream tasks like translation or summarization.

How the FFN Processes Each Token

A key characteristic of the FFN in transformers is that it processes each token independently. Here’s what that means:

  • Parallel Processing: The same feedforward network is applied to each token’s embedding separately.
  • No Token Interaction: Unlike the attention mechanism, the FFN does not allow tokens to interact or share information at this stage.
  • Preserving Positional Integrity: By processing tokens independently, the FFN maintains the positional and contextual information established by the attention mechanism.

Comparison with Attention Mechanism:

  • Attention Mechanism: Enables tokens to consider information from other tokens via self-attention.
  • Feedforward Neural Network: Focuses on transforming each token’s embedding individually, adding depth and abstraction.

The Benefits of the Feedforward Neural Network

Enhanced Feature Extraction

The FFN adds another layer of abstraction, refining the token embeddings to capture higher-order features in the data.

Flexibility and Learning Capacity

By introducing non-linear transformations and additional parameters, the FFN enhances the model’s ability to generalize from training data to unseen examples.

Improved Performance

The depth provided by the FFN helps the transformer handle a broader range of language phenomena, improving performance in various NLP tasks.

The Mathematical Heart of the FFN

Mathematically, the FFN can be described by the following equations for each token’s embedding x:

First Layer with Activation

FFN₁(x) = ReLU(xW₁ + b₁)

Where:

  • W₁ is the weight matrix for the first layer.
  • b₁ is the bias vector for the first layer.

The ReLU function applies the non-linearity.

Second Layer

FFN₂(x) = FFN₁(x)W₂ + b₂

Where:

  • W₂ is the weight matrix for the second layer.
  • b₂ is the bias vector for the second layer.

No activation function is applied here.

Overall FFN Transformation

FFN(x) = FFN₂(x) = ReLU(xW₁ + b₁)W₂ + b₂

A Step-by-Step Recap of How the FFN Works

Input:

Receives embeddings of dimension $d_{model}$ (e.g., 512) from the attention layer.

First Layer:

Multiplies the input by W₁ and adds b₁.

Expands dimensionality to $d_{ff}$ (e.g., 2048).

Applies the ReLU activation function.

Second Layer:

Multiplies the result by W₂ and adds b₂.

Reduces dimensionality back to $d_{model}$ (e.g., 512).

No activation function is applied.

Output:

Produces refined embeddings ready for the next layer or output processing.
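A minimal NumPy sketch of this two-layer computation, with random matrices standing in for the learned weights, follows the same expand, ReLU, reduce pattern described above.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
# Random stand-ins for the learned parameters W1, b1, W2, b2.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(0.0, x @ W1 + b1)   # first layer + ReLU (expand to d_ff)
    return hidden @ W2 + b2                 # second layer, linear (reduce to d_model)

tokens = rng.normal(size=(9, d_model))      # hypothetical: 9 token embeddings
print(ffn(tokens).shape)                    # (9, 512): applied to each token independently
```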

Putting It All Together: The FFN in the Transformer Architecture

While the attention mechanism allows tokens to interact and share information, the FFN adds depth to the model by applying complex, non-linear transformations to each token individually.

Residual Connections and Layer Normalization:

  • Similar to the attention mechanism, the FFN is wrapped with residual connections and layer normalization.
  • Residual Connection: Adds the original input to the FFN’s output to help with gradient flow during training.
  • Layer Normalization: Stabilizes and accelerates training by normalizing the inputs across the features.

5. Softmax Layer: Turning Scores into Probabilities

After passing through multiple transformer blocks consisting of self-attention mechanisms and feedforward neural networks, the model produces raw scores for each word in its vocabulary. However, these scores are not immediately useful for predicting the next word in a sequence. This is where the Softmax layer comes into play — it converts these raw scores into probabilities, allowing the model to make probabilistic predictions and generate more natural language.

The Need for Probabilistic Outputs

When a transformer model processes an input sequence, such as ‘How are’, it generates raw scores (also known as logits) for every possible next word in its vocabulary. The highest-scoring word might be ‘you’, completing the phrase as ‘How are you’. If the model always selects the word with the highest score, it becomes deterministic and may produce repetitive or robotic responses.

By introducing randomness through probabilistic outputs, the model can generate varied and contextually appropriate responses. This approach makes the language generation more natural and less predictable.

Raw Scores and Their Challenges

The raw scores generated by the model have several issues:

  • Negative Scores: The scores can be negative or positive, but probabilities must be non-negative.
  • Non-Normalized: The scores do not sum to 1, which is necessary for a valid probability distribution.

For example, consider the raw scores for a vocabulary of four words:

  • Word 1: 1
  • Word 2: 0
  • Word 3: 4
  • Word 4: -1

Simply normalizing these scores by dividing each by the sum would not resolve the issues of negative values or ensure that the probabilities sum to 1.

Introducing the Softmax Function

The Softmax function addresses these challenges by converting raw scores into a valid probability distribution. The formula for the Softmax function is:

P(i) = e^(s_i) / Σ_j e^(s_j)

Where:

  • P(i) is the probability of word i.
  • s_i is the raw score (logit) for word i.
  • e^(s_i) is the exponential of the raw score.
  • The denominator sums the exponentials of all raw scores in the vocabulary.

Properties of the Softmax Function

  • Positive Outputs: Exponentials of any real number are positive, ensuring all probabilities are positive.
  • Normalization: The probabilities sum to 1, satisfying the requirements of a probability distribution.

Applying Softmax: An Example

Using the raw scores provided earlier:

  • Word 1: 1
  • Word 2: 0
  • Word 3: 4
  • Word 4: -1

Step 1: Calculate Exponentials

  • e¹ ≈ 2.718
  • e⁰ = 1
  • e⁴ ≈ 54.598
  • e⁻¹ ≈ 0.368

Step 2: Sum the Exponentials

Total sum: 2.718 + 1 + 54.598 + 0.368 ≈ 58.684

Step 3: Compute Probabilities

  • P(Word 1) = 2.718 / 58.684 ≈ 0.046 (4.6%)
  • P(Word 2) = 1 / 58.684 ≈ 0.017 (1.7%)
  • P(Word 3) = 54.598 / 58.684 ≈ 0.930 (93.0%)
  • P(Word 4) = 0.368 / 58.684 ≈ 0.006 (0.6%)

Interpretation:

Word 3, with the highest raw score, now has the highest probability.

Words with lower or negative scores receive smaller, but non-zero, probabilities.

All probabilities are positive and sum to 1.
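The same computation in a few lines of NumPy reproduces these numbers:

```python
import numpy as np

scores = np.array([1.0, 0.0, 4.0, -1.0])    # raw scores (logits) from the example above
exp_scores = np.exp(scores)                 # exponentials: all positive
probs = exp_scores / exp_scores.sum()       # normalize so the values sum to 1

print(probs.round(3))   # [0.046 0.017 0.93  0.006]
print(probs.sum())      # ≈ 1.0
```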

Benefits of Using Softmax

  • Valid Probability Distribution: The Softmax function ensures that all output probabilities are between 0 and 1 and that they sum to 1, making them suitable for probabilistic interpretations.
  • Reflects Model Confidence: By applying the exponential function, the Softmax accentuates differences in raw scores. Higher scores translate to significantly higher probabilities, reflecting the model’s confidence in its predictions.
  • Handles Negative Scores: Negative raw scores are transformed into positive probabilities, ensuring that all words have a chance (however small) of being selected. This is crucial for generating varied and natural language.

Generating the Next Word Stochastically

Once the Softmax function has produced a probability distribution, the model can sample from this distribution to select the next word.

Advantages of Stochastic Sampling:

  • Natural Language Variation: Introduces variability, making the generated text less repetitive and more human-like.
  • Exploration of Alternatives: Allows the model to occasionally select less probable words, which can lead to more creative or diverse outputs.

Adjusting Randomness with Temperature

Some models introduce a temperature parameter (τ) to control the randomness of the output:

P(i) = e^(s_i / τ) / Σ_j e^(s_j / τ)

  • Lower τ (< 1): Makes the probability distribution sharper, increasing the likelihood of selecting the highest-probability word (more deterministic).
  • Higher τ (> 1): Flattens the distribution, giving lower-probability words a better chance (more randomness).
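A small sketch of temperature-scaled sampling, reusing the logits from the earlier example, shows how τ reshapes the distribution before the next word is drawn:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())          # subtract max for numerical stability
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = [1.0, 0.0, 4.0, -1.0]
for tau in (0.5, 1.0, 2.0):
    _, probs = sample_next(logits, temperature=tau)
    print(tau, probs.round(3))   # lower τ sharpens the distribution, higher τ flattens it
```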

Integrating Softmax into the Transformer Model

In the overall architecture:

  1. Processing Input: The input sequence is tokenized, embedded, and passed through positional encoding.
  2. Transformer Blocks: The sequence is processed through multiple layers of self-attention and feedforward neural networks.
  3. Output Layer: The final hidden states are projected onto the vocabulary space, producing raw scores for each possible next word.
  4. Applying Softmax: The Softmax function converts these raw scores into probabilities.
  5. Prediction: The model samples from this probability distribution to generate the next word.

Training Considerations

During training, the model uses the Softmax outputs to compute the cross-entropy loss, comparing the predicted probabilities with the actual next words in the training data. The model adjusts its parameters to minimize this loss, effectively learning to assign higher probabilities to correct words.
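As a toy illustration (assuming the correct next word is the one at index 2, i.e. “Word 3” from the earlier example), the cross-entropy loss is simply the negative log-probability the model assigned to that word:

```python
import numpy as np

probs = np.array([0.046, 0.017, 0.930, 0.006])   # Softmax output from the earlier example
target_index = 2                                 # assumed index of the correct next word

loss = -np.log(probs[target_index])
print(round(loss, 3))   # ≈ 0.073; the loss shrinks as the correct word's probability nears 1
```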

Softmax in Action: Balancing Determinism and Randomness

The Softmax layer plays a pivotal role in balancing the model’s confidence and the introduction of randomness:

  • Deterministic Behavior: Selecting the word with the highest probability every time can lead to predictable and dull text.
  • Randomness: Sampling based on the probability distribution allows for more varied and interesting outputs.

By carefully adjusting the temperature parameter and leveraging the properties of the Softmax function, the model can generate text that is both coherent and engaging.

Summary

Transformers have revolutionized Natural Language Processing (NLP) since their introduction in 2017. Unlike traditional models like RNNs and LSTMs, transformers process entire sequences simultaneously using attention mechanisms, making them more efficient at capturing long-range dependencies in language.

Key Components of the Transformer Architecture:

  • Tokenization: Splitting input text into tokens (words, subwords, or characters) that the model can process.
  • Embeddings: Converting tokens into numerical vectors in a high-dimensional space to capture semantic meaning.
  • Positional Encoding: Adding positional information to embeddings to retain the order of tokens in sequences.
  • Self-Attention Mechanism: Allowing the model to focus on relevant parts of the input by weighing the importance of each token relative to others.
  • Multi-Head Attention: Using multiple attention mechanisms in parallel to capture different aspects of relationships between tokens within the same attention block.
  • Feedforward Neural Networks: Applying non-linear transformations to each token’s embedding independently to capture complex patterns.
  • Softmax Layer: Converting raw output scores into probabilities to predict the next word, enabling probabilistic and more natural language generation.

These components work together to enable transformers to excel in various NLP tasks such as machine translation, text summarization, question-answering, and language modeling. By leveraging attention mechanisms and processing data in parallel, transformers have set new standards for efficiency and performance in language understanding and generation.

Saurabh Harak

Hi, I'm a software developer/ML Engineer passionate about solving problems and delivering solutions through code. I love to explore new technologies.