Tokenization

Introduction

Tokenization constitutes one of the fundamental processes in natural language processing and deep learning applied to text. This process consists of the systematic decomposition of textual sequences into smaller units called "tokens", which can correspond to words, subwords, or even individual characters, depending on the strategy employed.

The need for tokenization arises from an inherent limitation of computational systems: they operate exclusively with numerical representations. While humans process language naturally through linguistic symbols, neural network architectures require all information to be encoded in the form of numerical vectors. Tokenization therefore acts as a bridge between the linguistic domain and the mathematical domain, allowing machine learning models to process, analyze, and generate text effectively.

Basic word-by-word tokenization

The most intuitive approach to tokenization consists of segmenting text using whitespace as natural delimiters between words. This method, although simple, allows understanding the fundamental principles of the tokenization process and establishes the foundations for more sophisticated techniques.

# Basic tokenization example
texto = "I like machine learning"

# Method 1: Using Python's split()
tokens = texto.split()
print("Original text:", texto)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

Building a simple tokenizer

To advance beyond simple text splitting, it is necessary to build a system that not only segments words, but also establishes a one-to-one correspondence between each unique word and a numerical identifier. This mapping allows representing any text as a sequence of numbers, facilitating its processing by machine learning models.

The implementation of a basic tokenizer requires maintaining two complementary data structures: a dictionary that maps words to numbers and another that performs the inverse transformation. Additionally, a mechanism is needed to assign unique identifiers to each new word found during the training process.

class SimpleTokenizer:
    """
    A basic tokenizer that splits text into words
    and assigns them unique numbers.
    """

    def __init__(self):
        # Dictionary to store word -> number
        self.word_to_number = {}
        # Inverse dictionary: number -> word
        self.number_to_word = {}
        # Counter to assign numbers
        self.next_number = 0

    def train(self, texts):
        """
        Learns which words exist in the texts.

        Args:
            texts: List of strings with training texts
        """
        for text in texts:
            # Convert to lowercase and split into words
            words = text.lower().split()

            # For each word, if we haven't seen it, assign it a number
            for word in words:
                if word not in self.word_to_number:
                    self.word_to_number[word] = self.next_number
                    self.number_to_word[self.next_number] = word
                    self.next_number += 1

        print(f"Learned vocabulary: {len(self.word_to_number)} words")

    def encode(self, text):
        """
        Converts text into a list of numbers.
        """
        words = text.lower().split()
        numbers = []

        for word in words:
            if word in self.word_to_number:
                numbers.append(self.word_to_number[word])
            else:
                # If we don't know the word, use -1
                numbers.append(-1)

        return numbers

    def decode(self, numbers):
        """
        Converts a list of numbers back to text.
        """
        words = []

        for number in numbers:
            if number in self.number_to_word:
                words.append(self.number_to_word[number])
            else:
                words.append("[UNKNOWN]")

        return " ".join(words)

    def show_vocabulary(self):
        """Shows all words the tokenizer knows."""
        print("\nComplete vocabulary:")
        print("-" * 40)
        for word, number in sorted(self.word_to_number.items(), key=lambda x: x[1]):
            print(f"{number:3d} -> {word}")

# Usage example
print("=" * 50)
print("EXAMPLE 1: Simple Tokenizer")
print("=" * 50)

# Training texts
training_texts = ["i like programming", "i like learning", "programming is fun"]

# Create and train the tokenizer
tokenizer = SimpleTokenizer()
tokenizer.train(training_texts)

# Show the learned vocabulary
tokenizer.show_vocabulary()

# Test encoding
new_text = "i like learning programming"
print(f"\nText to encode: '{new_text}'")

encoded = tokenizer.encode(new_text)
print(f"Encoded text: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded text: '{decoded}'")

# Test with unknown word
unknown_text = "i like cooking"
print(f"\nText with new word: '{unknown_text}'")
encoded_unk = tokenizer.encode(unknown_text)
print(f"Encoded: {encoded_unk}")
print("Note: -1 indicates unknown word")

Special tokens and vocabulary management

Tokenization in real-world natural language processing applications presents challenges that go beyond simple word-to-number conversion. Situations arise that require special treatment: words that did not appear during training, sequences of different lengths that must be processed in batches, and the need to explicitly mark the beginning and end of sequences.

To address these issues, modern tokenization systems incorporate special tokens with specific functions. The padding token allows uniformizing sequence length, facilitating parallel processing. The unknown word token provides a consistent representation for terms not seen during training. The start and end of sequence tokens allow models to explicitly identify the boundaries of each input, which is especially relevant in text generation and machine translation tasks.

class TokenizerWithSpecials:
    """
    Tokenizer that handles unknown words and padding.
    """

    def __init__(self):
        # Special tokens
        self.PAD = "[PAD]"  # For padding short sequences
        self.UNK = "[UNK]"  # For unknown words
        self.SOS = "[SOS]"  # Start of Sequence
        self.EOS = "[EOS]"  # End of Sequence

        # Initialize dictionaries with special tokens
        self.word_to_number = {self.PAD: 0, self.UNK: 1, self.SOS: 2, self.EOS: 3}

        self.number_to_word = {0: self.PAD, 1: self.UNK, 2: self.SOS, 3: self.EOS}

        self.next_number = 4

    def train(self, texts):
        """Learns the vocabulary from texts."""
        for text in texts:
            words = text.lower().split()

            for word in words:
                if word not in self.word_to_number:
                    self.word_to_number[word] = self.next_number
                    self.number_to_word[self.next_number] = word
                    self.next_number += 1

        print(f"Vocabulary: {len(self.word_to_number)} words")
        print(f"  - Special words: 4")
        print(f"  - Normal words: {len(self.word_to_number) - 4}")

    def encode(self, text, add_special=True, fixed_length=None):
        """
        Encodes text with advanced options.

        Args:
            text: Text to encode
            add_special: Whether to add [SOS] and [EOS]
            fixed_length: If specified, adjusts to this length
        """
        words = text.lower().split()

        # Convert words to numbers
        numbers = []
        for word in words:
            if word in self.word_to_number:
                numbers.append(self.word_to_number[word])
            else:
                numbers.append(self.word_to_number[self.UNK])

        # Add start and end tokens if requested
        if add_special:
            numbers = (
                [self.word_to_number[self.SOS]]
                + numbers
                + [self.word_to_number[self.EOS]]
            )

        # Adjust to fixed length if specified
        if fixed_length is not None:
            if len(numbers) < fixed_length:
                # Pad with PAD
                numbers = numbers + [self.word_to_number[self.PAD]] * (
                    fixed_length - len(numbers)
                )
            else:
                # Truncate
                numbers = numbers[:fixed_length]

        return numbers

    def decode(self, numbers, remove_special=True):
        """Decodes numbers to text."""
        words = []

        for number in numbers:
            if number in self.number_to_word:
                word = self.number_to_word[number]

                # Skip special tokens if requested
                if remove_special and word in [self.PAD, self.UNK, self.SOS, self.EOS]:
                    continue

                words.append(word)

        return " ".join(words)

# Usage examples
print("\n" + "=" * 50)
print("EXAMPLE 2: Special Tokens and Padding")
print("=" * 50)

# Train
texts = ["hello world", "python is great", "i like learning"]

tokenizer_v2 = TokenizerWithSpecials()
tokenizer_v2.train(texts)

# Encode without fixed length
text1 = "hello python"
print(f"\nText 1: '{text1}'")
cod1 = tokenizer_v2.encode(text1)
print(f"Encoded: {cod1}")
print(f"Length: {len(cod1)}")

# Encode with fixed length
text2 = "i like"
print(f"\nText 2: '{text2}'")
cod2 = tokenizer_v2.encode(text2, fixed_length=10)
print(f"Encoded (fixed length=10): {cod2}")
print(f"Length: {len(cod2)}")

# Decode
print(f"Decoded: '{tokenizer_v2.decode(cod2)}'")

Visualization of the tokenization process

Deep understanding of the tokenization process is facilitated through explicit visualization of the transformations the text undergoes at each stage. Observing how a sequence of words is converted into a sequence of numerical identifiers, and can subsequently be recovered as original text, allows identifying potential problems and understanding the system's behavior with different inputs.

def visualize_tokenization(tokenizer, texts):
    """
    Visually shows how each text is tokenized.
    """
    print("\n" + "=" * 60)
    print("TOKENIZATION VISUALIZATION")
    print("=" * 60)

    for i, text in enumerate(texts, 1):
        print(f"\n{i}. Original text:")
        print(f"   '{text}'")

        # Encode
        encoded = tokenizer.encode(text, add_special=True)

        print(f"\n   Tokens (numbers):")
        print(f"   {encoded}")

        print(f"\n   Visual representation:")
        # Show each token with its word
        words = ["[SOS]"] + text.lower().split() + ["[EOS]"]
        for word, number in zip(words, encoded):
            print(f"   {word:15} -> {number:3}")

        print(f"\n   Decoded:")
        print(f"   '{tokenizer.decode(encoded)}'")
        print("-" * 60)

# Visualization example
example_texts = [
    "python is great",
    "i like programming",
    "hello artificial intelligence",
]

visualize_tokenization(tokenizer_v2, example_texts)

Length normalization through padding

One of the most relevant technical aspects in textual sequence processing is the management of variable lengths. Natural texts present considerable diversity in terms of their length: from brief phrases of few words to extensive paragraphs with dozens or hundreds of tokens. However, neural network architectures, especially when processing multiple examples simultaneously in batches, require all input sequences to have uniform dimensions.

Padding constitutes the standard solution to this problem. It consists of artificially extending shorter sequences until reaching a target length, typically determined by the longest sequence in the batch. This extension is performed through the insertion of special padding tokens that the model learns to ignore during processing. Alternatively, when a sequence exceeds the maximum allowed length, truncation is applied, preserving only the first tokens up to the established limit.

def compare_lengths(tokenizer, texts):
    """
    Compares the lengths of different texts and shows
    how padding uniformizes them.
    """
    print("\n" + "=" * 60)
    print("LENGTH COMPARISON")
    print("=" * 60)

    # Find maximum length
    lengths = []
    for text in texts:
        cod = tokenizer.encode(text, add_special=True)
        lengths.append(len(cod))

    max_length = max(lengths)

    print(f"\nMaximum length found: {max_length} tokens")
    print("\nComparison:")
    print("-" * 60)

    for text in texts:
        # Without padding
        without_padding = tokenizer.encode(text, add_special=True)

        # With padding
        with_padding = tokenizer.encode(text, add_special=True, fixed_length=max_length)

        print(f"\nText: '{text}'")
        print(f"Without padding (length {len(without_padding)}): {without_padding}")
        print(f"With padding (length {len(with_padding)}): {with_padding}")

        # Count how many PADs were added
        num_pads = with_padding.count(0)
        print(f"PADs added: {num_pads}")

# Comparison example
different_texts = ["hello", "python is great", "i like learning programming in python"]

compare_lengths(tokenizer_v2, different_texts)

Practical application: Review processing system

The integration of all presented concepts materializes in the construction of a complete system capable of processing real-world text. A representative use case is product review analysis, where the objective consists of transforming opinions expressed in natural language into numerical representations that can subsequently feed sentiment classification models or other analysis tasks.

This system integrates the tokenizer with special token management capabilities, length normalization, and vocabulary maintenance built from a training set. The resulting architecture allows processing new reviews consistently, applying the same transformations that will be used during machine learning model training.

class ReviewSystem:
    """
    Complete system for processing product reviews.
    """

    def __init__(self):
        self.tokenizer = TokenizerWithSpecials()
        self.reviews = []
        self.labels = []  # 1 = positive, 0 = negative

    def add_review(self, text, is_positive):
        """Adds a review to the system."""
        self.reviews.append(text)
        self.labels.append(1 if is_positive else 0)

    def train(self):
        """Trains the tokenizer with all reviews."""
        print("Training tokenizer with reviews...")
        self.tokenizer.train(self.reviews)

    def process_review(self, text, length=15):
        """Processes a new review."""
        print(f"\nProcessing: '{text}'")
        print("-" * 50)

        # Tokenize
        tokens = text.lower().split()
        print(f"1. Split into words: {tokens}")

        # Encode
        encoded = self.tokenizer.encode(text, add_special=True, fixed_length=length)
        print(f"2. Convert to numbers: {encoded}")

        # Decode
        decoded = self.tokenizer.decode(encoded)
        print(f"3. Decode: '{decoded}'")

        return encoded

    def show_statistics(self):
        """Shows dataset statistics."""
        print("\n" + "=" * 60)
        print("SYSTEM STATISTICS")
        print("=" * 60)

        print(f"\nTotal reviews: {len(self.reviews)}")
        print(f"Positive reviews: {sum(self.labels)}")
        print(f"Negative reviews: {len(self.labels) - sum(self.labels)}")
        print(f"Vocabulary: {len(self.tokenizer.word_to_number)} words")

        # Lengths
        lengths = [len(r.split()) for r in self.reviews]
        print(f"\nAverage length: {sum(lengths)/len(lengths):.1f} words")
        print(f"Minimum length: {min(lengths)} words")
        print(f"Maximum length: {max(lengths)} words")

# Create the system
print("=" * 60)
print("MINI PROJECT: Review System")
print("=" * 60)

system = ReviewSystem()

# Add training reviews
training_reviews = [
    ("this product is excellent", True),
    ("very bad quality do not recommend", False),
    ("incredible i love it", True),
    ("terrible experience", False),
    ("perfect product arrived fast", True),
    ("does not work properly", False),
]

print("\nAdding reviews to the system...")
for text, is_positive in training_reviews:
    system.add_review(text, is_positive)
    sentiment = "POSITIVE" if is_positive else "NEGATIVE"
    print(f"  - [{sentiment}] {text}")

# Train
system.train()

# Show statistics
system.show_statistics()

# Process new reviews
print("\n" + "=" * 60)
print("PROCESSING NEW REVIEWS")
print("=" * 60)

new_reviews = [
    "excellent product very good",
    "bad experience terrible",
    "perfect recommend",
]

for review in new_reviews:
    system.process_review(review)