Implement the KMP (Knuth-Morris-Pratt) String Matching Algorithm

problem-solving
Published

June 2, 2024

The Knuth-Morris-Pratt (KMP) algorithm is a powerful string searching algorithm renowned for its efficiency. Unlike naive string searching, which can have a time complexity of O(mn) (where ‘m’ is the length of the pattern and ‘n’ is the length of the text), KMP boasts a linear time complexity of O(m+n). This significant improvement stems from its clever use of a pre-processed “partial match” table. This post will walk you through the implementation of the KMP algorithm in Python, providing explanations and code examples to help you understand its inner workings.

Understanding the Partial Match Table (PMT)

The heart of the KMP algorithm lies in the partial match table (PMT), also known as the failure function. This table, generated from the search pattern, tells us how far to shift the pattern when a mismatch occurs. For each index i in the pattern, PMT[i] represents the length of the longest proper prefix of the pattern that is also a suffix of the pattern[0…i].

Let’s illustrate with an example. Consider the pattern “ABABCABAB”. The PMT would be calculated as follows:

Index (i) Pattern[0…i] Longest Proper Prefix which is also a Suffix PMT[i]
0 A - 0
1 AB - 0
2 ABA A 1
3 ABAB AB 2
4 ABABC AB 2
5 ABABCA A 1
6 ABABCAB AB 2
7 ABABCABA ABA 3
8 ABABCABAB ABAB 4

Notice that if a mismatch occurs at index i in the text, we can shift the pattern by i - PMT[i] positions. This avoids redundant comparisons.

Python Implementation

Here’s a Python implementation of the KMP algorithm, including the PMT generation:

def kmp_matcher(text, pattern):
    """
    Implements the Knuth-Morris-Pratt (KMP) string matching algorithm.

    Args:
        text: The text string to search within.
        pattern: The pattern string to search for.

    Returns:
        A list of indices where the pattern is found in the text.  Returns an empty list if the pattern is not found.
    """

    m = len(pattern)
    n = len(text)
    
    # PMT Generation
    pmt = [0] * m
    length = 0
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            pmt[i] = length
            i += 1
        else:
            if length != 0:
                length = pmt[length - 1]
            else:
                i += 1

    # Matching
    occurrences = []
    i = 0
    j = 0
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            occurrences.append(i - j)
            j = pmt[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = pmt[j - 1]
            else:
                i += 1

    return occurrences


# Example Usage
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
indices = kmp_matcher(text, pattern)
print(f"Pattern found at indices: {indices}")  # Output: Pattern found at indices: [10]

text = "THIS IS A TEST TEXT"
pattern = "TEST"
indices = kmp_matcher(text, pattern)
print(f"Pattern found at indices: {indices}") # Output: Pattern found at indices: [10]

text = "AABBAC"
pattern = "AAB"
indices = kmp_matcher(text,pattern)
print(f"Pattern found at indices: {indices}") # Output: Pattern found at indices: [0]

This code first generates the PMT and then uses it to efficiently search for the pattern within the text. The kmp_matcher function returns a list of the starting indices of all occurrences of the pattern in the text.

Analyzing the Time Complexity

The PMT generation takes O(m) time, and the matching process also takes O(n) time in the worst case. Therefore, the overall time complexity of the KMP algorithm is O(m + n), making it faster than naive string searching for larger texts and patterns. This efficiency makes KMP a preferred choice for many string searching applications.