The Boyer-Moore algorithm is a highly efficient string-searching algorithm. Unlike naive string search, which checks every possible position, Boyer-Moore uses heuristics to skip characters and reduce the number of comparisons. This makes it particularly useful for searching large texts for relatively short patterns. This post will guide you through implementing the Boyer-Moore algorithm in Python.
Understanding the Heuristics
The algorithm’s efficiency stems from two key heuristics:
- Bad Character Heuristic: If a mismatch occurs at a character
c
in the pattern, the algorithm shifts the pattern to the right until either:c
appears in the pattern at or to the right of the mismatch position, or- the end of the pattern is reached.
- Good Suffix Heuristic: If a mismatch occurs, the algorithm examines the suffix of the pattern that matches the text. It then shifts the pattern based on the occurrences of this matching suffix within the pattern itself.
Python Implementation
Let’s break down a Python implementation of the Boyer-Moore algorithm:
def boyer_moore(text, pattern):
"""
Implements the Boyer-Moore string matching algorithm.
Args:
text: The text to search within.
pattern: The pattern to search for.
Returns:
A list of indices where the pattern is found in the text. Returns an empty list if no matches are found.
"""
= len(pattern)
m = len(text)
n if m > n:
return [] # Pattern is longer than the text
# Preprocessing: Create bad character and good suffix tables
= {}
bad_char for i in range(m - 1):
= m - 1 - i
bad_char[pattern[i]]
= {}
suffix for i in range(m):
= m
suffix[i]
for i in range(m - 1):
= i
j = 0
k while j >= 0 and pattern[j] == pattern[m - 1 - k]:
-= 1
j += 1
k = j + 1
suffix[k]
for i in range(m - 1):
=m-1-i
j=1
kwhile k <= m-1-i and j<m-1:
if pattern[j]==pattern[m-1-k]:
=j+1
suffix[k]=j-1
j=k+1
kelse:
break
# Search
= []
occurrences = 0
i while i <= n - m:
= m - 1
j while j >= 0 and pattern[j] == text[i + j]:
-= 1
j if j == -1:
occurrences.append(i)+= 1 # Shift by one after a match
i else:
= 0
shift if text[i + j] in bad_char:
= max(shift, j - bad_char[text[i + j]])
shift
= suffix[1]
suffix_shift if j < m - 1 and j > 0:
=1
kwhile k <=j+1:
if pattern[m-1-k]==text[i+j]:
=min(suffix_shift,j+1-suffix[k])
suffix_shift=k+1
k
= max(shift, suffix_shift)
shift += shift
i
return occurrences
# Example Usage
= "This is a test string to demonstrate the Boyer-Moore algorithm."
text = "algorithm"
pattern = boyer_moore(text, pattern)
matches print(f"Pattern '{pattern}' found at indices: {matches}")
= "ABABDABACDABABCABAB"
text = "ABAB"
pattern = boyer_moore(text, pattern)
matches print(f"Pattern '{pattern}' found at indices: {matches}")
This code implements both the bad character and good suffix heuristics. The bad_char
and suffix
dictionaries are pre-computed for efficiency. The main search loop iterates through the text, utilizing the heuristics to efficiently skip portions of the text.
Optimizations and Further Reading
While this implementation provides a good understanding of the algorithm, further optimizations are possible, particularly for very large texts and patterns. Consider exploring these avenues for enhanced performance:
- Using more complex data structures: Replacing dictionaries with more efficient structures for specific cases.
- Further refinement of the good suffix heuristic: Implementing variations for improved shift calculations.
This post provides a solid foundation for understanding and implementing the Boyer-Moore algorithm. Experiment with different texts and patterns to observe its efficiency firsthand.