Find the Longest Common Subsequence (LCS) Between Two Strings

problem-solving
Published

June 26, 2024

Finding the longest common subsequence (LCS) is a classic computer science problem with applications in various fields, including bioinformatics (comparing DNA sequences), version control systems (identifying changes between files), and spell checking. This post will look at how to efficiently solve this problem in Python using dynamic programming.

Understanding the Problem

A subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements. For example, “abc” is a subsequence of “abcabcbb”. The longest common subsequence problem aims to find the longest subsequence that is common to two or more input sequences.

Let’s consider two strings:

  • String A: “AGGTAB”
  • String B: “GXTXAYB”

The longest common subsequence is “GTAB”. Note that it’s not necessarily a contiguous substring; the characters can be scattered throughout the original strings.

Dynamic Programming Approach

Dynamic programming provides an efficient way to solve the LCS problem. We’ll build a table (matrix) to store the lengths of common subsequences for all possible prefixes of the two input strings.

The algorithm works as follows:

  1. Initialization: Create a 2D array (matrix) dp of size (m+1) x (n+1), where ‘m’ and ‘n’ are the lengths of strings A and B respectively. Initialize the first row and column to 0.

  2. Iteration: Iterate through the strings. For each pair of characters A[i] and B[j]:

    • If A[i] equals B[j], it means we’ve found a common character. The length of the LCS increases by 1: dp[i+1][j+1] = dp[i][j] + 1.
    • Otherwise, the length of the LCS is the maximum of the LCS of the prefixes ending at A[i] and B[j-1] or A[i-1] and B[j]: dp[i+1][j+1] = max(dp[i][j+1], dp[i+1][j]).
  3. Result: The value dp[m][n] contains the length of the LCS.

Python Code Implementation

Here’s a Python function that implements the dynamic programming solution:

def longest_common_subsequence(str1, str2):
    m = len(str1)
    n = len(str2)

    # Create a table to store results of subproblems
    dp = [[0 for x in range(n + 1)] for x in range(m + 1)]

    # Fill dp[][] in bottom up manner
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = 0
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    return dp[m][n]

# Example usage
stringA = "AGGTAB"
stringB = "GXTXAYB"
length = longest_common_subsequence(stringA, stringB)
print(f"Length of LCS: {length}")  # Output: Length of LCS: 4


#Function to print the actual LCS
def print_lcs(str1,str2):
    m = len(str1)
    n = len(str2)
    dp = [[0 for x in range(n + 1)] for x in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = 0
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    i = m
    j = n
    lcs = ""
    while i>0 and j>0:
        if str1[i-1] == str2[j-1]:
            lcs = str1[i-1]+lcs
            i-=1
            j-=1
        else:
            if dp[i-1][j]>dp[i][j-1]:
                i-=1
            else:
                j-=1
    print(f"LCS is: {lcs}")
print_lcs(stringA,stringB) #Output: LCS is: GTAB

This code efficiently calculates the length of the LCS and demonstrates how to extract the LCS itself. The dynamic programming approach ensures a time complexity of O(m*n), where ‘m’ and ‘n’ are the lengths of the input strings. This is more efficient than brute-force approaches.

Optimizations

For extremely large strings, memory usage could become a concern. One optimization technique involves using space-efficient dynamic programming, which reduces the memory footprint to O(min(m, n)) by only storing the current and previous rows of the dp matrix.