a shot of dev knowledge

The **Knuth-Morris-Pratt (KMP) algorithm** is an algorithm that is used to search for a substring (`W`

), in a given string (`S`

), in $O(m+n)$ time (where $m$ and $n$ are the lengths of `W`

and `S`

).

The key idea used to achieve this time complexity is to minimize the amount of backtracking when a character of `W`

does not match with that of `S`

. This can only be done if we know two things:

- Whether or not a proper prefix of
`W`

occurs more than once in`S`

â€‹after at least one character has been correctly found; if it does, it can be skipped when resuming the process of matching after a mismatch. - Length of the proper prefix.

1 character matched

Before starting the actual algorithm, a one-dimensional array is initialized with the number of characters that can be skipped after a mismatch. `arr[i]`

represents the number of characters that can be skipped when `W[i+1]`

does not match with a character in `S`

.

In the example above, `W[4]`

(i.e., `a`

) did not match with the character `c`

in `S`

(slide 5), so, the matching was resumed from `W[1]`

instead of `W[0]`

(slide 6). This means that character *1* was skipped; therefore, * arr[4-1] = 1*.

`arr`

is initialized using the failure function $f(i)$. This function is based on the fact that:

*When a mismatch occurs, all the previous characters match correctly; â€‹this implies that if a prefix of W occurred in this set of matching characters, then that prefix is also a suffix of W*.

In other words, `arr[i]`

will represent the length of the longest proper prefix of `W`

, which is also a proper suffix of `W`

(considering `W`

till the `i`

th index only).

The following is the code used to initialize `arr`

:

```
m = len(W)
i = 0
j = 1
# No proper prefix for string of length 1:
arr[0] = 0
while j < m:
if W[i] == W[j]:
i++
arr[j] = i
j++
# The first character didn't match:
else if i == 0:
arr[j] = 0
j++
# Mismatch after at least one matching character:
else:
i = arr[i - 1]
```

Initialization

The following is the KMP algorithm pseudocode used for searching substring `W`

, in string `S`

, in $O(m+n)$:

```
# Initialising variables:
i = 0
j = 0
m = len(W)
n = len(S)
# Start search:
while (i < m && j < n){
# Last character matches -> Substring found:
if (W[i] == S[j] && i == m - 1):
return true
# Character matches:
else if (W[i] == S[j]):
i++
j++
# Character didn't match -> Backtrack:
else:
i = arr[i - 1]
if i == 0:
j++
}
# Substring not found:
return false
```

The algorithm above (given in the form of a pseudocode) along with the failure function, has been implemented in Python:

# Initialise array based on the failure function: def init_arr(w): m = len(w) i = 0 j = 1 # No proper prefix for string of length 1: global arr arr[0] = 0 while j < m: if w[i] == w[j]: i += 1 arr[j] = i j += 1 # The first character didn't match: elif i == 0: arr[j] = 0 j += 1 # Mismatch after at least one matching character: else: i = arr[i - 1] def kmp_search(w, s): # Initialise array: init_arr(w) # Initialising variables: i = 0 j = 0 m = len(w) n = len(s) # Start search: while i < m and j < n: # Last character matches -> Substring found: if w[i] == s[j] and i == m - 1: return True # Character matches: elif w[i] == s[j]: i += 1 j += 1 # Character didn't match -> Backtrack: else: if i != 0: i = arr[i-1] else: j += 1 # Substring not found: return False text = "hayhello" substring = "hell" # create array: arr = [None] * len(substring) if kmp_search(substring, text): print("Substring found!") else: print("Could not find substring.")

RELATED COURSES

View all Courses

Keep Exploring

Related Courses