topic
analysis
Overview
This question is to find the needle string in the haystack string. 3.solutions are given below, all of which are based on sliding windows.
The simplest solution is to compare substrings one by one. The sliding window of length L is gradually moved along the haystack string, and the substring in the window is compared with the needle string. The time complexity is O((NL)L )
Show the above method can be optimized. Although the double pointer method is also linear time complexity, it can avoid comparing all substrings. Therefore, the time complexity in the optimal case is O(N), but the time complexity in the worst case is still O((N L)L).
Is there a solution with O(N) complexity? The answer is yes, there are two ways to achieve it:
 RabinKarp, through a hash algorithm to achieve string comparison within a constant time window.
 Bit operation, through bit mask to achieve string comparison within a constant time window.
Method 1: Compare the substrings one by onelinear time complexity
The most straightforward methodmove the sliding window step by step along the character change, and compare the substring in the window with the needle string.
achieve
class Solution {
public int strStr (String haystack, String needle) {
int L = needle.length(), n = haystack.length();
for ( int start = 0 ; start <nL + 1 ; ++start) {
if (haystack.substring(start, start + L).equals(needle)) {
return start;
}
}
return  1 ;
}
}
Copy code
Complexity analysis
 Time complexity: O((NL)L), where N is the length of the haystack string and L is the length of the needle string. The complexity of comparing strings in the inner loop is L, and a total of (NL) comparisons are required.
 Space complexity: O(1).
Method 2: Double pointerlinear time complexity
The flaw of the previous method is that all substrings of length L of haystack are compared with the needle string, which is actually unnecessary.
1. the comparison is only required when the first character of the substring is the same as the first character of the needle string.
Secondly, it can be compared character by character, and once it does not match, it will terminate immediately.
As shown in the figure below, a mismatch is found when the last bit is compared, and the backtracking starts at this time. Note that, the pointer is moved to the PN pn = pn  curr_len + 1 position, rather than pn = pn  curr_len position.
At this time, we compare again and find a complete matching substring, and return the starting position pnL of the substring directly.
algorithm
 Move the pn pointer until the character at the position pointed to by pn is equal to the first character of the needle string.
 Calculate the matching length through pn, pL, and curr_len.
 If it matches exactly (ie curr_len == L), return the starting coordinates of the matched substring (ie pnL).
 If it doesn't match exactly, backtrack. Let pn = pncurr_len + 1, pL = 0, and curr_len = 0.
achieve
class Solution {
public int strStr (String haystack, String needle) {
int L = needle.length(), n = haystack.length();
if (L == 0 ) return 0 ;
int pn = 0 ;
while (pn <nL + 1 ) {
//find the position of the first needle character
//in the haystack string
while (pn <nL + 1 && haystack.charAt(pn) != needle.charAt( 0 )) ++pn;
//compute the max match string
int currLen = 0 , pL = 0 ;
while (pL <L && pn <n && haystack.charAt(pn) == needle.charAt(pL)) {
++pn;
++pL;
++currLen;
}
//if the whole needle string is found,
//return its start position
if (currLen == L) return pnL;
//otherwise, backtrack
pn = pncurrLen + 1 ;
}
return  1 ;
}
}
Copy code
Complexity analysis
 Time complexity: The worst time complexity is O((NL)L), and the optimal time complexity is O(N).
 Space complexity: O(1).
Method 3: Rabin Karpconstant complexity
There is an algorithm whose worst time complexity is also O(N). The idea is this, first generate the hash code of the substring in the window, and then compare it with the hash code of the needle string.
This idea has a problem to be solved, how to generate the hash code of the substring in constant time?
Rolling hash: constant time to generate hash code
It takes O(L) time to generate a hash code with a length of L array.
How to generate the hash code of the sliding window array in constant time? Using the characteristics of the sliding window, one element enters and one comes out every time you slide.
Since only lowercase English letters appear, the string can be converted into an array of integers with values from 0 to 25:
h 0 = 0 2 6 3 + 1 2 6 2 + 2 2 6 1 + 3 2 6 0 h_0 = 0/times 26^3 + 1/times 26^2 + 2/times 26^1 + 3/times 26^0 h0 =0 263+1 262+2 261+3 260
The above formula can be written as a general formula, as shown below. Where c_i c**i is an element in an integer array, a = 26 a = 26, which is the number of character sets.
h 0 = c 0 a L 1 + c 1 a L 2 +... + cia L 1 i +... + c L 1 a 1 + c L a 0 0 h_0 = c_0 a^{ L1} + c_1 a^{L2} + ... + c_i a^{L1i} + ... + c_{L1} a^1 + c_L a^00 h0 = c0 aL 1+c1 aL 2+...+ci aL 1 i+...+cL 1 a1+cL a00
h 0 = i = 0 L 1 cia L 1 i h_0 =/sum_{i = 0}^{L1}{c_i a^{L1i}} h0 = i=0L 1 ci aL 1 i
Let s consider the window from
h 1 = (h 0 0 2 6 3) 26 + 4 2 6 0 h_1 = (h_00/times 26^3)/times 26 + 4/times 26^0 h1 =(h0 0 263) 26+4 260
Written as a general formula as shown below.
h 1 = (h 0 a c 0 a L) + c L + 1 h_1 = (h_0 ac_0 a^L) + c_(L + 1) h1 =(h0 a c0 aL)+cL +1
How to avoid overflow
a^L a**L may be a very large number, so you need to set an upper limit to avoid overflow. The upper limit of the value can be set by modulo, namely
Theoretically, modules should be a large number, but what should be the specific number? See this article for details . For this problem, $ 2^{31} $ is enough.
algorithm

Calculate substring
haystack.substring(0, L)withneedle.substring(0, L)The hash value. 
Traverse from the starting position: traverse from the first character to the NLth character.
 Calculate the rolling hash based on the previous hash value.
 If the hash value of the substring is equal to the hash value of the needle string, the starting position of the sliding window is returned.

1 is returned, and the needle string does not exist in the haystack string at this time.
achieve
class Solution {
//function to convert character to integer
public int charToInt ( int idx, String s) {
return ( int )s.charAt(idx)( int ) 'a' ;
}
public int strStr (String haystack, String needle) {
int L = needle.length(), n = haystack.length();
if (L> n) return  1 ;
//base value for the rolling hash function
int a = 26 ;
//modulus value for the rolling hash function to avoid overflow
long modulus = ( long )Math.pow( 2 , 31 );
//compute the hash of strings haystack[:L], needle[:L]
long h = 0 , ref_h = 0 ;
for ( int i = 0 ; i <L; ++i) {
h = (h * a + charToInt(i, haystack))% modulus;
ref_h = (ref_h * a + charToInt(i, needle))% modulus;
}
if (h == ref_h) return 0 ;
//const value to be used often: a**L% modulus
long aL = 1 ;
for ( int i = 1 ; i <= L; ++i) aL = (aL * a)% modulus;
for ( int start = 1 ; start <nL + 1 ; ++start) {
//compute rolling hash in O(1) time
h = (h * acharToInt(start 1 , haystack) * aL
+ charToInt(start + L 1 , haystack))% modulus;
if (h == ref_h) return start;
}
return  1 ;
}
}
Copy code
Complexity analysis
 Time complexity: O(N), it takes O(L) O ( L ) time to calculate the hash value of the needle string , and then it needs to be executed (NL)( N L ) cycles , and the calculation of each cycle is complicated Degree is constant.
 Space complexity: O(1).
Method 4: Sunday algorithm
1. Sunday matching mechanism
The matching mechanism is very easy to understand:
 Target stringString
 Pattern string Pattern
 Current query index idx(Initially 0)
 String to be matched str_cut:String [idx: idx + len(Pattern)]
Are each hit from the target string extracted to be matched string with the pattern string matching:

If it matches, return the current
idx 
Do not match, check to be matched string after a character
c: IfcExist inPatternIn, thenidx = idx + offset table [c]
 otherwise,idx = idx + len(pattern)
 If
Repeat Loop until
2. the offset table
Using the offset table is stored in each of the pattern string of characters that appear in the pattern string rightmost position +1 appears to distance the tail, e.g. aab:
 The offset of a is len(pattern)1 = 2
 The offset of b is len(pattern)2 = 1
 The others are len(pattern)+1 = 4
To summarize:
3. for example
String:
Pattern:
Step 1:
 idx = 0
 The string to be matched is:chec
 because chec != this
 So check checNext characterk
 kNot in pattern
 So look at the offset table,idx = idx + 5
Step 2:
 idx = 5
 The string to be matched is:this
 because this == this
 Matches, so return 5
4. algorithm analysis
Worst case: O(nm)
Average case: O(n)
5. the code
class Solution :
def strStr ( self , haystack : str , needle : str ) > int :
# Func : Calculate the offset table
def calShiftMat ( st ):
dic = ()
for i in range (len(st) 1,1,1):
if not dic. get (st[i]) :
dic[st[i]] = len(st)i
dic[ "ot" ] = len(st)+ 1
return dic
# Other situation judgment
if len (needle) > len (haystack) :return 1
if needle == "" : return 0
# Offset table preprocessing
dic = calShiftMat(needle)
idx = 0
while idx+len(needle) <= len(haystack):
# String to be matched
str_cut = haystack[idx:idx+len(needle)]
# Determine if it matches
if str_cut==needle:
return idx
else :
# Boundary processing
if idx+len(needle) >= len(haystack):
return  1
# In case of mismatch, move idx according to the offset of the next character
cur_c = haystack[idx+len(needle)]
if dic.get(cur_c):
idx += dic[cur_c]
else :
idx += dic[ "ot" ]
return  . 1 IF IDX + len (Needle)> = len (of haystack) the else IDX
duplicated code
Method five: KMP algorithm
The KMP algorithm (KnuthMorrisPratt algorithm) is a wellknown string matching algorithm, which is very efficient, but it is indeed a bit complicated.
Many readers complain that the KMP algorithm cannot be understood. This is normal. Thinking about the explanation of the KMP algorithm in university textbooks, I don't know how many future Knuth, Morris, and Pratt have been dismissed in advance. There are some excellent students who help understand the algorithm by pushing the KMP algorithm by hand. This is one way, but this article will help readers understand the principle of the algorithm from a logical level. Between ten lines of code, KMP was wiped out.
First agreed at the beginning, this article uses
Why I think the KMP algorithm is a dynamic programming problem, I will explain it later. For dynamic programming, we have repeatedly emphasized the need to be clear
The KMP algorithm that readers have seen should be a wave of weird operations
PS: The code in this article refers to "Algorithm 4", the name of the array used by the original code is
This article will use dynamic programming algorithm design techniques (inductive ideas), so I hope readers have read this article "The Longest Increasing Subsequence of Dynamic Programming Design", it is easy to understand.
1. Overview of KMP algorithm
First of all, let's briefly introduce the difference between the KMP algorithm and the brute force matching algorithm, where are the difficulties, and what is the relationship with dynamic programming.
The violent string matching algorithm is easy to write. Take a look at its operating logic:
//Brute force matching (pseudocode)
int search (String pat, String txt) {
int M = pat.length;
int N = txt.length;
for ( int i = 0 ; i <NM; i++) {
int j ;
for (j = 0 ; j <M; j++) {
if (pat[j] != txt[i+j])
break ;
}
//pat all match
if (j == M) return i;
}
//pat substring does not exist in txt
return  1 ;
}
Copy code
For brute force algorithms, if there are unmatched characters, fall back at the same time
For example, txt = "aaacaaab" pat = "aaab":
It is clear,
The difference of the KMP algorithm is that it will spend space to record some information, which will be very smart in the above situation:
Another example is the similar txt = "aaaaaaab" pat = "aaab", the violent solution will fall back the pointer as stupidly as the above example
Because the KMP algorithm knows that the character a before the character b is matched, it only needs to compare whether the character b is matched each time.
KMP algorithm never rolls back
The difficulty of the KMP algorithm is how to calculate
One more thing to be clear is: calculate this
Specifically, such as the two examples cited above:
= the txt1 "aaacaaab"
PAT = "AAAB"
txt2 = "aaaaaaab"
PAT = "AAAB"
copy the code
our
Just for
PS: this
And for
understood
public class KMP {
private int [][] dp;
private String pat;
public KMP (String pat) {
this .pat = pat;
//Constructing a dp array through pat
//It takes O(M) time
}
public int search (String txt) {
//Use dp array to match txt
//It takes O(N) time
}
}
Copy code
In this way, when we need to use the same
KMP = the KMP new new the KMP ( "AAAB" );
int POS1 = kmp.search ( "aaacaaab" ); //. 4
int POS2 = kmp.search ( "aaaaaaab" ); //. 4
copy the code
2. state machine overview
Why is the KMP algorithm related to the state machine? That's it, we can think
As shown in the figure above, the number in the circle is the state, state 0 is the starting state, and state 5 (
In addition, when in a different state,
What do you mean specifically, let's take a look at one by one. Use variables
If the character "A" is encountered, it is the smartest to move to state 3 according to the arrow indication:
If you encounter the character "B", according to the arrow, you can only transition to state 0 (return to before liberation overnight):
If the character "C" is encountered, according to the arrow, it should transition to the end state 5, which means that the match is complete:
Of course, other characters may be encountered, such as Z, but obviously it should be transferred to the starting state 0, because
For the sake of clarity here, when we draw the state diagram, we omit the arrow that transfers other characters to state 0, and only draw
The most critical step of the KMP algorithm is to construct this state transition diagram. To determine the behavior of state transition, two variables must be clarified, one is the current matching state, and the other is the character encountered ; after determining these two variables, you can know which state should be transitioned to in this case.
Let's take a look at the KMP algorithm matching strings according to this state transition diagram
Please remember this GIF matching process, this is the core logic of the KMP algorithm !
In order to describe the state transition diagram, we define a twodimensional dp array, its meaning is as follows:
dp[j][c] = next
0 <= j <M, representing the current state
0 <= c < 256 , representing the characters encountered (ASCII code)
0 <= next <= M, representing the next state
dp[ 4 ][ 'A' ] = 3 means:
The current state is 4 , if the character A is encountered,
pat should transition to state 3
dp[ 1 ][ 'B' ] = 2 means:
The current state is 1 , if the character B is encountered,
pat should transition to state 2
duplicated code
According to the definition of our dp array and the state transition process just now, we can first write the search function code of the KMP algorithm:
public int search (String txt) {
int M = pat.length();
int N = txt.length();
//the initial state of pat is 0
int j = 0 ;
for ( int i = 0 ; i <N; i++) {
//The current state is j, and the character txt[i] is encountered,
//which state should pat transition to?
j = dp[j][txt.charAt(i)];
//If the end state is reached, return the index at the beginning of the match
if (j == M) return iM + 1 ;
}
//Did not reach the termination state, the match failed
return  1 ;
}
Copy code
Up to this point, it should be easy to understand.
3. build a state transition diagram
Recall what I just said: To determine the behavior of state transition, two variables must be clarified, one is the current matching state, and the other is the character encountered , and we have determined it according to this logic.
for 0 <= j <M: # state
for 0 <= c < 256 : # character
dp[j][c] = next
Copy code
How should I ask for this next status? Obviously, if the characters encountered
If character
So, how do you know in which state to restart? Before answering this question, let's define another name: shadow state (the name I made up), using variables
Current state
For example, in the situation just now, if the status
Why is this all right? Because: since
Of course, if the character encountered is "B", the state
You may ask, this
In this way, we will refine the framework code just now:
int X # Shadow state
for 0 <= j <M:
for 0 <= c < 256 :
if c == pat[j]:
# State advancement
dp[j][c] = j + 1
else :
# State restart
# Commission X to calculate the restart position
dp[j][c] = dp[X][c]
Copy code
4. code implementation
If you can understand the previous content, congratulations, now there is one question left: shadow state
public class KMP {
private int [][] dp;
private String pat;
public KMP (String pat) {
this .pat = pat;
int M = pat.length();
//dp[state][char] = next state
dp = new int [M][ 256 ];
//base case
dp[ 0 ][pat.charAt( 0 )] = 1 ;
//The shadow state X is initially 0
int X = 0 ;
//The current state j starts from 1
for ( int j = 1 ; j <M; j++) {
for ( int c = 0 ; c < 256; c++) {
if (pat.charAt(j) == c)
dp[j][c] = j + 1 ;
else
dp[j][c] = dp[X][c];
}
//update shadow status
X = dp[X][pat.charAt(j)];
}
}
public int search (String txt) {...}
}
Copy code
First explain this line of code:
//base case
dp[ 0 ][pat.charAt( 0 )] = 1 ;
copy the code
This line of code is the base case. Only when the character pat[0] is encountered can the state transition from 0 to 1. If other characters are encountered, the state will remain at state 0 (the Java default initialization array is all 0).
Shadow state
int X = 0 ;
for ( int j = 1 ; j <M; j++) {
...
//Update the shadow state
//The current state is X, and the character pat[j] is encountered,
//Which state should pat transition to?
X = dp[X][pat.charAt(j)];
}
Copy code
Update
int j = 0 ;
for ( int i = 0 ; i <N; i++) {
//The current state is j, and the character txt[i] is encountered,
//which state should pat transition to?
j = dp[j][txt.charAt(i)];
...
}
Copy code
The principle is very subtle . Pay attention to the initial value of the variable in the for loop in the code. It can be understood like this: the latter is in
In addition, the construction of the dp array is based on the base case
Let's take a look at the complete construction process of the state transition diagram, you can understand the state
At this point, the core of the KMP algorithm is finally finished! Take a look at the complete code of the KMP algorithm:
public class KMP {
private int [][] dp;
private String pat;
public KMP (String pat) {
this .pat = pat;
int M = pat.length();
//dp[state][char] = next state
dp = new int [M][ 256 ];
//base case
dp[ 0 ][pat.charAt( 0 )] = 1 ;
//The shadow state X is initially 0
int X = 0 ;
//Construct a state transition graph (slightly modified and more compact)
for ( int j = 1 ; j <M; j++) {
for ( int c = 0 ; c <256 ; c++) {
dp[j][c] = dp[X][c];
dp[j][pat.charAt(j)] = j + 1 ;
//update the shadow state
X = dp[X][pat.charAt(j)];
}
}
public int search (String txt) {
int M = pat.length();
int N = txt.length();
//the initial state of pat is 0
int j = 0 ;
for ( int i = 0 ; i <N; i++) {
//Calculate the next state of pat
j = dp[j][txt.charAt(i)];
//Reach the end state, return the result
if (j == M) return iM + 1 ;
}
//Did not reach the termination state, the match failed
return  1 ;
}
}
Copy code
After the previous detailed examples, you should be able to understand the meaning of this code. Of course, you can also write the KMP algorithm as a function. The core code is the part of the for loop in the two functions. Is there more than ten lines in the count?
5. the final summary
The traditional KMP algorithm uses a onedimensional array
in
For a pattern string
After clarifying its meaning, you can easily write the code of the search function.
For how to build this
In the current state of the build
For the shadow state
The KMP algorithm is also a matter of dynamic programming, and it is based on a set of frameworks. It is nothing more than describing the logic of the problem and making it clear
.
in
For a pattern string
After clarifying its meaning, you can easily write the code of the search function.
For how to build this
In the current state of the build
For the shadow state
The KMP algorithm is also a matter of dynamic programming, and it is based on a set of frameworks. It is nothing more than describing the logic of the problem and making it clear