KMP
问题引入
给定 2 个字符串 text 和 pattern,在 text 中寻找是否存在 pattern 这个子串(连续)
传统方法 · 暴力
做法
从第一位开始,对两个字符串进行逐位比对, 不相等则从下一位开始逐位比对
复杂度 · O(mn)
简介
KMP 算法是一种改进的字符串匹配算法,由 D.E.Knuth,J.H.Morris 和 V.R.Pratt 提出的,因此人们称它为克努特—莫里斯—普拉特操作(简称 KMP 算法)。KMP 算法的核心是利用匹配失败后的信息,尽量减少模式串与主串的匹配次数以达到快速匹配的目的。具体实现就是通过一个 next() 函数实现,函数本身包含了模式串的局部匹配信息。
KMP 算法是三位学者在 Brute-Force 算法的基础上同时提出的模式匹配的改进算法。Brute- Force 算法在模式串中有多个字符和主串中的若干个连续字符比较都相等,但最后一个字符比较不相等时,主串的比较位置需要回退。KMP 算法在上述情况下,主串位置不需要回退,从而可以大大提高效率
思想
不难发现,暴力的方法进行了大量的无效比对。如以下这个例子:text: “aaaaaaaaab”, pattern: “aaaab”
每次都对比到第 n 位才发现不相等,后移一 位后又从第一位开始比对,因此可以创造 一种算法,先对 pattern 字符串进行预处理,标记上前缀与后缀相等的长度,之后再进行比对时,当比对到不相等的位置时,由于已知前缀与后缀相等的长度,因此可以跳过这一段,比对之后的串即可。
例子中, 当比对第 1 位时,
| pos | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 
|---|---|---|---|---|---|---|---|---|---|---|
| text | a | a | a | a | a | a | a | a | a | b | 
| pattern | a | a | a | a | b | 
前 4 位都相等,到了第 5 位时不相等,而对 pattern 串,除去不相等的第 5 位,前 4 位中的前缀与后缀的相等长度为 2(1, 2与 3, 4相等),而经过比比对,3, 4 位已经与 text 对应相等,因此可以把第 1, 2 位直接移到此位置即可,减少无效比对。
| pos | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 
|---|---|---|---|---|---|---|---|---|---|---|
| text | a | a | a | a | a | a | a | a | a | b | 
| pattern | a | a | a | a | b | 
做法
- 对 pattern 进行预处理得到 pattern 的前缀表 prefix table - 逐位判断前后缀相等长度,保存在 prefix 数组 - 将 prefix 数组整体后移 1 位,prefix[0] = -1 
- KMP 
 两串相等继续;不相等则将 pattern 的指针 j 指向 pattern[j]
复杂度 · O(m + n)
模板
| 1 | 
 | 
text: 目标字符串
pattern: 匹配字符串
101466E · Text Editor
time limit per test: 1 second
memory limit per test: 512 megabytes
input: standard input
output: standard output
Description
One of the most useful tools nowadays are text editors, their use is so important that the Unique Natural Advanced Language (UNAL) organization has studied many of the benefits working with them.
They are interested specifically in the feature “find”, that option looks when a pattern occurs in a text, furthermore, it counts the number of times the pattern occurs in a text. The tool is so well designed that while writing each character of the pattern it updates the number of times that the corresponding prefix of the total pattern appears on the text.
Now the UNAL is working with the editor, finding patterns in some texts, however, they realize that many of the patterns appear just very few times in the corresponding texts, as they really want to see more number of appearances of the patterns in the texts, they put a lower bound on the minimum number of times the pattern should be found in the text and use only prefixes of the original pattern. On the other hand, the UNAL is very picky about language, so they will just use the largest non-empty prefix of the original pattern that fit into the bound.
Input
The first line contains the text A (1 ≤ |A| ≤ 105) The second line contains the original pattern B (1 ≤ |B| ≤ |A|) The third line contains an integer n (1 ≤ n ≤ |A|) - the minimum number of times a pattern should be found on the text.
Output
A single line, with the prefix of the original pattern used by the UNAL, if there is no such prefix then print “IMPOSSIBLE” (without the quotes)
Examples
Input 1
| 1 | aaaaa | 
Output 1
| 1 | aa | 
Input 2
| 1 | programming | 
Output 2
| 1 | IMPOSSIBLE | 
Input 3
| 1 | abracadabra | 
Output 3
| 1 | abra | 
Input 4
| 1 | Hello World! | 
Output 4
| 1 | IMPOSSIBLE | 
Solution
把 B 字符串对 A 字符串进行 KMP。
但是如果单纯的从长到短 KMP 的话复杂度为 O(n2),因此要对长度进行二分。
Accepted Code
| 1 | 
 | 
About this Post
This post is written by OwlllOvO, licensed under CC BY-NC 4.0.