ACM
April 12, 2021

KMP

KMP

问题引入

给定 2 个字符串 text 和 pattern,在 text 中寻找是否存在 pattern 这个子串(连续)

传统方法 · 暴力

做法

从第一位开始,对两个字符串进行逐位比对, 不相等则从下一位开始逐位比对

复杂度 · O(mn)

简介

KMP 算法是一种改进的字符串匹配算法,由 D.E.Knuth,J.H.Morris 和 V.R.Pratt 提出的,因此人们称它为克努特—莫里斯—普拉特操作(简称 KMP 算法)。KMP 算法的核心是利用匹配失败后的信息,尽量减少模式串与主串的匹配次数以达到快速匹配的目的。具体实现就是通过一个 next() 函数实现,函数本身包含了模式串的局部匹配信息。

KMP 算法是三位学者在 Brute-Force 算法的基础上同时提出的模式匹配的改进算法。Brute- Force 算法在模式串中有多个字符和主串中的若干个连续字符比较都相等,但最后一个字符比较不相等时,主串的比较位置需要回退。KMP 算法在上述情况下,主串位置不需要回退,从而可以大大提高效率

思想

不难发现,暴力的方法进行了大量的无效比对。如以下这个例子:text: “aaaaaaaaab”, pattern: “aaaab”
每次都对比到第 n 位才发现不相等,后移一 位后又从第一位开始比对,因此可以创造 一种算法,先对 pattern 字符串进行预处理,标记上前缀与后缀相等的长度,之后再进行比对时,当比对到不相等的位置时,由于已知前缀与后缀相等的长度,因此可以跳过这一段,比对之后的串即可。

例子中, 当比对第 1 位时,

pos 1 2 3 4 5 6 7 8 9 10
text a a a a a a a a a b
pattern a a a a b

前 4 位都相等,到了第 5 位时不相等,而对 pattern 串,除去不相等的第 5 位,前 4 位中的前缀与后缀的相等长度为 2(1, 2与 3, 4相等),而经过比比对,3, 4 位已经与 text 对应相等,因此可以把第 1, 2 位直接移到此位置即可,减少无效比对。

pos 1 2 3 4 5 6 7 8 9 10
text a a a a a a a a a b
pattern a a a a b

做法

  1. 对 pattern 进行预处理得到 pattern 的前缀表 prefix table

    逐位判断前后缀相等长度,保存在 prefix 数组

    将 prefix 数组整体后移 1 位,prefix[0] = -1

  2. KMP
    两串相等继续;不相等则将 pattern 的指针 j 指向 pattern[j]

复杂度 · O(m + n)

模板

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
#define _CRTSECURE_NOWARNINGS
#pragma warning(disable:4996)
#include<cstdio>
#include<cstring>
#include<string>
#include<cmath>
#include<cctype>
#include<algorithm>
#include<iostream>
#include<queue>
#include<stack>
#include<vector>
#include<map>
#include<set>
#include<list>
using namespace std;

const int MAX_LEN = 10000017;

int lenpattern, lentext;
char pattern[MAX_LEN];
char text[MAX_LEN];
int prefix[MAX_LEN];

void prefix_table()
{
prefix[0] = 0;
int len = 0;
int i = 1;
while (i < lenpattern)
{
if (pattern[i] == pattern[len])
{
len++;
prefix[i] = len;
i++;
}
else
{
if (len > 0) len = prefix[len - 1];
else
{
prefix[i] = len;
i++;
}
}
}
}

void move_prefix_table()
{
for (int i = lenpattern - 1; i > 0; i--) prefix[i] = prefix[i - 1];
prefix[0] = -1;
}

void kmp_search()
{
prefix_table();
move_prefix_table();
int i = 0, j = 0;
while (i < lentext)
{
if (j == lenpattern - 1 && text[i] == pattern[j])
{
printf("Found pattern at %d\n", i - j);
j = prefix[j];
if (j == -1)
{
i++;
j++;
}
}
else
{
if (text[i] == pattern[j])
{
i++;
j++;
}
else
{
j = prefix[j];
if (j == -1)
{
i++;
j++;
}
}
}
}
}

int main()
{
scanf("%s", text);
scanf("%s", pattern);
lentext = strlen(text);
lenpattern = strlen(pattern);

kmp_search();


return 0;

}

text: 目标字符串

pattern: 匹配字符串

101466E · Text Editor

time limit per test: 1 second

memory limit per test: 512 megabytes

input: standard input

output: standard output

Description

One of the most useful tools nowadays are text editors, their use is so important that the Unique Natural Advanced Language (UNAL) organization has studied many of the benefits working with them.

They are interested specifically in the feature “find”, that option looks when a pattern occurs in a text, furthermore, it counts the number of times the pattern occurs in a text. The tool is so well designed that while writing each character of the pattern it updates the number of times that the corresponding prefix of the total pattern appears on the text.

Now the UNAL is working with the editor, finding patterns in some texts, however, they realize that many of the patterns appear just very few times in the corresponding texts, as they really want to see more number of appearances of the patterns in the texts, they put a lower bound on the minimum number of times the pattern should be found in the text and use only prefixes of the original pattern. On the other hand, the UNAL is very picky about language, so they will just use the largest non-empty prefix of the original pattern that fit into the bound.

Input

The first line contains the text A (1 ≤ |A| ≤  105) The second line contains the original pattern B (1 ≤ |B| ≤  |A|) The third line contains an integer n (1 ≤ n ≤  |A|) - the minimum number of times a pattern should be found on the text.

Output

A single line, with the prefix of the original pattern used by the UNAL, if there is no such prefix then print “IMPOSSIBLE” (without the quotes)

Examples

Input 1

1
2
3
aaaaa
aaa
4

Output 1

1
aa

Input 2

1
2
3
programming
unal
1

Output 2

1
IMPOSSIBLE

Input 3

1
2
3
abracadabra
abra
1

Output 3

1
abra

Input 4

1
2
3
Hello World!
H W
5

Output 4

1
IMPOSSIBLE

Solution

把 B 字符串对 A 字符串进行 KMP。

但是如果单纯的从长到短 KMP 的话复杂度为 O(n2),因此要对长度进行二分。

Accepted Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
#define _CRTSECURE_NOWARNINGS
#pragma warning(disable:4996)
#include<cstdio>
#include<cstring>
#include<string>
#include<cmath>
#include<cctype>
#include<algorithm>
#include<iostream>
#include<queue>
#include<stack>
#include<vector>
#include<map>
#include<set>
#include<list>
using namespace std;
const int MAX_LEN = 10000017;

int n;
int lenpattern, lentext;
char pattern[MAX_LEN];
char text[MAX_LEN];
int prefix[MAX_LEN];

void prefix_table()
{
prefix[0] = 0;
int len = 0;
int i = 1;
while (i < lenpattern)
{
if (pattern[i] == pattern[len])
{
len++;
prefix[i] = len;
i++;
}
else
{
if (len > 0) len = prefix[len - 1];
else
{
prefix[i] = len;
i++;
}
}
}
}

void move_prefix_table()
{
for (int i = lenpattern; i > 0; i--) prefix[i] = prefix[i - 1];
prefix[0] = -1;
}

int kmp_search(int x)
{
int cnt = 0;
int i = 0, j = 0;
while (i < lentext)
{
if (j == x - 1 && text[i] == pattern[j])
{
cnt++;
j = prefix[j];
if (j == -1)
{
i++;
j++;
}
}
else
{
if (text[i] == pattern[j])
{
i++;
j++;
}
else
{
j = prefix[j];
if (j == -1)
{
i++;
j++;
}
}
}

}
return cnt;
}

int main()
{
freopen("in.txt", "r", stdin);

gets_s(text);
gets_s(pattern);
scanf("%d", &n);
lenpattern = strlen(pattern);
lentext = strlen(text);

prefix_table();
move_prefix_table();

int l = 0, r = lenpattern, mid, ans = 0;
while (l <= r)
{
int mid = (l + r) / 2;
if (kmp_search(mid) >= n)
{
ans = mid;
l = mid + 1;
}
else r = mid - 1;
}
if (ans)
{
for (int i = 0; i < ans; i++) printf("%c", pattern[i]);
}
else printf("IMPOSSIBLE");


return 0;

}

About this Post

This post is written by OwlllOvO, licensed under CC BY-NC 4.0.

#C++#KMP