字符串匹配算法 | 小熊の小站

0 概述

本实验使用C++语言实现了Sunday和KMP两种字符串匹配算法。

1 实验设计

1.1 算法实现原理

相比朴素的暴力匹配算法，Sunday算法和KMP算法的性能提升有很大的提升，是一种无回溯的算法。

不会反复比较已经比较过的内容，只会比较未比较过的内容，这样可以大大提高算法的效率。

1.2 算法设计与分析

Sunday

通过创建shift数组，计算匹配串中出现的字母距离模式串最后的位置，即是匹配时往后移动的距离。如果没有出现，则默认移动模式串的长度。

int shift[130]; // shift[i]表示i字符距离最后的长度，即需要往后移动的距离
for (i = 0; i <= 128; i++) {
    shift[i] = m + 1;
}
for (i = 0; i < m; i++) {
    shift[pattern[i]] = m - i;
}

在shift数组计算后，即可开始匹配过程。

如果发生失配，则往后移动下一个字母在模式串出现的距离最后的距离（shift已计算）。

如果下一个字母在模式串中没有出现，则移动模式串整体的长度。

一直匹配，直到模式串与匹配串中的子串完全匹配。

int sunday(char *str, char *pattern) {
    int n = strlen(str), m = strlen(pattern);
    int shift[130]; // shift[i]表示i字符距离最后的长度，即需要往后移动的距离
    int i, j;
    for (i = 0; i <= 128; i++) {
        shift[i] = m + 1;
    }
    for (i = 0; i < m; i++) {
        shift[pattern[i]] = m - i;
    }
    int s = 0;
    while (s <= n - m) {
        j = 0;
        while (str[s + j] == pattern[j]) {
            j++;
            // 匹配成功
            if (j >= m) {
                return s;
            }
        }
        // 后移
        s += shift[str[s + m]];
    }
    return -1;
}

KMP

创建next数组，计算公共前缀长度。next数组仅与模式串有关，与匹配串无关。

int next[m];
next[0] = -1;
for (i = 1; i < m; i++) {
    j = next[i - 1];
    while (j != -1 && pattern[j + 1] != pattern[i]) {
        j = next[j];
    }
    if (pattern[j + 1] == pattern[i]) {
        next[i] = j + 1;
    } else {
        next[i] = -1;
    }
}

前缀函数

举例来说，对于字符串 abcabcd

next[0] = -1，因为 a 没有真前缀和真后缀，根据规定为 0

next[1] = -1，因为 ab 无相等的真前缀和真后缀

next[2] = -1，因为 abc 无相等的真前缀和真后缀

next[3] = 0，因为 abca 只有一对相等的真前缀和真后缀：a，长度为 1

next[4] = 1，因为 abcab 相等的真前缀和真后缀只有 ab，长度为 2

next[5] = 2，因为 abcabc 相等的真前缀和真后缀只有 abc，长度为 3

next[6] = -1，因为 abcabcd 无相等的真前缀和真后缀

同理可以计算字符串 aabaaab 的next函数为 [-1, 0, -1, 0, 1, 1, 2]。

因为next函数代表往后移动的距离，所以是前缀函数的值 - 1

在next函数计算完成后，即可开始匹配。匹配方式与Sunday算法类似，只是在失配时，需要往后移动next函数中的值。

int kmp(char *str, char *pattern) {
    int n = strlen(str), m = strlen(pattern);
    int i, j;
    int next[m];
    // 计算next函数
    next[0] = -1;
    for (i = 1; i < m; i++) {
        j = next[i - 1];
        while (j != -1 && pattern[j + 1] != pattern[i]) {
            j = next[j];
        }
        if (pattern[j + 1] == pattern[i]) {
            next[i] = j + 1;
        } else {
            next[i] = -1;
        }
    }
    int s = 0;
    while (s <= n - m) {
        j = 0;
        while (str[s + j] == pattern[j]) {
            j++;
            // 匹配成功
            if (j >= m) {
                return s;
            }
        }
        // 后移
        if (next[j] == -1) {
            s += 1;
        } else {
            s += next[j];
        }
    }
    return -1;
}

KMP算法在模式串中有重复的序列时，节省时间的效果更加显著。

将Sunday和KMP算法进行结合，可以获得更好的效果。

2 实验结果说明与分析

输入两行，分别是匹配串与模式串。

1 2	ababac abab

输出两行，分别是Sunday和KMP算法的匹配结果。返回值是模式车在匹配串中第一次出现的位置。-1表示没有匹配。

1
2

0
0

代码在homework4/stringMatch.cpp。

代码开源在：https://gitee.com/littlebear0729/data-structure-and-algorithm/tree/master/homework4