String Searching Algorithms (Part III)

Posted by potty on May 21, 2012 at 8:31 AM PDT

Boyer-Moore (BM)

Overview

According to Wikipedia, the Boyer-Moore algorithm "is an efficient string searching algorithm that is the standard benchmark for practical string search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the text does not persist across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocess step to skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm runs faster as the pattern length increases".



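The BM algorithm has two shifting functions: the bad character rule (occurrence shift) and the good suffix rule (matching shift):

  • The bad character rule applies when a text character fails to match the pattern. It shifts the pattern so that the mismatched text character lines up with its rightmost occurrence in the pattern.

    Case #1. The mismatched text character occurs in the pattern: shift the pattern to align its rightmost occurrence with that text character (or by one position if this would move the pattern backwards).

    Case #2. The mismatched text character does not occur in the pattern: shift the pattern completely past that text character.

  • The good suffix rule keeps information about how the pattern matches against shifts of itself. After a suffix of the pattern has matched the text, it shifts the pattern to the next position where that suffix occurs again in the pattern, or where a prefix of the pattern matches the end of that suffix. This information is used to avoid useless shifts of the pattern.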
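
The two bad character cases above can be sketched in Java as follows. This is only an illustration: the class `BadCharacterDemo` and its methods are hypothetical names, and the 256-entry table assumes that every character has a code below 256.

```java
import java.util.Arrays;

// Hypothetical demo class: illustrates only the bad character rule.
final class BadCharacterDemo {
    static final int BASE = 256; // assumption: all character codes are below 256

    // Rightmost index of each character in the pattern, or -1 if absent.
    static int[] rightmostOccurrence(String pattern) {
        int[] occurrence = new int[BASE];
        Arrays.fill(occurrence, -1);
        for (int j = 0; j < pattern.length(); j++) {
            occurrence[pattern.charAt(j)] = j;
        }
        return occurrence;
    }

    // Shift suggested by the rule after a mismatch at pattern index j
    // against the text character c. The shift is never smaller than 1.
    static int shift(int[] occurrence, int j, char c) {
        return Math.max(1, j - occurrence[c]);
    }

    public static void main(String[] args) {
        int[] occ = rightmostOccurrence("ipsum");
        // Case #1: 'p' occurs in the pattern at index 1; mismatch at j = 4.
        System.out.println(shift(occ, 4, 'p')); // prints 3
        // Case #2: 'x' does not occur in the pattern; mismatch at j = 4.
        System.out.println(shift(occ, 4, 'x')); // prints 5
        // Rightmost 'm' is at index 4, to the right of j = 1: shift by one.
        System.out.println(shift(occ, 1, 'm')); // prints 1
    }
}
```

In Case #2 the shift equals j + 1, which moves the whole pattern past the mismatched text character.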


In conclusion, the BM algorithm has the following characteristics:

  • The algorithm searches for a fixed pattern (exact matching), not for a class of patterns given by a pattern description.
  • The algorithm compares from right to left.
  • The algorithm preprocesses the pattern.
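  • The preprocessing phase time efficiency is Θ(m+σ), where m is the pattern length and σ is the alphabet size.
  • The searching phase time efficiency is O(m*n) in the worst case, where n is the text length; in the best case it is O(n/m).
  • The algorithm performs at most 3n text character comparisons in the worst case when searching for a non-periodic pattern.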
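
To get a feeling for the gap between the best and the worst case, the following sketch counts character comparisons. It is a simplified variant that uses only the bad character rule, so the 3n bound does not apply to it; the class name `ComparisonCounter`, the method `countComparisons` and the sample inputs are assumptions made for this illustration.

```java
import java.util.Arrays;

// Hypothetical helper: counts the character comparisons made by a search
// that uses only the bad character rule (not the full Boyer-Moore).
final class ComparisonCounter {
    static long countComparisons(String pattern, String text) {
        int[] occurrence = new int[256]; // assumption: character codes below 256
        Arrays.fill(occurrence, -1);
        for (int j = 0; j < pattern.length(); j++) {
            occurrence[pattern.charAt(j)] = j;
        }
        int n = text.length();
        int m = pattern.length();
        long comparisons = 0;
        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0) { // compare from right to left
                comparisons++;
                if (pattern.charAt(j) != text.charAt(i + j)) break;
                j--;
            }
            // After a full match (j < 0) advance by one to keep scanning.
            i += (j < 0) ? 1 : Math.max(1, j - occurrence[text.charAt(i + j)]);
        }
        return comparisons;
    }

    public static void main(String[] args) {
        char[] chars = new char[1000];
        Arrays.fill(chars, 'a');
        String text = new String(chars); // 1000 times 'a'
        // Favorable: 'a' never occurs in the pattern, so every mismatch
        // skips m = 10 characters.
        System.out.println(countComparisons("bbbbbbbbbb", text)); // prints 100
        // Unfavorable: the mismatch only shows up at the leftmost pattern
        // character, and the bad character rule can only shift by one.
        System.out.println(countComparisons("baaaaaaaaa", text)); // prints 9910
    }
}
```

The first call needs about n/m comparisons, the second about m*n. Avoiding this kind of quadratic behavior on mismatching inputs is what the good suffix rule is for.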

Advantages

  • It is among the most efficient algorithms in practice for searching for a pattern inside a string, especially when the pattern is long.
  • It is used in many search and replace operations in text editors.

Disadvantages

  • Not simple.
  • The preprocessing for the good suffix rule is difficult to understand and implement.
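
To make the good suffix preprocessing a little less opaque, here is a minimal sketch of one common way to build the shift table. The names `GoodSuffixSketch`, `suffixes` and `goodSuffixShifts` are hypothetical, and the suffix lengths are computed naively in O(m^2) time for readability; a linear-time computation exists but is harder to follow.

```java
import java.util.Arrays;

// Hypothetical sketch of the good suffix preprocessing; names are illustrative.
final class GoodSuffixSketch {

    // suff[i] = length of the longest common suffix of pattern[0..i] and
    // the whole pattern. Computed naively for readability.
    static int[] suffixes(String pattern) {
        int m = pattern.length();
        int[] suff = new int[m];
        for (int i = 0; i < m; i++) {
            int k = 0;
            while (k <= i && pattern.charAt(i - k) == pattern.charAt(m - 1 - k)) k++;
            suff[i] = k;
        }
        return suff;
    }

    // shift[j] = how far to move the pattern when a mismatch occurs at
    // index j after pattern[j+1..m-1] has matched the text.
    static int[] goodSuffixShifts(String pattern) {
        int m = pattern.length();
        int[] suff = suffixes(pattern);
        int[] shift = new int[m];
        Arrays.fill(shift, m); // default: move the pattern past the window

        // Only a prefix of the pattern matches the end of the matched suffix.
        int j = 0;
        for (int i = m - 1; i >= 0; i--) {
            if (suff[i] == i + 1) { // pattern[0..i] is also a suffix of the pattern
                for (; j < m - 1 - i; j++) {
                    if (shift[j] == m) shift[j] = m - 1 - i;
                }
            }
        }
        // The matched suffix occurs again further left in the pattern.
        for (int i = 0; i <= m - 2; i++) {
            shift[m - 1 - suff[i]] = m - 1 - i;
        }
        return shift;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(goodSuffixShifts("GCAGAGAG")));
        // prints [7, 7, 7, 2, 7, 4, 7, 1]
    }
}
```

During the search, a full implementation would shift by the larger of the good suffix value for the mismatch index and the bad character shift.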

Java Code

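public class BoyerMoore {
    // Alphabet size; the pattern may only contain characters below this value.
    private static final int BASE = 256;
    // occurrence[c] is the rightmost index of c in the pattern, or -1 if absent.
    private final int[] occurrence;
    private final String pattern;

    public BoyerMoore(String pattern) {
        this.pattern = pattern;

        occurrence = new int[BASE];
        for (int c = 0; c < BASE; c++)
            occurrence[c] = -1;
        for (int j = 0; j < pattern.length(); j++) {
            char c = pattern.charAt(j);
            if (c >= BASE)
                throw new IllegalArgumentException("Pattern character out of range: " + c);
            occurrence[c] = j;
        }
    }

    // Returns the index of the first occurrence of the pattern in the text,
    // or text.length() if there is no match. Only the bad character rule is used.
    public int search(String text) {
        int n = text.length();
        int m = pattern.length();
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            // Compare the pattern against the text from right to left.
            for (int j = m - 1; j >= 0; j--) {
                char c = text.charAt(i + j);
                if (pattern.charAt(j) != c) {
                    // Text characters outside the table cannot occur in the pattern.
                    int last = (c < BASE) ? occurrence[c] : -1;
                    skip = Math.max(1, j - last);
                    break;
                }
            }
            if (skip == 0) return i; // every character matched
        }
        return n; // not found
    }
}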


public class Test {

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet";
        String pattern = "ipsum";
        BoyerMoore bm = new BoyerMoore(pattern);

        // search() returns text.length() when the pattern does not occur.
        int firstOccurPosition = bm.search(text);
        if (firstOccurPosition < text.length())
            System.out.println("The text '" + pattern + "' is first found at position "
                    + firstOccurPosition + ".");
        else
            System.out.println("The text '" + pattern + "' was not found.");
    }

}

References

  1. Charras, Christian & Lecroq, Thierry. Exact String Matching Algorithms. Rouen University, France. 1997.
  2. Wikipedia, the free encyclopedia (User: Watcher). Boyer-Moore string search algorithm. May 28, 2004.
  3. Bustos, Benjamín. Búsqueda de Texto. Chile. 2004.

Attachment          Size
boyermoore.zip      3.8 KB
boyermoore.png      8.56 KB
bm_goodshift.png    3.31 KB
bm_badshift1.png    4.35 KB
bm_badshift2.png    3.42 KB

Comments

Hello,
Is there a way to make the search case-insensitive?

Interesting post. Great !

This is very interesting and informational. Added to my bookmark. Thanks!