Aho-Corasick String Matching Algorithm

An Aho-Corasick String Matching Algorithm is a Finite-Pattern Set Exact String Matching Algorithm that first builds a state machine from the lexical patterns (in the dictionary) and then searches the string.

AKA: Aho–Corasick Algorithm.
Context:
- It matches all patterns simultaneously.
- …
Counter-Example(s):
- a Rabin-Karp String Matching Algorithm.
See: Boyer-Moore String Search Algorithm, Trie Data Structure, Grep, Finite-State Machine, String-Searching Algorithm.

References

2023

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Aho–Corasick_algorithm Retrieved:2023-5-18.
- In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = , , , and input string is ).
  Informally, the algorithm constructs a finite-state machine that resembles a trie with additional links between the various internal nodes. These extra internal links allow fast transitions between failed string matches (e.g. a search for in a trie that does not contain , but contains , and thus would fail at the node prefixed by ), to other branches of the trie that share a common prefix (e.g., in the previous case, a branch for might be the best lateral transition). This allows the automaton to transition between string matches without the need for backtracking.
  When the string dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.
  The Aho–Corasick string-matching algorithm formed the basis of the original Unix command fgrep.

2012

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm
- QUOTE: … In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).

2010

http://search.cpan.org/~vbar/Algorithm-AhoCorasick-0.03/lib/Algorithm/AhoCorasick.pm
- Aho-Corasick is a classic (1975) algorithm for locating elements of a finite set of strings within an input text. It constructs a finite state machine from a list of keywords, then uses the machine to locate all occurrences of the keywords. Construction of the machine takes time proportional to the sum of the lengths of the keywords and the machine processes the input string in a single pass - that is, the algorithm may be considerably more efficient than searching for each keyword separately.

2005

http://www.codeproject.com/Articles/12383/Aho-Corasick-string-matching-in-C
- QUOTE: The algorithm consists of two parts. The first part is the building of the tree from keywords you want to search for, and the second part is searching the text for the keywords using the previously built tree (state machine). Searching for a keyword is very efficient, because it only moves through the states in the state machine. If a character is matching, it follows goto function otherwise it follows fail function.

1975

(Aho & Corasick, 1975) ⇒ Alfred V. Aho, and Margaret J. Corasick. (1975). “Efficient String Matching: An aid to bibliographic search.” In: Communications of the ACM, 18(6). doi:10.1145/360825.360855
- QUOTE: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
- Keywords and Phrases: keywords and phrases, string pattern matching, bibliographic search, information retrieval, text-editing, finite state machines, computational complexity.

Aho-Corasick String Matching Algorithm

References

2023

2012

2010

2005

1975

Navigation menu

Search