Regular Expression Matching Task
A Regular Expression Matching Task is a Pattern Matching Task that finds match relations between regular expressions.
- AKA: Regular Expression Matching.
- Context:
- It can be solved by a Regular Expression Matching System by implementing a Regular Expression Matching Algorithm.
- Example(s):
- an Erlang PCRE Regular Expression Marching Task [1],
- a Trigram Index Regular Expression Matching (RE2) Task (Cox, 2012),
- a POSIX Regular Expression Matching Task [2]
- a Ternary Content Addressable Memory (TCAM) Based Regular Expression Matching Task (Meiners et al., 2010)
- a Sparse Regular Expression Matching Task (Bille & Gortz, 2019).
- …
- Counter-Example(s):
- See: Natural Language Processing Task, Pattern Matching System, Pattern Matching Algorithm, String Matching Algorithm.
References
2019a
- (Bille & Gortz, 2019) ⇒ Philip Bille, and Inge Li Gortz. (2019). “Sparse Regular Expression Matching.” In: arXiv:1907.04752.
- QUOTE: A regular expression [math]\displaystyle{ R }[/math] specifies a set of strings formed by characters from an alphabet [math]\displaystyle{ \Sigma }[/math] combined with concatenation ([math]\displaystyle{ \odot }[/math]), union ([math]\displaystyle{ | }[/math]), and Kleene star ([math]\displaystyle{ \star }[/math]) operators. For example, [math]\displaystyle{ (a|(b \odot a))^{\star} }[/math] describes the set of strings of as and bs such that every b is followed by an a (by convention will we often omit the [math]\displaystyle{ \odot }[/math] in the rest of the paper).
2019b
- (PostgreSQL Doc., 2019) ⇒ https://www.postgresql.org/docs/9.3/functions-matching.html Retrieved:2019-09-17
- QUOTE: Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep), and basic REs or BREs (roughly those of ed). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
A regular expression is defined as one or more branches, separated by |. It matches anything that matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom(...)
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier.
- QUOTE: Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep), and basic REs or BREs (roughly those of ed). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
2012
- (Cox, 2012) ⇒ Russ Cox (January, 2012). "Regular Expression Matching with a Trigram Index or How Google Code Search Worked"
- QUOTE: Google open sourced the regular expression engine I wrote for Code Search, RE2, in March 2010. Code Search and RE2 have been a great vehicle for educating people about how to do regular expression search safely. In fact, Tom Christiansen recently told me that even people in the Perl community use it (
perl -Mre::engine::RE2
), to run regexp search engines (the real kind) on the web without opening themselves up to trivial denial of service attacks(...) Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.
- QUOTE: Google open sourced the regular expression engine I wrote for Code Search, RE2, in March 2010. Code Search and RE2 have been a great vehicle for educating people about how to do regular expression search safely. In fact, Tom Christiansen recently told me that even people in the Perl community use it (
2010
- (Meiners et al., 2010) ⇒ Chad R. Meiners, Jignesh Patel, Eric Norige, Eric Torng, and Alex X. Liu. (2010). “Fast Regular Expression Matching Using Small TCAMs for Network Intrusion Detection and Prevention Systems.” In: Proceedings of the 19th USENIX conference on Security. ISBN:888-7-6666-5555-4
- QUOTE: Deep packet inspection is a key part of many networking devices on the Internet such as Network Intrusion Detection (or Prevention) Systems (NIDS/NIPS), firewalls, and layer 7 switches. In the past, deep packet inspection typically used string matching as a core operator, namely examining whether a packet’s payload matches any of a set of predefined strings. Today, deep packet inspection typically uses regular expression (RE) matching as a core operator, namely examining whether a packet’s payload matches any of a set of predefined regular expressions, because REs are fundamentally more expressive, efficient, and flexible in specifying attack signatures (...)
To address the limitations of prior art on high-speed RE matching, we propose the first Ternary Content Addressable Memory (TCAM) based RE matching solution.
- QUOTE: Deep packet inspection is a key part of many networking devices on the Internet such as Network Intrusion Detection (or Prevention) Systems (NIDS/NIPS), firewalls, and layer 7 switches. In the past, deep packet inspection typically used string matching as a core operator, namely examining whether a packet’s payload matches any of a set of predefined strings. Today, deep packet inspection typically uses regular expression (RE) matching as a core operator, namely examining whether a packet’s payload matches any of a set of predefined regular expressions, because REs are fundamentally more expressive, efficient, and flexible in specifying attack signatures (...)
2002
- (Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
- QUOTE: Definition A regular expression RE is a string on the set of symbol [math]\displaystyle{ \Sigma \cup \{\varepsilon, |, \cdot, *, (,),\} }[/math], which recursively defined as the empty character [math]\displaystyle{ \varepsilon }[/math], a character [math]\displaystyle{ \alpha \in \Sigma }[/math]; and [math]\displaystyle{ (RE_1), \; (RE_1\cdot RE_2),\;(RE_1| RE_2) }[/math], where [math]\displaystyle{ RE_1 }[/math], and [math]\displaystyle{ R|E_2 }[/math] are regular expressions.