Home

GEONius.com
5-Jul-2021
E-mail

rex_util - Regular Expression Utilities

The REX_UTIL package provides simple, regular expression (RE) pattern matching and substitution capabilities. This package is similar to the regcmp(3) package (from the System V Programmers Workbench), but regcmp(3) isn't available on all systems.

And in 2021: I originally wrote these utilities back in late 1989 or early 1990. The original syntax for regular expression matching and replacement was influenced by my experience with lex(1) and yacc(1) and by examination of other implementations.

30 years later, a few years of good experience with the Sigil EPUB editor's regular expression find and replace features motivated me to finally upgrade this package. First of all, I made the parser fully reentrant (thanks to GNU's Bison) and the matching and replacement functions fully reentrant.

The original search function to match a regular expression against text, rexSearch(), was intentionally written in two versions: (i) a fully recursive function implemented as a model for (ii) an iterative function that emulated the recursive function with a manually allocated and managed stack. For 30 years, I've always used the iterative version, although I don't remember if I ever compared the performance of the two versions. The code for the iterative version is understandable after some study, but it is unwieldy. In the midst of making these new changes, I switched to the recursive version. I eliminated tail recursion (which had been eliminated in the original iterative version from Day 1), making rexSearch() a hybrid function with both recursion and iteration.

The following changes were made to the regular expression syntax:

and to the substitution syntax:

Subexpressions, subpatterns, capture groups, capturing groups, back references, back-references, and backreferences ... what's in a name? (I am considering adding back-references to the package. Or maybe not.)

A kind of small achievement for me was that the REX_UTIL package now (i) successfully compiles Cal Henderson's PCRE 23k+ regular expression for matching RFC 822 email addresses and (ii) successfully parses my email address! His regular expression has 1,413 non-capture groups and 8 actual capture groups; capture groups 1 and 2 capture my username and capture groups 5 and 6 capture the domain name. With that many groups of both kinds, the regular expression is obviously complex — my compiled RE has 6,554 states. However, the regular expression features needed and used are the basic closures, intervals (curly braces), and character classes (square brackets); the most advanced feature is non-capture groups. Upon reflection, my old REX_UTIL package could have compiled the full expression with some minor changes to the RE. Parenthesized groups were, by definition, non-capture groups; a real capture group had to be explicitly labeled by a suffix identifying the group number (1..9), "(...)$n". So, take Cal Henderson's regular expression and (i) search for each of the 8 capture groups (whose left parenthesis is not followed by "?:") and add "$n" to the closing parenthesis and (ii) change all occurrences of "(?:" to "(". (As the author notes in his text file, RFC 822 was superseded by RFC 2822. RFC 2822 was, in turn, superseded by RFC 5322.)

A more recent NOTE (2015): I wrote my regular expression package back in 1990, I think. We were building a distributed satellite control center (TPOCC) for NASA using 3 platforms: SunOS, HP/UX, and VxWorks. I believe the SunOS RE functions only allowed a single RE at a time; HP/UX had regcmp(3) which allowed multiple compiled REs at one time, although the regex() function has a global variable; VxWorks had no RE functions in its libraries at that time. Consequently, I wrote a single package that worked on all 3 systems and allowed you to compile and use multiple REs simultaneously.

Back then, I was heavily influenced by regular expressions in the lex(1) Lexer Generator and in HP/UX's regcmp(3) functions. Thus the differences between my REs and modern REs. In particular, parentheses, not escaped parentheses, are used to group REs: "(RE1|RE2)" instead of "\(RE1|RE2\)". Matched subexpressions (SE) are numbered specifically, "(...)$2", rather than by the implicit count of the parenthetical groups, "\(SE1\)REn\(SE2\)". Lastly, I don't know how it happened, but I really goofed with the character classes. For example, in my package, character class ":alnum:" is inserted as follows into a bracket expression, "[:alnum:]". The correct syntax for the character class is "[:alnum:]", which then produces a bracket expression, "[[:alnum:]]". I've been using it for 25 years; I should probably update it at some point ...

On the plus side, my package has been built and tested under a bunch of different Unices and Linux, under DOS (yes! by a group in Sweden), under Windows from Windows 95 up through Windows 10 (currently), under VxWorks, under VAX/VMS, and for the Nintendo DS handheld gaming system. (I was going to try it on a Sony PSP, but I couldn't wrench the game system out of my son's hands!) The package was also built for PalmOS, but only tested in the Palm OS Emulator (POSE). (I think I tried testing some networking code on my Palm M105 using its serial port as an "ethernet" connection, but it was too kludgy for serious testing.)

And my regular expression package was recommended for inclusion in Tcl/Tk back in 1996. If I remember correctly, the Tcl/Tk developers went with Henry Spencer's older (or newer?) package. One of the commenters was concerned that my entire C library would have had to be incorporated into Tcl; however, looking back at an RE-only library I whipped up at the time shows that the RE package — at that time — only required the aperror, get_util, and str_util functions.

Older NOTE: This regular expression package is over twenty years old—it still works and I still use it. However, there are newer packages out on the web that you might find more suitable. Three packages that I've come across, but have not used myself, I mention here; there are undoubtedly others.

Pattern Matching

The REX utilities support the following regular expression constructs:

Basic Regular Expression (RE) Symbols:

c
matches character c.
\c
matches character c. Escaping c in this way allows matching a literal character c when c is an RE special symbol. For example, "(" is a special symbol indicating the start of an RE group and doesn't actually match any character itself. However, "\(" has nothing to do with groups and matches a literal left parenthesis in the target text.
.
matches any character.
^
anchors the left edge of the match at the beginning of the target string. If an RE is not anchored, rexMatch() and rexReplace() will scan the entire text string looking for a match. For example, regular expression "abc" would match those 3 characters in both abcxyz and xyzabc. Anchored regular expression "^abc", on the other hand, will only match the first 3 characters in abcxyz; no match will be made in xyzabc.
$
anchors the right edge of the match at the end of the target string. For example, regular expression "abc$" matches those 3 characters in xyzabc, but not in abcxyz.
A regular expression anchored at both ends ("^RE$") must consume the entire target string in order for the match to succeed. Aside from appearing at the very beginning and at the very end of the complete RE, these anchors can also be used with the same meanings in alternation expressions; for example, "^abc|^def$" is equivalent to matching "^abc$" or "^def$".

Quantified Regular Expressions:

RE*
matches zero or more instances of the RE.
RE+
matches one or more instances of the RE.
RE?
matches zero or one instance of the RE.
RE{m}
matches exactly m instances of the RE.
RE{m,n}
matches between m and n instances of the RE.

Concatenation and Alternation:

RE1RE2
matches RE1 followed immediately by RE2 (no intervening spaces in the RE patterns or in the target string).
RE1|RE2
matches RE1 or RE2.

Grouping:

(RE)
allows grouping of RE's. These subexpressions also function as capture groups, the indices of which are automatically assigned according to the order of the left parentheses. The indices begin with 1 and there is no limit on the number of groups, memory permitting.
(?<name>RE)
assigns a name, enclosed in angle brackets, to a capture group. In replacement and substitution, the capture group can be referenced by index or name.
(?:RE)
specifies a non-capture group. The group has no index and cannot be referenced in substitution strings. Its simple purpose is grouping REs.

Special Character Sequences:

\a \b \f \n \r \t \v
matches the standard C control characters: alert, backspace, form feed, new line, carriage return, horizontal tab, and vertical tab, respectively.
\d
matches a decimal digit, '0' to '9'; equivalent to "[0-9]".
\D
matches anything but a decimal digit, '0' to '9'. Equivalent to "[^0-9]".
\xHH
matches the character with the given 2-digit, hexadecimal value.
\uHHHH
inserts and matches the 1 to 4 characters of the UTF-8 representation of Unicode code point U+HHHH. This is for code points in the Basic Multilingual plane, U+0000 through U+FFFF. The code point must be specified as 4 hexadecimal digits.
\UHHHHHH
inserts and matches the 1 to 4 characters of the UTF-8 representation of Unicode code point U+HHHHHH. This is for code points in any of the 17 Unicode planes; the code points range from U+000000 to U+10FFFF. 6 hexadecimal digits must be specified.
\u{HHH...}
specifies a Unicode code point using a variable number of hexadecimal digits and inserts and matches its UTF-8 representation.
NOTE that the Unicode code point specifiers are replaced by the 1- to 4-character UTF-8 representation of the code point. Consequently, the Unicode specifiers should NOT be used in the square-bracketed character classes (immediately below). As an example, the violin emoji, U+1F3BB 🎻, a single code point, is represented in UTF-8 by a sequence of 4 bytes: 0xF0, 0x9F, 0x8E, and 0xBB. If the emoji is specified between square brackets, "[\U01F3BB]", it will be replaced internally by the 4 UTF-8 "characters": "[\xF0\x9F\x8E\xBB]". The latter will match any single one of the UTF-8 "characters", but not the sequence of 4 bytes. Alternation can be used to simulate the desired behavior: "[...whatever...]|\U01F3BB".

Character Classes (Outer Square Brackets):

[...]
matches classes of characters between the square brackets.
[^...]
matches anything but the classes of characters.

Character Classes (Within the Outer Square Brackets):

c-c
matches any character in a range of characters. Any number of these ranges may be specified between the square brackets or between the complementing caret and the right square bracket. For example, "[123a-z45]" matches the numbers '1', '2', '3', '4', and '5' and the lower-case English alphabet (letters 'a' through 'z').
c
matches the specified character; e.g., "[aeiou]" matches a vowel.
[:class:]    (must appear inside [...] or [^...])
matches an ASCII character in the specified class:
        [:alnum:]    [:digit:]    [:space:]
        [:alpha:]    [:graph:]    [:upper:]
        [:ascii:]    [:lower:]    [:xdigit:]
        [:blank:]    [:print:]    [:word:]
        [:cntrl:]    [:punct:]
See the man entries for isalnum(3), etc. for a description of these classes. NOTE that these brackets are inner square brackets and must appear inside the outer square brackets; e.g.,
[^A-Z[:lower:]@#![:digit:]]
will NOT match upper- ("A-Z") or lower-case ("[:lower:]") alphabetic characters, punctuation marks @, #, and ! ("@#!"), or decimal digits ("[:digit:]").

Pattern-Directed Text Substitution

The REX_UTIL package's text substitution functions, rexReplace() and rexSubstitute(), support the following special character sequences in replacement text.

Special Character Sequences in Replacement Text:

\M  or  $M
\<M>  or  $<M>
insert subexpression M (in the range 1..N) if it was matched by the RE; enclose multi-digt numbers (10 and up) in angle brackets.
\<name>
$<name>
insert the named group if it was matched by the RE; enclose names in angle brackets.
\&  or  $&
\0  or  $0
insert the entire text matched by the RE.
\`  or  $`
insert the text preceding the match.
\'  or  $'
insert the text following the match.
\l...  or  $l...
insert the numbered or named group matched by the RE, converted to lower case. For example, "\lM", "$l<M>", "\l<name>", "$l&", and "\l`".
\u...  or  $u...
insert the numbered or named group matched by the RE, converted to upper case. For example, "$uM", "\u<M>", "$u<name>", "\u&", and "$u'".
\c
insert character c (e.g., "\$" gives "$").

Notes on Implementation and Performance

The YACC grammar used to parse the regular expressions is based on the grammar found in Chapter 21 ("Parsing") of Robert Sedgewick's book, Algorithms. The graph representation of a regular expression is based on the representation described in Chapter 20 ("Pattern Matching"). The actual matching of a regular expression to a string is performed using a fairly straightforward, recursive search of the RE graph - I never quite got a handle on Sedgewick's use of double-ended queues! HP/UX's regexp(5) documentation was used as the authoritative source on syntax and semantics of regular expressions. The idea of generating the set of first characters in a regular expression was "stolen" from GNU's REGEX functions (they called it a fastmap). Henry Spencer's test set of regular expressions was used to check out the REX utilities (more on that below).

The REX utilities are slower than the UNIX regular expression utilities. The regexp(3) routines are very fast, but they only support a very basic set of regular expression constructs. The regcmp(3) routines aren't quite as fast, but they support a more powerful set of regular expression constructs. The REX utilities come in third, but their performance is not bad. 100,000 matches of "aaabbbcdef" by RE "a*b*cdef" on a Sparc1 workstation took 12 seconds for the REX utilities (that's about 8300 matches per second) and only 2 seconds for the regexp(3) utilities. Tests of the regcmp(3) utilities on an HP workstation showed that they are about 4 times as fast as the REX utilities.

I obtained a copy of UNIX V8-compatible regexp(3) routines (not to be confused with the regexp(3) routines mentioned above) from UUNET. These functions were written by Henry Spencer, a noted UNIX guru at the University of Toronto; see Gary Houston's regex GitHub repository for Henry's three versions of the package. The REX utilities were generally faster than these regexp(3) routines, but used more memory; the regexp(3) routines used a flat character string representation for REs inspired by Kernighan and Plauger's approach in their book, Software Tools.

I used the test set supplied with Henry Spencer's regexp(3) code to work the kinks out of the REX utilities. The major change that resulted from this exercise was the check for nested, empty closure expressions while parsing an RE. Expressions such as "(a*)*" are now rejected since they would cause endless cycling during a pattern match; innocuous nestings like "(a*)?" are also rejected. Null expressions (e.g., "()" and "(abc|)def") are now allowed. Also, "^" and "$" are always treated as beginning- and end-of-line anchors, unless they're escaped by a forward slash. Consequently, regular expressions such as "^abc|^def$" are now possible.


Public Procedures

rexCompile() - compiles a regular expression.
rexCreate() - creates a compiled RE and/or a matched RE structure.
rexDestroy() - deletes a compiled RE and/or a matched RE structure.
rexGetGroup() - gets a specified capture group from a matched RE structure.
rexMatch() - attempts to match a subject string against a compiled RE.
rexNumGroups() - returns the number of capture groups in a compiled RE.
rexReplace() - search for and replace text matched by a compiled RE.
rexSubstitute() - substitutes matched text for group references in a string.
rexWild() - converts a csh(1)-style wildcard expression to an RE.

Source Files

rex_util.c - includes the internal function, rexSearch().
rex_util.h
rex_util_y.y - regular expression parser.
rex_internals.h - internal representation of REs, including documentation.
rex_internals.c - assorted internal functions.

(See libgpl for the complete source, including support routines and build files.)


Alex Measday  /  E-mail