Regular Expression Utilities

The REX_UTIL package provides simple, regular expression (RE) pattern matching and substitution capabilities. This package is similar to the regcmp(3) package (from the System V Programmers Workbench), but regcmp(3) isn't available on all systems.

And in 2021: I originally wrote these utilities back in late 1989 or early 1990. The original syntax for regular expression matching and replacement was influenced by my experience with lex(1) and yacc(1) and by examination of other implementations.

30 years later, a few years of good experience with the Sigil EPUB editor's regular expression find and replace features motivated me to finally upgrade this package. First of all, I made the parser fully reentrant (thanks to GNU's Bison) and the matching and replacement functions fully reentrant.

The original search function to match a regular expression against text, rexSearch(), was intentionally written in two versions: (i) a fully recursive function implemented as a model for (ii) an iterative function that emulated the recursive function with a manually allocated and managed stack. For 30 years, I've always used the iterative version, although I don't remember if I ever compared the performance of the two versions. The code for the iterative version is understandable after some study, but it is unwieldy. In the midst of making these new changes, I switched to the recursive version. I eliminated tail recursion (which had been eliminated in the original iterative version from Day 1), making rexSearch() a hybrid function with both recursion and iteration.

The following changes were made to the regular expression syntax:

Subexpressions (capture groups) are now numbered by the count of left parentheses.
Capture groups can also be given names in addition to their automatically assigned numbers.
Non-capture groups are supported.

and to the substitution syntax:

Numbered groups can be referenced in the substitution string with "\M", "$M", "\<M>", or "$<M>", where M is a possibly multi-digit group index, 1..N. (The angle brackets are used as delimiters for numbers greater than 9; e.g., "$<910>.)
Named groups can be referenced as "\<name>" or "$<name>". (The angle brackets are used as delimiters for the name; e.g., "$<george>".)
"\`" or "$`" refer to the text preceding the match; "\'" or "$'" refer to the text after the match.

Subexpressions, subpatterns, capture groups, capturing groups, back references, back-references, and backreferences ... what's in a name? (I am considering adding back-references to the package. Or maybe not.)

A kind of small achievement for me was that the REX_UTIL package now (i) successfully compiles Cal Henderson's PCRE 23k+ regular expression for matching RFC 822 email addresses and (ii) successfully parses my email address! His regular expression has 1,413 non-capture groups and 8 actual capture groups; capture groups 1 and 2 capture my username and capture groups 5 and 6 capture the domain name. With that many groups of both kinds, the regular expression is obviously complex — my compiled RE has 6,554 states. However, the regular expression features needed and used are the basic closures, intervals (curly braces), and character classes (square brackets); the most advanced feature is non-capture groups. Upon reflection, my old REX_UTIL package could have compiled the full expression with some minor changes to the RE. Parenthesized groups were, by definition, non-capture groups; a real capture group had to be explicitly labeled by a suffix identifying the group number (1..9), "(...)$n". So, take Cal Henderson's regular expression and (i) search for each of the 8 capture groups (whose left parenthesis is not followed by "?:") and add "$n" to the closing parenthesis and (ii) change all occurrences of "(?:" to "(". (As the author notes in his text file, RFC 822 was superseded by RFC 2822. RFC 2822 was, in turn, superseded by RFC 5322.)

A more recent NOTE (2015): I wrote my regular expression package back in 1990, I think. We were building a distributed satellite control center (TPOCC) for NASA using 3 platforms: SunOS, HP/UX, and VxWorks. I believe the SunOS RE functions only allowed a single RE at a time; HP/UX had regcmp(3) which allowed multiple compiled REs at one time, although the regex() function has a global variable; VxWorks had no RE functions in its libraries at that time. Consequently, I wrote a single package that worked on all 3 systems and allowed you to compile and use multiple REs simultaneously.

Back then, I was heavily influenced by regular expressions in the lex(1) Lexer Generator and in HP/UX's regcmp(3) functions. Thus the differences between my REs and modern REs. In particular, parentheses, not escaped parentheses, are used to group REs: "(RE1|RE2)" instead of "$RE1|RE2$". Matched subexpressions (SE) are numbered specifically, "(...)$2", rather than by the implicit count of the parenthetical groups, "$SE1$REn$SE2$". Lastly, I don't know how it happened, but I really goofed with the character classes. For example, in my package, character class ":alnum:" is inserted as follows into a bracket expression, "[:alnum:]". The correct syntax for the character class is "[:alnum:]", which then produces a bracket expression, "[[:alnum:]]". I've been using it for 25 years; I should probably update it at some point ...

On the plus side, my package has been built and tested under a bunch of different Unices and Linux, under DOS (yes! by a group in Sweden), under Windows from Windows 95 up through Windows 10 (currently), under VxWorks, under VAX/VMS, and for the Nintendo DS handheld gaming system. (I was going to try it on a Sony PSP, but I couldn't wrench the game system out of my son's hands!) The package was also built for PalmOS, but only tested in the Palm OS Emulator (POSE). (I think I tried testing some networking code on my Palm M105 using its serial port as an "ethernet" connection, but it was too kludgy for serious testing.)

And my regular expression package was recommended for inclusion in Tcl/Tk back in 1996. If I remember correctly, the Tcl/Tk developers went with Henry Spencer's older (or newer?) package. One of the commenters was concerned that my entire C library would have had to be incorporated into Tcl; however, looking back at an RE-only library I whipped up at the time shows that the RE package — at that time — only required the aperror, get_util, and str_util functions.

Older NOTE: This regular expression package is over twenty years old—it still works and I still use it. However, there are newer packages out on the web that you might find more suitable. Three packages that I've come across, but have not used myself, I mention here; there are undoubtedly others.

Pattern Matching

Basic Regular Expression (RE) Symbols:

Quantified Regular Expressions:

Concatenation and Alternation:

Grouping:

Capture group 3 can be referenced in a substitution pattern by name ("$<name>") or by number ("$3").

Special Character Sequences:

Character Classes (Outer Square Brackets):

Character Classes (Within the Outer Square Brackets):

Pattern-Directed Text Substitution

The REX_UTIL package's text substitution functions, rexReplace() and rexSubstitute(), support the following special character sequences in replacement text.

Special Character Sequences in Replacement Text:

Notes on Implementation and Performance

The YACC grammar used to parse the regular expressions is based on the grammar found in Chapter 21 ("Parsing") of Robert Sedgewick's book, Algorithms. The graph representation of a regular expression is based on the representation described in Chapter 20 ("Pattern Matching"). The actual matching of a regular expression to a string is performed using a fairly straightforward, recursive search of the RE graph - I never quite got a handle on Sedgewick's use of double-ended queues! HP/UX's regexp(5) documentation was used as the authoritative source on syntax and semantics of regular expressions. The idea of generating the set of first characters in a regular expression was "stolen" from GNU's REGEX functions (they called it a fastmap). Henry Spencer's test set of regular expressions was used to check out the REX utilities (more on that below).

The REX utilities are slower than the UNIX regular expression utilities. The regexp(3) routines are very fast, but they only support a very basic set of regular expression constructs. The regcmp(3) routines aren't quite as fast, but they support a more powerful set of regular expression constructs. The REX utilities come in third, but their performance is not bad. 100,000 matches of "aaabbbcdef" by RE "a*b*cdef" on a Sparc1 workstation took 12 seconds for the REX utilities (that's about 8300 matches per second) and only 2 seconds for the regexp(3) utilities. Tests of the regcmp(3) utilities on an HP workstation showed that they are about 4 times as fast as the REX utilities.

I obtained a copy of UNIX V8-compatible regexp(3) routines (not to be confused with the regexp(3) routines mentioned above) from UUNET. These functions were written by Henry Spencer, a noted UNIX guru at the University of Toronto; see Gary Houston's regex GitHub repository for Henry's three versions of the package. The REX utilities were generally faster than these regexp(3) routines, but used more memory; the regexp(3) routines used a flat character string representation for REs inspired by Kernighan and Plauger's approach in their book, Software Tools.

I used the test set supplied with Henry Spencer's regexp(3) code to work the kinks out of the REX utilities. The major change that resulted from this exercise was the check for nested, empty closure expressions while parsing an RE. Expressions such as "(a*)*" are now rejected since they would cause endless cycling during a pattern match; innocuous nestings like "(a*)?" are also rejected. Null expressions (e.g., "()" and "(abc|)def") are now allowed. Also, "^" and "$" are always treated as beginning- and end-of-line anchors, unless they're escaped by a forward slash. Consequently, regular expressions such as "^abc|^def$" are now possible.

Public Procedures

Source Files

(See libgpl for the complete source, including support routines and build files.)

rex_util - Regular Expression Utilities