Home

GEONius.com
28-Feb-2016
E-mail

rex_util - Regular Expression Utilities

The REX_UTIL package provides simple, regular expression (RE) pattern matching and substitution capabilities. This package is similar to the regcmp(3) package (from the System V Programmers Workbench), but regcmp(3) isn't available on all systems.

A more recent NOTE (2015): I wrote my regular expression package back in 1990, I think. We were building a distributed satellite control center (TPOCC) for NASA using 3 platforms: SunOS, HP/UX, and VxWorks. I believe the SunOS RE functions only allowed a single RE at a time; HP/UX had regcmp(3) which allowed multiple compiled REs at one time, although the regex() function has a global variable; VxWorks had no RE functions in its libraries at that time. Consequently, I wrote a single package that worked on all 3 systems and allowed you to compile and use multiple REs simultaneously.

Back then, I was heavily influenced by regular expressions in the lex(1) Lexer Generator and in HP/UX's regcmp(3) functions. Thus the differences between my REs and modern REs. In particular, parentheses, not escaped parentheses, are used to group REs: "(RE1|RE2)" instead of "\(RE1|RE2\)". Matched subexpressions (SE) are numbered specifically, "(...)$2", rather than by the implicit count of the parenthetical groups, "\(SE1\)REn\(SE2\)". Lastly, I don't know how it happened, but I really goofed with the character classes. For example, in my package, character class ":alnum:" is inserted as follows into a bracket expression, "[:alnum:]". The correct syntax for the character class is "[:alnum:]", which then produces a bracket expression, "[[:alnum:]]". I've been using it for 25 years; I should probably update it at some point ...

On the plus side, my package has been built and tested under a bunch of different Unices and Linux, under DOS (yes! by a group in Sweden), under Windows from Windows 95 up through Windows 10 (currently), under VxWorks, under VAX/VMS, and for the Nintendo DS handheld gaming system. (I was going to try it on a Sony PSP, but I couldn't wrench the game system out of my son's hands!) The package was also built for PalmOS, but only tested in the Palm OS Emulator (POSE). (I think I tried testing some networking code on my Palm M105 using its serial port as an "ethernet" connection, but it was too kludgy for serious testing.)

And my regular expression package was recommended for inclusion in Tcl/Tk back in 1996. If I remember correctly, the Tcl/Tk developers went with Henry Spencer's older package. (I imagine it was replaced later by a more full-featured package.) One of the commenters was concerned that my entire C library would have had to be incorporated into Tcl; however, looking back at an RE-only library I whipped up at the time shows that the RE package only required the aperror, get_util, and str_util functions.

Older NOTE: This regular expression package is over twenty years old—it still works and I still use it. However, there are newer packages out on the web that you might find more suitable. Three packages that I've come across, but have not used myself, I mention here; there are undoubtedly others.

Pattern Matching

The REX utilities support the following regular expression constructs:

"c"
matches one instance of the specified character. For example, regular expression "abc" matches the string abc. To match a special character, such as "." or "$", escape it with a forward slash: "\c".
"."
matches one instance of any character. For example, regular expression "a.c" matches the text strings aac, abc, acc, etc.
"^"
at the beginning of a regular expression, anchors the left edge of the match at the beginning of the target string. If a regular expression is not anchored, rex_match() and rex_replace() will scan the entire text string looking for a match. For example, regular expression "abc" would match both abc and xyzabc. Anchored regular expression "^abc", on the other hand, will only match abc, not xyzabc.
"$"
at the end of a regular expression, anchors the right edge of the match to the end of the target string. For example, regular expression "abc$" matches abc and xyzabc, but not abcxyz. A regular expression anchored at both ends ("^...$") must consume the entire target string in order for the match to succeed.
"[characters]"
matches one instance of a character in the specified set of characters. Character sets can be specified literally (e.g., "[abc]" matches a, b, or c) or as a range of characters (e.g., "[A-Za-z0-9]" matches any upper case letter, lower case letter, or digit.
"[^characters]"
matches one instance of a character not in the specified set of characters.
"[:class:]"
matches one instance of any character that belongs to the specified class of characters. The possible classes are alpha, upper, lower, digit, xdigit, alnum, space, punct, print, cntrl, and graph. Class names are not case-sensitive. Although the meanings of some of the classes are obvious, check the UNIX documentation for ctype(3) before using this regular expression construct.
"[^:class:]"
matches one instance of any character that does not belong to the specified class of characters.
"RE*"
matches zero or more instances of regular expression RE. For example, "a*b" matches b, ab, aab, aaab, etc.
"RE+"
matches one or more instances of regular expression RE. For example, "a+b" matches ab, aab, aaab, etc.
"RE?"
matches zero or one instance of regular expression RE. For example, "a?b" matches b and ab only.
"RE{[m][,[n]]}"
matches m through n instances of regular expression RE. If not specified, m defaults to 0 and n defaults to m ("RE{m}") or to a very large number ("RE{m,}"). "RE*" is equivalent to "RE{0,}". "RE+" is equivalent to "RE{1,}". "RE?" is equivalent to "RE{0,1}".
"(RE)"
matches regular expression RE. The parentheses allow you to group regular expressions. For example, regular expression "(ab)*cd" matches cd, abcd, ababcd, etc.
"(RE)$n"
matches regular expression RE and assigns the text matched by RE to the nth variable (where n is a single digit in the range 0 through 9). For example, if regular expression "((ab)*)$0(cd)$1" matches ababcd in a call to rex_match(), then abab is stored in retval0 and cd in retval1.
"RE1 RE2"
matches regular expression RE1 followed immediately by regular expression RE2; there should be no intervening spaces in the composite regular expression or in the target string. For example, regular expression "a*b" is the concatenation of regular expressions "a*" and "b".
"RE1 | RE2"
matches either RE1 or RE2. For example, regular expression "((abc)|(def))ghi" matches abcghi and defghi.

The regcmp(3) utilities support everything except alternation ("RE1|RE2"), zero-or-one closure ("RE?"), and character classification matching ("[:class:]"). Its "RE{m,n}" construct limits m and n to 256.

Pattern-Directed Text Substitution

The REX_UTIL package's text substitution function, rex_replace(), supports the following special character sequences in replacement text:

"$n"
Insert the subexpression (0..9) matched by "(RE)$n" in the regular expression.
"$&"
Insert the text matched by the full regular expression. For example, "$&$&" replaces the matched text by two copies of itself.
"$ln"
Insert the matched subexpression (0..9), converted to lower case.
"$l&"
Insert the text matched by the full regular expression, converted to lower case.
"$un"
Insert the matched subexpression (0..9), converted to upper case.
"$u&"
Insert the text matched by the full regular expression, converted to upper case.
"\c"
Insert character c; e.g., "\$" inserts a dollar sign in the replacement text.

Notes on Implementation and Performance

The YACC grammar used to parse the regular expressions is based on the grammar found in Chapter 21 ("Parsing") of Robert Sedgewick's book, Algorithms. The graph representation of a regular expression is based on the representation described in Chapter 20 ("Pattern Matching"). The actual matching of a regular expression to a string is performed using a fairly straightforward, recursive search of the RE graph - I never quite got a handle on Sedgewick's use of double-ended queues! HP/UX's regexp(5) documentation was used as the authoritative source on syntax and semantics of regular expressions. The idea of generating the set of first characters in a regular expression was "stolen" from GNU's REGEX functions (they called it a fastmap). Henry Spencer's test set of regular expressions was used to check out the REX utilities (more on that below).

The REX utilities are slower than the UNIX regular expression utilities. The regexp(3) routines are very fast, but they only support a very basic set of regular expression constructs. The regcmp(3) routines aren't quite as fast, but they support a more powerful set of regular expression constructs. The REX utilities come in third, but their performance is not bad. 100,000 matches of "aaabbbcdef" by RE "a*b*cdef" on a Sparc1 workstation took 12 seconds for the REX utilities (that's about 8300 matches per second) and only 2 seconds for the regexp(3) utilities. Tests of the regcmp(3) utilities on an HP workstation showed that they are about 4 times as fast as the REX utilities.

I obtained a copy of UNIX V8-compatible regexp(3) routines (not to be confused with the regexp(3) routines mentioned above) from UUNET. These functions were written by Henry Spencer, a noted UNIX guru at the University of Toronto. The REX utilities were generally faster than these regexp(3) routines, but used more memory; the regexp(3) routines used a flat character string representation for REs inspired by Kernighan and Plauger's approach in their book, Software Tools.

I used the test set supplied with Henry Spencer's regexp(3) code to work the kinks out of the REX utilities. The major change that resulted from this exercise was the check for nested, empty closure expressions while parsing an RE. Expressions such as "(a*)*" are now rejected since they would cause endless cycling during a pattern match; innocuous nestings like "(a*)?" are also rejected. Null expressions (e.g., "()" and "(abc|)def") are now allowed. Also, "^" and "$" are always treated as beginning- and end-of-line anchors, unless they're escaped by a forward slash. Consequently, regular expressions such as "^abc|^def$" are now possible.


Public Procedures

rex_compile() - compiles a regular expression.
rex_delete() - deletes a compiled regular expression.
rex_dump() - dumps a compiled regular expression pattern to a file.
rex_match() - attempts to match a subject string against a compiled RE.
rex_nosex() - returns the number of subexpression assignments defined for a compiled RE.
rex_replace() - search for and replace text matched by a compiled RE.
rex_wild() - converts a csh(1)-style wildcard expression to an RE.

Source Files

rex_util.c
rex_util.h
rex_util_y.y - regular expression parser.
rex_internals.h - internal representation of REs.

(See libgpl for the complete source, including support routines and build files.)


Alex Measday  /  E-mail