|
|
|
rex_util
- Regular Expression Utilities
The REX_UTIL package provides simple, regular expression (RE) pattern
matching and substitution capabilities. This package is similar to the
regcmp(3)
package (from the System V Programmers Workbench),
but regcmp(3)
isn't available on all systems.
And in 2021: I originally wrote these utilities
back in late 1989 or early 1990. The original syntax for regular expression
matching and replacement was influenced by my experience with
lex(1)
and
yacc(1)
and by examination of other implementations.
30 years later, a few years of good experience with the Sigil EPUB editor's regular expression find and replace features motivated me to finally upgrade this package. First of all, I made the parser fully reentrant (thanks to GNU's Bison) and the matching and replacement functions fully reentrant.
The original search function to match a regular expression against text,
rexSearch()
, was intentionally written in two versions:
(i) a fully recursive function implemented as a model for
(ii) an iterative function that emulated the recursive function
with a manually allocated and managed stack. For 30 years, I've always
used the iterative version, although I don't remember if I ever compared
the performance of the two versions. The code for the iterative version
is understandable after some study, but it is unwieldy. In the midst of
making these new changes, I switched to the recursive version. I eliminated
tail recursion (which had been eliminated in the original iterative version
from Day 1), making rexSearch()
a hybrid function with both
recursion and iteration.
The following changes were made to the regular expression syntax:
Subexpressions (capture groups) are now numbered by the count of left parentheses.
Capture groups can also be given names in addition to their automatically assigned numbers.
Non-capture groups are supported.
and to the substitution syntax:
Numbered groups can be referenced in the substitution string with
"\M
", "$M
",
"\<M>
", or "$<M>
",
where M is a possibly multi-digit group index, 1..N.
(The angle brackets are used as delimiters for numbers greater than 9;
e.g., "$<910>
.)
Named groups can be referenced as "\<name>
"
or "$<name>
". (The angle brackets are used
as delimiters for the name; e.g., "$<george>
".)
"\`
" or "$`
" refer to the text preceding
the match; "\'
" or "$'
" refer to the text
after the match.
Subexpressions, subpatterns, capture groups, capturing groups, back references, back-references, and backreferences ... what's in a name? (I am considering adding back-references to the package. Or maybe not.)
A kind of small achievement for me was that the REX_UTIL package now (i) successfully compiles Cal Henderson's PCRE 23k+ regular expression for matching RFC 822 email addresses and (ii) successfully parses my email address! His regular expression has 1,413 non-capture groups and 8 actual capture groups; capture groups 1 and 2 capture my username and capture groups 5 and 6 capture the domain name. With that many groups of both kinds, the regular expression is obviously complex — my compiled RE has 6,554 states. However, the regular expression features needed and used are the basic closures, intervals (curly braces), and character classes (square brackets); the most advanced feature is non-capture groups. Upon reflection, my old REX_UTIL package could have compiled the full expression with some minor changes to the RE. Parenthesized groups were, by definition, non-capture groups; a real capture group had to be explicitly labeled by a suffix identifying the group number (1..9), "(...)$n". So, take Cal Henderson's regular expression and (i) search for each of the 8 capture groups (whose left parenthesis is not followed by "?:") and add "$n" to the closing parenthesis and (ii) change all occurrences of "(?:" to "(". (As the author notes in his text file, RFC 822 was superseded by RFC 2822. RFC 2822 was, in turn, superseded by RFC 5322.)
A more recent NOTE (2015): I wrote my regular
expression package back in 1990, I think. We were building a distributed
satellite control center (TPOCC) for NASA using 3 platforms: SunOS, HP/UX,
and VxWorks. I believe the SunOS RE functions only allowed a single RE at
a time; HP/UX had regcmp(3)
which allowed multiple compiled REs
at one time, although the regex()
function has a global variable;
VxWorks had no RE functions in its libraries at that time. Consequently,
I wrote a single package that worked on all 3 systems and allowed you to
compile and use multiple REs simultaneously.
Back then, I was heavily influenced by regular expressions in the
lex(1)
Lexer Generator and in HP/UX's
regcmp(3)
functions. Thus the differences between my REs and modern REs. In
particular, parentheses, not escaped parentheses, are used to group REs:
"(RE1|RE2)" instead of "\(RE1|RE2\)". Matched subexpressions (SE) are
numbered specifically, "(...)$2", rather than by the implicit count of
the parenthetical groups, "\(SE1\)REn\(SE2\)". Lastly, I don't know how
it happened, but I really goofed with the character classes. For example,
in my package, character class ":alnum:" is inserted as follows into a
bracket expression, "[:alnum:]". The correct syntax for the character
class is "[:alnum:]", which then produces a bracket expression, "[[:alnum:]]".
I've been using it for 25 years; I should probably update it at some point ...
On the plus side, my package has been built and tested under a bunch of different Unices and Linux, under DOS (yes! by a group in Sweden), under Windows from Windows 95 up through Windows 10 (currently), under VxWorks, under VAX/VMS, and for the Nintendo DS handheld gaming system. (I was going to try it on a Sony PSP, but I couldn't wrench the game system out of my son's hands!) The package was also built for PalmOS, but only tested in the Palm OS Emulator (POSE). (I think I tried testing some networking code on my Palm M105 using its serial port as an "ethernet" connection, but it was too kludgy for serious testing.)
And my regular expression package was
recommended
for inclusion in Tcl/Tk back in 1996. If I remember correctly, the Tcl/Tk
developers went with Henry
Spencer's older (or newer?) package. One of the commenters was concerned
that my entire C library would have had to be incorporated into Tcl; however,
looking back at an RE-only library I whipped up at the time shows that the RE
package — at that time — only required the
aperror
, get_util
, and str_util
functions.
Older NOTE: This regular expression package is over twenty years old—it still works and I still use it. However, there are newer packages out on the web that you might find more suitable. Three packages that I've come across, but have not used myself, I mention here; there are undoubtedly others.
regex(3)
, so I let an old version of Regex++
supply that under Windows.)
The REX utilities support the following regular expression constructs:
\
c
(
" is a special symbol indicating
the start of an RE group and doesn't actually match any
character itself. However, "\(
" has nothing to do with
groups and matches a literal left parenthesis in the target text.
.
^
rexMatch()
and
rexReplace()
will scan the entire text string looking for a
match. For example, regular expression "abc
" would match
those 3 characters in both abcxyz and
xyzabc. Anchored regular expression "^abc
",
on the other hand, will only match the first 3 characters in
abcxyz; no match will be made in xyzabc.
$
abc$
" matches those 3
characters in xyzabc, but not in abcxyz.
^
RE$
") must consume the entire target
string in order for the match to succeed. Aside from appearing at the very
beginning and at the very end of the complete RE, these anchors can also be
used with the same meanings in alternation expressions; for example,
"^abc|^def$
" is equivalent to matching "^abc$
" or
"^def$
".
*
+
?
{
m}
{
m,
n}
|
RE2
(
RE)
(?<
name>
RE)
(?:
RE)
For example,
"((RE1)(?:RE2)(?<name>RE3))
"
has 3 visible capture groups:
Capture group 3 can be referenced in a substitution pattern by name
("$<name>
") or by number ("$3
").
\a \b \f \n \r \t \v
\d
[0-9]
".
\D
[^0-9]
".
\x
HH
\u
HHHH
\U
HHHHHH
\u{
HHH...}
[\U01F3BB]
", it will be replaced internally by
the 4 UTF-8 "characters": "[\xF0\x9F\x8E\xBB]
".
The latter will match any single one of the UTF-8 "characters",
but not the sequence of 4 bytes. Alternation can be used to
simulate the desired behavior:
"[
...whatever...]|\U01F3BB
".
[
...]
[^
...]
-
c
[123a-z45]
"
matches the numbers '1', '2', '3', '4', and '5' and the
lower-case English alphabet (letters 'a' through 'z').
[aeiou]
"
matches a vowel.
[:
class:]
(must
appear inside [
...]
or [^
...]
)
[:alnum:] [:digit:] [:space:] [:alpha:] [:graph:] [:upper:] [:ascii:] [:lower:] [:xdigit:] [:blank:] [:print:] [:word:] [:cntrl:] [:punct:]See the man entries for
isalnum(3)
, etc. for a description
of these classes. NOTE that these brackets are inner square
brackets and must appear inside the outer square brackets; e.g.,
[^A-Z[:lower:]@#![:digit:]]
A-Z
") or lower-case ("[:lower:]
")
alphabetic characters, punctuation marks @, #, and ! ("@#!
"),
or decimal digits ("[:digit:]
").
The REX_UTIL package's text substitution functions, rexReplace()
and rexSubstitute()
, support the following special character
sequences in replacement text.
\
M or $
M
\<
M>
or $<
M>
\<
name>
$<
name>
\&
or $&
\0
or $0
\`
or $`
\'
or $'
\l
... or $l
...
\l
M",
"$l<
M>
",
"\l<
name>
",
"$l&
", and "\l`
".
\u
... or $u
...
$u
M",
"\u<
M>
",
"$u<
name>
",
"\u&
", and "$u'
".
\
c
\$
" gives "$
").
The YACC grammar used to parse the regular expressions is based on the
grammar found in Chapter 21 ("Parsing") of Robert Sedgewick's book,
Algorithms. The graph representation of a regular expression
is based on the representation described in Chapter 20 ("Pattern
Matching"). The actual matching of a regular expression to a string is
performed using a fairly straightforward, recursive search of the RE graph
- I never quite got a handle on Sedgewick's use of double-ended queues!
HP/UX's regexp(5)
documentation was used as the authoritative
source on syntax and semantics of regular expressions. The idea of
generating the set of first characters in a regular expression was "stolen"
from GNU's REGEX functions (they called it a fastmap). Henry
Spencer's test set of regular expressions was used to check out the REX
utilities (more on that below).
The REX utilities are slower than the UNIX regular expression utilities.
The regexp(3)
routines are very fast, but they only support a
very basic set of regular expression constructs. The
regcmp(3)
routines aren't quite as fast, but they support a
more powerful set of regular expression constructs. The REX utilities come
in third, but their performance is not bad. 100,000 matches of
"aaabbbcdef
" by RE "a*b*cdef
" on a Sparc1 workstation took 12
seconds for the REX utilities (that's about 8300 matches per second) and
only 2 seconds for the regexp(3)
utilities. Tests of the
regcmp(3)
utilities on an HP workstation showed that they are
about 4 times as fast as the REX utilities.
I obtained a copy of UNIX V8-compatible regexp(3)
routines
(not to be confused with the regexp(3)
routines mentioned
above) from UUNET. These functions were written by Henry Spencer,
a noted UNIX guru at the University of Toronto; see Gary Houston's
regex GitHub repository
for Henry's three versions of the package. The REX utilities were generally
faster than these regexp(3)
routines, but used more memory; the
regexp(3)
routines used a flat character string representation
for REs inspired by Kernighan and Plauger's approach in their book,
Software Tools.
I used the
test
set supplied with Henry Spencer's regexp(3)
code to work
the kinks out of the REX utilities. The major change that resulted from
this exercise was the check for nested, empty closure expressions while
parsing an RE. Expressions such as "(a*)*
" are now rejected
since they would cause endless cycling during a pattern match; innocuous
nestings like "(a*)?
" are also rejected. Null expressions
(e.g., "()
" and "(abc|)def
") are now allowed.
Also, "^
" and "$
" are always treated as beginning-
and end-of-line anchors, unless they're escaped by a forward slash.
Consequently, regular expressions such as "^abc|^def$
"
are now possible.
rexCompile()
- compiles a regular expression.
rexCreate()
- creates a compiled RE and/or a matched RE
structure.
rexDestroy()
- deletes a compiled RE and/or a matched RE
structure.
rexGetGroup()
- gets a specified capture group from
a matched RE structure.
rexMatch()
- attempts to match a subject string against
a compiled RE.
rexNumGroups()
- returns the number of capture groups
in a compiled RE.
rexReplace()
- search for and replace text matched by a
compiled RE.
rexSubstitute()
- substitutes matched text for group
references in a string.
rexWild()
- converts a csh(1)
-style wildcard
expression to an RE.
rex_util.c
- includes the internal function, rexSearch()
.
rex_util.h
rex_util_y.y
- regular expression parser.
rex_internals.h
- internal representation of REs, including documentation.
rex_internals.c
- assorted internal functions.
(See libgpl
for the
complete source, including support routines and build files.)