e1e595a649
* Makefile.in (install-headers): Remove extra command to install regex.h. (uninstall-headers): Remove extra command to uninstall regex.h. * nlsfuncs.cc (collate_lcid): Make externally available to allow access to collation internals from regex functions. (collate_charset): Ditto. * wchar.h: Add __cplusplus guards to make C-clean. * include/regex.h: New file, replacing regex/regex.h. Remove UCB advertising clause. * regex/COPYRIGHT: Accommodate BSD license. Remove UCB advertising clause. * regex/cclass.h: Remove. * regex/cname.h: New file from FreeBSD. * regex/engine.c: Ditto. (NONCHAR): Tweak for Cygwin. * regex/engine.ih: Remove. * regex/mkh: Remove. * regex/regcomp.c: New file from FreeBSD. Tweak slightly for Cygwin. Import required collate internals from nlsfunc.cc. (p_ere_exp): Add GNU-specific \< and \> handling for word boundaries. (p_simp_re): Ditto. (__collate_range_cmp): Define. (p_b_term): Use Cygwin-specific collate internals. (findmust): Ditto. * regex/regcomp.ih: Remove. * regex/regerror.c: New file from FreeBSD. Fix a few compiler warnings. * regex/regerror.ih: Remove. * regex/regex.7: New file from FreeBSD. Remove UCB advertising clause. * regex/regex.h: Remove. Replaced by include/regex.h. * regex/regexec.c: New file from FreeBSD. Fix a few compiler warnings. * regex/regfree.c: New file from FreeBSD. * regex/tests: Remove. * regex/utils.h: New file from FreeBSD.
728 lines
17 KiB
Groff
728 lines
17 KiB
Groff
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
|
|
.\" Copyright (c) 1992, 1993, 1994
|
|
.\" The Regents of the University of California. All rights reserved.
|
|
.\"
|
|
.\" This code is derived from software contributed to Berkeley by
|
|
.\" Henry Spencer.
|
|
.\"
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
.\" modification, are permitted provided that the following conditions
|
|
.\" are met:
|
|
.\" 1. Redistributions of source code must retain the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer.
|
|
.\" 2. Redistributions in binary form must reproduce the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer in the
|
|
.\" documentation and/or other materials provided with the distribution.
|
|
.\" 4. Neither the name of the University nor the names of its contributors
|
|
.\" may be used to endorse or promote products derived from this software
|
|
.\" without specific prior written permission.
|
|
.\"
|
|
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
.\" SUCH DAMAGE.
|
|
.\"
|
|
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
|
|
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
|
|
.\"
|
|
.Dd August 17, 2005
|
|
.Dt REGEX 3
|
|
.Os
|
|
.Sh NAME
|
|
.Nm regcomp ,
|
|
.Nm regexec ,
|
|
.Nm regerror ,
|
|
.Nm regfree
|
|
.Nd regular-expression library
|
|
.Sh LIBRARY
|
|
.Lb libc
|
|
.Sh SYNOPSIS
|
|
.In regex.h
|
|
.Ft int
|
|
.Fo regcomp
|
|
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
|
|
.Fc
|
|
.Ft int
|
|
.Fo regexec
|
|
.Fa "const regex_t * restrict preg" "const char * restrict string"
|
|
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
|
|
.Fc
|
|
.Ft size_t
|
|
.Fo regerror
|
|
.Fa "int errcode" "const regex_t * restrict preg"
|
|
.Fa "char * restrict errbuf" "size_t errbuf_size"
|
|
.Fc
|
|
.Ft void
|
|
.Fn regfree "regex_t *preg"
|
|
.Sh DESCRIPTION
|
|
These routines implement
|
|
.St -p1003.2
|
|
regular expressions
|
|
.Pq Do RE Dc Ns s ;
|
|
see
|
|
.Xr re_format 7 .
|
|
The
|
|
.Fn regcomp
|
|
function
|
|
compiles an RE written as a string into an internal form,
|
|
.Fn regexec
|
|
matches that internal form against a string and reports results,
|
|
.Fn regerror
|
|
transforms error codes from either into human-readable messages,
|
|
and
|
|
.Fn regfree
|
|
frees any dynamically-allocated storage used by the internal form
|
|
of an RE.
|
|
.Pp
|
|
The header
|
|
.In regex.h
|
|
declares two structure types,
|
|
.Ft regex_t
|
|
and
|
|
.Ft regmatch_t ,
|
|
the former for compiled internal forms and the latter for match reporting.
|
|
It also declares the four functions,
|
|
a type
|
|
.Ft regoff_t ,
|
|
and a number of constants with names starting with
|
|
.Dq Dv REG_ .
|
|
.Pp
|
|
The
|
|
.Fn regcomp
|
|
function
|
|
compiles the regular expression contained in the
|
|
.Fa pattern
|
|
string,
|
|
subject to the flags in
|
|
.Fa cflags ,
|
|
and places the results in the
|
|
.Ft regex_t
|
|
structure pointed to by
|
|
.Fa preg .
|
|
The
|
|
.Fa cflags
|
|
argument
|
|
is the bitwise OR of zero or more of the following flags:
|
|
.Bl -tag -width REG_EXTENDED
|
|
.It Dv REG_EXTENDED
|
|
Compile modern
|
|
.Pq Dq extended
|
|
REs,
|
|
rather than the obsolete
|
|
.Pq Dq basic
|
|
REs that
|
|
are the default.
|
|
.It Dv REG_BASIC
|
|
This is a synonym for 0,
|
|
provided as a counterpart to
|
|
.Dv REG_EXTENDED
|
|
to improve readability.
|
|
.It Dv REG_NOSPEC
|
|
Compile with recognition of all special characters turned off.
|
|
All characters are thus considered ordinary,
|
|
so the
|
|
.Dq RE
|
|
is a literal string.
|
|
This is an extension,
|
|
compatible with but not specified by
|
|
.St -p1003.2 ,
|
|
and should be used with
|
|
caution in software intended to be portable to other systems.
|
|
.Dv REG_EXTENDED
|
|
and
|
|
.Dv REG_NOSPEC
|
|
may not be used
|
|
in the same call to
|
|
.Fn regcomp .
|
|
.It Dv REG_ICASE
|
|
Compile for matching that ignores upper/lower case distinctions.
|
|
See
|
|
.Xr re_format 7 .
|
|
.It Dv REG_NOSUB
|
|
Compile for matching that need only report success or failure,
|
|
not what was matched.
|
|
.It Dv REG_NEWLINE
|
|
Compile for newline-sensitive matching.
|
|
By default, newline is a completely ordinary character with no special
|
|
meaning in either REs or strings.
|
|
With this flag,
|
|
.Ql [^
|
|
bracket expressions and
|
|
.Ql .\&
|
|
never match newline,
|
|
a
|
|
.Ql ^\&
|
|
anchor matches the null string after any newline in the string
|
|
in addition to its normal function,
|
|
and the
|
|
.Ql $\&
|
|
anchor matches the null string before any newline in the
|
|
string in addition to its normal function.
|
|
.It Dv REG_PEND
|
|
The regular expression ends,
|
|
not at the first NUL,
|
|
but just before the character pointed to by the
|
|
.Va re_endp
|
|
member of the structure pointed to by
|
|
.Fa preg .
|
|
The
|
|
.Va re_endp
|
|
member is of type
|
|
.Ft "const char *" .
|
|
This flag permits inclusion of NULs in the RE;
|
|
they are considered ordinary characters.
|
|
This is an extension,
|
|
compatible with but not specified by
|
|
.St -p1003.2 ,
|
|
and should be used with
|
|
caution in software intended to be portable to other systems.
|
|
.El
|
|
.Pp
|
|
When successful,
|
|
.Fn regcomp
|
|
returns 0 and fills in the structure pointed to by
|
|
.Fa preg .
|
|
One member of that structure
|
|
(other than
|
|
.Va re_endp )
|
|
is publicized:
|
|
.Va re_nsub ,
|
|
of type
|
|
.Ft size_t ,
|
|
contains the number of parenthesized subexpressions within the RE
|
|
(except that the value of this member is undefined if the
|
|
.Dv REG_NOSUB
|
|
flag was used).
|
|
If
|
|
.Fn regcomp
|
|
fails, it returns a non-zero error code;
|
|
see
|
|
.Sx DIAGNOSTICS .
|
|
.Pp
|
|
The
|
|
.Fn regexec
|
|
function
|
|
matches the compiled RE pointed to by
|
|
.Fa preg
|
|
against the
|
|
.Fa string ,
|
|
subject to the flags in
|
|
.Fa eflags ,
|
|
and reports results using
|
|
.Fa nmatch ,
|
|
.Fa pmatch ,
|
|
and the returned value.
|
|
The RE must have been compiled by a previous invocation of
|
|
.Fn regcomp .
|
|
The compiled form is not altered during execution of
|
|
.Fn regexec ,
|
|
so a single compiled RE can be used simultaneously by multiple threads.
|
|
.Pp
|
|
By default,
|
|
the NUL-terminated string pointed to by
|
|
.Fa string
|
|
is considered to be the text of an entire line, minus any terminating
|
|
newline.
|
|
The
|
|
.Fa eflags
|
|
argument is the bitwise OR of zero or more of the following flags:
|
|
.Bl -tag -width REG_STARTEND
|
|
.It Dv REG_NOTBOL
|
|
The first character of
|
|
the string
|
|
is not the beginning of a line, so the
|
|
.Ql ^\&
|
|
anchor should not match before it.
|
|
This does not affect the behavior of newlines under
|
|
.Dv REG_NEWLINE .
|
|
.It Dv REG_NOTEOL
|
|
The NUL terminating
|
|
the string
|
|
does not end a line, so the
|
|
.Ql $\&
|
|
anchor should not match before it.
|
|
This does not affect the behavior of newlines under
|
|
.Dv REG_NEWLINE .
|
|
.It Dv REG_STARTEND
|
|
The string is considered to start at
|
|
.Fa string
|
|
+
|
|
.Fa pmatch Ns [0]. Ns Va rm_so
|
|
and to have a terminating NUL located at
|
|
.Fa string
|
|
+
|
|
.Fa pmatch Ns [0]. Ns Va rm_eo
|
|
(there need not actually be a NUL at that location),
|
|
regardless of the value of
|
|
.Fa nmatch .
|
|
See below for the definition of
|
|
.Fa pmatch
|
|
and
|
|
.Fa nmatch .
|
|
This is an extension,
|
|
compatible with but not specified by
|
|
.St -p1003.2 ,
|
|
and should be used with
|
|
caution in software intended to be portable to other systems.
|
|
Note that a non-zero
|
|
.Va rm_so
|
|
does not imply
|
|
.Dv REG_NOTBOL ;
|
|
.Dv REG_STARTEND
|
|
affects only the location of the string,
|
|
not how it is matched.
|
|
.El
|
|
.Pp
|
|
See
|
|
.Xr re_format 7
|
|
for a discussion of what is matched in situations where an RE or a
|
|
portion thereof could match any of several substrings of
|
|
.Fa string .
|
|
.Pp
|
|
Normally,
|
|
.Fn regexec
|
|
returns 0 for success and the non-zero code
|
|
.Dv REG_NOMATCH
|
|
for failure.
|
|
Other non-zero error codes may be returned in exceptional situations;
|
|
see
|
|
.Sx DIAGNOSTICS .
|
|
.Pp
|
|
If
|
|
.Dv REG_NOSUB
|
|
was specified in the compilation of the RE,
|
|
or if
|
|
.Fa nmatch
|
|
is 0,
|
|
.Fn regexec
|
|
ignores the
|
|
.Fa pmatch
|
|
argument (but see below for the case where
|
|
.Dv REG_STARTEND
|
|
is specified).
|
|
Otherwise,
|
|
.Fa pmatch
|
|
points to an array of
|
|
.Fa nmatch
|
|
structures of type
|
|
.Ft regmatch_t .
|
|
Such a structure has at least the members
|
|
.Va rm_so
|
|
and
|
|
.Va rm_eo ,
|
|
both of type
|
|
.Ft regoff_t
|
|
(a signed arithmetic type at least as large as an
|
|
.Ft off_t
|
|
and a
|
|
.Ft ssize_t ) ,
|
|
containing respectively the offset of the first character of a substring
|
|
and the offset of the first character after the end of the substring.
|
|
Offsets are measured from the beginning of the
|
|
.Fa string
|
|
argument given to
|
|
.Fn regexec .
|
|
An empty substring is denoted by equal offsets,
|
|
both indicating the character following the empty substring.
|
|
.Pp
|
|
The 0th member of the
|
|
.Fa pmatch
|
|
array is filled in to indicate what substring of
|
|
.Fa string
|
|
was matched by the entire RE.
|
|
Remaining members report what substring was matched by parenthesized
|
|
subexpressions within the RE;
|
|
member
|
|
.Va i
|
|
reports subexpression
|
|
.Va i ,
|
|
with subexpressions counted (starting at 1) by the order of their opening
|
|
parentheses in the RE, left to right.
|
|
Unused entries in the array (corresponding either to subexpressions that
|
|
did not participate in the match at all, or to subexpressions that do not
|
|
exist in the RE (that is,
|
|
.Va i
|
|
>
|
|
.Fa preg Ns -> Ns Va re_nsub ) )
|
|
have both
|
|
.Va rm_so
|
|
and
|
|
.Va rm_eo
|
|
set to -1.
|
|
If a subexpression participated in the match several times,
|
|
the reported substring is the last one it matched.
|
|
(Note, as an example in particular, that when the RE
|
|
.Ql "(b*)+"
|
|
matches
|
|
.Ql bbb ,
|
|
the parenthesized subexpression matches each of the three
|
|
.So Li b Sc Ns s
|
|
and then
|
|
an infinite number of empty strings following the last
|
|
.Ql b ,
|
|
so the reported substring is one of the empties.)
|
|
.Pp
|
|
If
|
|
.Dv REG_STARTEND
|
|
is specified,
|
|
.Fa pmatch
|
|
must point to at least one
|
|
.Ft regmatch_t
|
|
(even if
|
|
.Fa nmatch
|
|
is 0 or
|
|
.Dv REG_NOSUB
|
|
was specified),
|
|
to hold the input offsets for
|
|
.Dv REG_STARTEND .
|
|
Use for output is still entirely controlled by
|
|
.Fa nmatch ;
|
|
if
|
|
.Fa nmatch
|
|
is 0 or
|
|
.Dv REG_NOSUB
|
|
was specified,
|
|
the value of
|
|
.Fa pmatch Ns [0]
|
|
will not be changed by a successful
|
|
.Fn regexec .
|
|
.Pp
|
|
The
|
|
.Fn regerror
|
|
function
|
|
maps a non-zero
|
|
.Fa errcode
|
|
from either
|
|
.Fn regcomp
|
|
or
|
|
.Fn regexec
|
|
to a human-readable, printable message.
|
|
If
|
|
.Fa preg
|
|
is
|
|
.No non\- Ns Dv NULL ,
|
|
the error code should have arisen from use of
|
|
the
|
|
.Ft regex_t
|
|
pointed to by
|
|
.Fa preg ,
|
|
and if the error code came from
|
|
.Fn regcomp ,
|
|
it should have been the result from the most recent
|
|
.Fn regcomp
|
|
using that
|
|
.Ft regex_t .
|
|
The
|
|
.Fn ( regerror
|
|
may be able to supply a more detailed message using information
|
|
from the
|
|
.Ft regex_t . )
|
|
The
|
|
.Fn regerror
|
|
function
|
|
places the NUL-terminated message into the buffer pointed to by
|
|
.Fa errbuf ,
|
|
limiting the length (including the NUL) to at most
|
|
.Fa errbuf_size
|
|
bytes.
|
|
If the whole message will not fit,
|
|
as much of it as will fit before the terminating NUL is supplied.
|
|
In any case,
|
|
the returned value is the size of buffer needed to hold the whole
|
|
message (including terminating NUL).
|
|
If
|
|
.Fa errbuf_size
|
|
is 0,
|
|
.Fa errbuf
|
|
is ignored but the return value is still correct.
|
|
.Pp
|
|
If the
|
|
.Fa errcode
|
|
given to
|
|
.Fn regerror
|
|
is first ORed with
|
|
.Dv REG_ITOA ,
|
|
the
|
|
.Dq message
|
|
that results is the printable name of the error code,
|
|
e.g.\&
|
|
.Dq Dv REG_NOMATCH ,
|
|
rather than an explanation thereof.
|
|
If
|
|
.Fa errcode
|
|
is
|
|
.Dv REG_ATOI ,
|
|
then
|
|
.Fa preg
|
|
shall be
|
|
.No non\- Ns Dv NULL
|
|
and the
|
|
.Va re_endp
|
|
member of the structure it points to
|
|
must point to the printable name of an error code;
|
|
in this case, the result in
|
|
.Fa errbuf
|
|
is the decimal digits of
|
|
the numeric value of the error code
|
|
(0 if the name is not recognized).
|
|
.Dv REG_ITOA
|
|
and
|
|
.Dv REG_ATOI
|
|
are intended primarily as debugging facilities;
|
|
they are extensions,
|
|
compatible with but not specified by
|
|
.St -p1003.2 ,
|
|
and should be used with
|
|
caution in software intended to be portable to other systems.
|
|
Be warned also that they are considered experimental and changes are possible.
|
|
.Pp
|
|
The
|
|
.Fn regfree
|
|
function
|
|
frees any dynamically-allocated storage associated with the compiled RE
|
|
pointed to by
|
|
.Fa preg .
|
|
The remaining
|
|
.Ft regex_t
|
|
is no longer a valid compiled RE
|
|
and the effect of supplying it to
|
|
.Fn regexec
|
|
or
|
|
.Fn regerror
|
|
is undefined.
|
|
.Pp
|
|
None of these functions references global variables except for tables
|
|
of constants;
|
|
all are safe for use from multiple threads if the arguments are safe.
|
|
.Sh IMPLEMENTATION CHOICES
|
|
There are a number of decisions that
|
|
.St -p1003.2
|
|
leaves up to the implementor,
|
|
either by explicitly saying
|
|
.Dq undefined
|
|
or by virtue of them being
|
|
forbidden by the RE grammar.
|
|
This implementation treats them as follows.
|
|
.Pp
|
|
See
|
|
.Xr re_format 7
|
|
for a discussion of the definition of case-independent matching.
|
|
.Pp
|
|
There is no particular limit on the length of REs,
|
|
except insofar as memory is limited.
|
|
Memory usage is approximately linear in RE size, and largely insensitive
|
|
to RE complexity, except for bounded repetitions.
|
|
See
|
|
.Sx BUGS
|
|
for one short RE using them
|
|
that will run almost any system out of memory.
|
|
.Pp
|
|
A backslashed character other than one specifically given a magic meaning
|
|
by
|
|
.St -p1003.2
|
|
(such magic meanings occur only in obsolete
|
|
.Bq Dq basic
|
|
REs)
|
|
is taken as an ordinary character.
|
|
.Pp
|
|
Any unmatched
|
|
.Ql [\&
|
|
is a
|
|
.Dv REG_EBRACK
|
|
error.
|
|
.Pp
|
|
Equivalence classes cannot begin or end bracket-expression ranges.
|
|
The endpoint of one range cannot begin another.
|
|
.Pp
|
|
.Dv RE_DUP_MAX ,
|
|
the limit on repetition counts in bounded repetitions, is 255.
|
|
.Pp
|
|
A repetition operator
|
|
.Ql ( ?\& ,
|
|
.Ql *\& ,
|
|
.Ql +\& ,
|
|
or bounds)
|
|
cannot follow another
|
|
repetition operator.
|
|
A repetition operator cannot begin an expression or subexpression
|
|
or follow
|
|
.Ql ^\&
|
|
or
|
|
.Ql |\& .
|
|
.Pp
|
|
.Ql |\&
|
|
cannot appear first or last in a (sub)expression or after another
|
|
.Ql |\& ,
|
|
i.e., an operand of
|
|
.Ql |\&
|
|
cannot be an empty subexpression.
|
|
An empty parenthesized subexpression,
|
|
.Ql "()" ,
|
|
is legal and matches an
|
|
empty (sub)string.
|
|
An empty string is not a legal RE.
|
|
.Pp
|
|
A
|
|
.Ql {\&
|
|
followed by a digit is considered the beginning of bounds for a
|
|
bounded repetition, which must then follow the syntax for bounds.
|
|
A
|
|
.Ql {\&
|
|
.Em not
|
|
followed by a digit is considered an ordinary character.
|
|
.Pp
|
|
.Ql ^\&
|
|
and
|
|
.Ql $\&
|
|
beginning and ending subexpressions in obsolete
|
|
.Pq Dq basic
|
|
REs are anchors, not ordinary characters.
|
|
.Sh DIAGNOSTICS
|
|
Non-zero error codes from
|
|
.Fn regcomp
|
|
and
|
|
.Fn regexec
|
|
include the following:
|
|
.Pp
|
|
.Bl -tag -width REG_ECOLLATE -compact
|
|
.It Dv REG_NOMATCH
|
|
The
|
|
.Fn regexec
|
|
function
|
|
failed to match
|
|
.It Dv REG_BADPAT
|
|
invalid regular expression
|
|
.It Dv REG_ECOLLATE
|
|
invalid collating element
|
|
.It Dv REG_ECTYPE
|
|
invalid character class
|
|
.It Dv REG_EESCAPE
|
|
.Ql \e
|
|
applied to unescapable character
|
|
.It Dv REG_ESUBREG
|
|
invalid backreference number
|
|
.It Dv REG_EBRACK
|
|
brackets
|
|
.Ql "[ ]"
|
|
not balanced
|
|
.It Dv REG_EPAREN
|
|
parentheses
|
|
.Ql "( )"
|
|
not balanced
|
|
.It Dv REG_EBRACE
|
|
braces
|
|
.Ql "{ }"
|
|
not balanced
|
|
.It Dv REG_BADBR
|
|
invalid repetition count(s) in
|
|
.Ql "{ }"
|
|
.It Dv REG_ERANGE
|
|
invalid character range in
|
|
.Ql "[ ]"
|
|
.It Dv REG_ESPACE
|
|
ran out of memory
|
|
.It Dv REG_BADRPT
|
|
.Ql ?\& ,
|
|
.Ql *\& ,
|
|
or
|
|
.Ql +\&
|
|
operand invalid
|
|
.It Dv REG_EMPTY
|
|
empty (sub)expression
|
|
.It Dv REG_ASSERT
|
|
cannot happen - you found a bug
|
|
.It Dv REG_INVARG
|
|
invalid argument, e.g.\& negative-length string
|
|
.It Dv REG_ILLSEQ
|
|
illegal byte sequence (bad multibyte character)
|
|
.El
|
|
.Sh SEE ALSO
|
|
.Xr grep 1 ,
|
|
.Xr re_format 7
|
|
.Pp
|
|
.St -p1003.2 ,
|
|
sections 2.8 (Regular Expression Notation)
|
|
and
|
|
B.5 (C Binding for Regular Expression Matching).
|
|
.Sh HISTORY
|
|
Originally written by
|
|
.An Henry Spencer .
|
|
Altered for inclusion in the
|
|
.Bx 4.4
|
|
distribution.
|
|
.Sh BUGS
|
|
This is an alpha release with known defects.
|
|
Please report problems.
|
|
.Pp
|
|
The back-reference code is subtle and doubts linger about its correctness
|
|
in complex cases.
|
|
.Pp
|
|
The
|
|
.Fn regexec
|
|
function
|
|
performance is poor.
|
|
This will improve with later releases.
|
|
The
|
|
.Fa nmatch
|
|
argument
|
|
exceeding 0 is expensive;
|
|
.Fa nmatch
|
|
exceeding 1 is worse.
|
|
The
|
|
.Fn regexec
|
|
function
|
|
is largely insensitive to RE complexity
|
|
.Em except
|
|
that back
|
|
references are massively expensive.
|
|
RE length does matter; in particular, there is a strong speed bonus
|
|
for keeping RE length under about 30 characters,
|
|
with most special characters counting roughly double.
|
|
.Pp
|
|
The
|
|
.Fn regcomp
|
|
function
|
|
implements bounded repetitions by macro expansion,
|
|
which is costly in time and space if counts are large
|
|
or bounded repetitions are nested.
|
|
An RE like, say,
|
|
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
|
|
will (eventually) run almost any existing machine out of swap space.
|
|
.Pp
|
|
There are suspected problems with response to obscure error conditions.
|
|
Notably,
|
|
certain kinds of internal overflow,
|
|
produced only by truly enormous REs or by multiply nested bounded repetitions,
|
|
are probably not handled well.
|
|
.Pp
|
|
Due to a mistake in
|
|
.St -p1003.2 ,
|
|
things like
|
|
.Ql "a)b"
|
|
are legal REs because
|
|
.Ql )\&
|
|
is
|
|
a special character only in the presence of a previous unmatched
|
|
.Ql (\& .
|
|
This cannot be fixed until the spec is fixed.
|
|
.Pp
|
|
The standard's definition of back references is vague.
|
|
For example, does
|
|
.Ql "a\e(\e(b\e)*\e2\e)*d"
|
|
match
|
|
.Ql "abbbd" ?
|
|
Until the standard is clarified,
|
|
behavior in such cases should not be relied on.
|
|
.Pp
|
|
The implementation of word-boundary matching is a bit of a kludge,
|
|
and bugs may lurk in combinations of word-boundary matching and anchoring.
|
|
.Pp
|
|
Word-boundary matching does not work properly in multibyte locales.
|