322 lines
		
	
	
		
			8.9 KiB
		
	
	
	
		
			Groff
		
	
	
	
	
	
			
		
		
	
	
			322 lines
		
	
	
		
			8.9 KiB
		
	
	
	
		
			Groff
		
	
	
	
	
	
.\" Copyright (c) 1991, 1993
 | 
						|
.\"	The Regents of the University of California.  All rights reserved.
 | 
						|
.\"
 | 
						|
.\" Redistribution and use in source and binary forms, with or without
 | 
						|
.\" modification, are permitted provided that the following conditions
 | 
						|
.\" are met:
 | 
						|
.\" 1. Redistributions of source code must retain the above copyright
 | 
						|
.\"    notice, this list of conditions and the following disclaimer.
 | 
						|
.\" 2. Redistributions in binary form must reproduce the above copyright
 | 
						|
.\"    notice, this list of conditions and the following disclaimer in the
 | 
						|
.\"    documentation and/or other materials provided with the distribution.
 | 
						|
.\" 3. All advertising materials mentioning features or use of this software
 | 
						|
.\"    must display the following acknowledgement:
 | 
						|
.\"	This product includes software developed by the University of
 | 
						|
.\"	California, Berkeley and its contributors.
 | 
						|
.\" 4. Neither the name of the University nor the names of its contributors
 | 
						|
.\"    may be used to endorse or promote products derived from this software
 | 
						|
.\"    without specific prior written permission.
 | 
						|
.\"
 | 
						|
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 | 
						|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 | 
						|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 | 
						|
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 | 
						|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 | 
						|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 | 
						|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 | 
						|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 | 
						|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 | 
						|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 | 
						|
.\" SUCH DAMAGE.
 | 
						|
.\"
 | 
						|
.\"     @(#)regexp.3	8.1 (Berkeley) 6/4/93
 | 
						|
.\"
 | 
						|
.Dd June 4, 1993
 | 
						|
.Dt REGEXP 3
 | 
						|
.Os
 | 
						|
.Sh NAME
 | 
						|
.Nm regcomp ,
 | 
						|
.Nm regexec ,
 | 
						|
.Nm regsub ,
 | 
						|
.Nm regerror
 | 
						|
.Nd regular expression handlers
 | 
						|
.Sh SYNOPSIS
 | 
						|
.Fd #include <regexp.h>
 | 
						|
.Ft regexp *
 | 
						|
.Fn regcomp "const char *exp"
 | 
						|
.Ft int
 | 
						|
.Fn regexec "const regexp *prog" "const char *string"
 | 
						|
.Ft void
 | 
						|
.Fn regsub "const regexp *prog" "const char *source" "char *dest"
 | 
						|
.Sh DESCRIPTION
 | 
						|
.Bf -symbolic
 | 
						|
This interface is made obsolete by
 | 
						|
.Xr regex 3 .
 | 
						|
It is available from the compatibility library, libcompat.
 | 
						|
.Ef
 | 
						|
.Pp
 | 
						|
The
 | 
						|
.Fn regcomp ,
 | 
						|
.Fn regexec ,
 | 
						|
.Fn regsub ,
 | 
						|
and
 | 
						|
.Fn regerror
 | 
						|
functions
 | 
						|
implement
 | 
						|
.Xr egrep 1 Ns -style
 | 
						|
regular expressions and supporting facilities.
 | 
						|
.Pp
 | 
						|
The
 | 
						|
.Fn regcomp
 | 
						|
function
 | 
						|
compiles a regular expression into a structure of type
 | 
						|
.Xr regexp ,
 | 
						|
and returns a pointer to it.
 | 
						|
The space has been allocated using
 | 
						|
.Xr malloc 3
 | 
						|
and may be released by
 | 
						|
.Xr free .
 | 
						|
.Pp
 | 
						|
The
 | 
						|
.Fn regexec
 | 
						|
function
 | 
						|
matches a
 | 
						|
.Dv NUL Ns -terminated
 | 
						|
.Fa string
 | 
						|
against the compiled regular expression
 | 
						|
in
 | 
						|
.Fa prog .
 | 
						|
It returns 1 for success and 0 for failure, and adjusts the contents of
 | 
						|
.Fa prog Ns 's
 | 
						|
.Em startp
 | 
						|
and
 | 
						|
.Em endp
 | 
						|
(see below) accordingly.
 | 
						|
.Pp
 | 
						|
The members of a
 | 
						|
.Xr regexp
 | 
						|
structure include at least the following (not necessarily in order):
 | 
						|
.Bd -literal -offset indent
 | 
						|
char *startp[NSUBEXP];
 | 
						|
char *endp[NSUBEXP];
 | 
						|
.Ed
 | 
						|
.Pp
 | 
						|
where
 | 
						|
.Dv NSUBEXP
 | 
						|
is defined (as 10) in the header file.
 | 
						|
Once a successful
 | 
						|
.Fn regexec
 | 
						|
has been done using the
 | 
						|
.Fn regexp ,
 | 
						|
each
 | 
						|
.Em startp Ns - Em endp
 | 
						|
pair describes one substring
 | 
						|
within the
 | 
						|
.Fa string ,
 | 
						|
with the
 | 
						|
.Em startp
 | 
						|
pointing to the first character of the substring and
 | 
						|
the
 | 
						|
.Em endp
 | 
						|
pointing to the first character following the substring.
 | 
						|
The 0th substring is the substring of
 | 
						|
.Fa string
 | 
						|
that matched the whole
 | 
						|
regular expression.
 | 
						|
The others are those substrings that matched parenthesized expressions
 | 
						|
within the regular expression, with parenthesized expressions numbered
 | 
						|
in left-to-right order of their opening parentheses.
 | 
						|
.Pp
 | 
						|
The
 | 
						|
.Fn regsub
 | 
						|
function
 | 
						|
copies
 | 
						|
.Fa source
 | 
						|
to
 | 
						|
.Fa dest ,
 | 
						|
making substitutions according to the
 | 
						|
most recent
 | 
						|
.Fn regexec
 | 
						|
performed using
 | 
						|
.Fa prog .
 | 
						|
Each instance of `&' in
 | 
						|
.Fa source
 | 
						|
is replaced by the substring
 | 
						|
indicated by
 | 
						|
.Em startp Ns Bq
 | 
						|
and
 | 
						|
.Em endp Ns Bq .
 | 
						|
Each instance of
 | 
						|
.Sq \e Ns Em n ,
 | 
						|
where
 | 
						|
.Em n
 | 
						|
is a digit, is replaced by
 | 
						|
the substring indicated by
 | 
						|
.Em startp Ns Bq Em n
 | 
						|
and
 | 
						|
.Em endp Ns Bq Em n .
 | 
						|
To get a literal `&' or
 | 
						|
.Sq \e Ns Em n
 | 
						|
into
 | 
						|
.Fa dest ,
 | 
						|
prefix it with `\e';
 | 
						|
to get a literal `\e' preceding `&' or
 | 
						|
.Sq \e Ns Em n ,
 | 
						|
prefix it with
 | 
						|
another `\e'.
 | 
						|
.Pp
 | 
						|
The
 | 
						|
.Fn regerror
 | 
						|
function
 | 
						|
is called whenever an error is detected in
 | 
						|
.Fn regcomp ,
 | 
						|
.Fn regexec ,
 | 
						|
or
 | 
						|
.Fn regsub .
 | 
						|
The default
 | 
						|
.Fn regerror
 | 
						|
writes the string
 | 
						|
.Fa msg ,
 | 
						|
with a suitable indicator of origin,
 | 
						|
on the standard
 | 
						|
error output
 | 
						|
and invokes
 | 
						|
.Xr exit 2 .
 | 
						|
The
 | 
						|
.Fn regerror
 | 
						|
function
 | 
						|
can be replaced by the user if other actions are desirable.
 | 
						|
.Sh REGULAR EXPRESSION SYNTAX
 | 
						|
A regular expression is zero or more
 | 
						|
.Em branches ,
 | 
						|
separated by `|'.
 | 
						|
It matches anything that matches one of the branches.
 | 
						|
.Pp
 | 
						|
A branch is zero or more
 | 
						|
.Em pieces ,
 | 
						|
concatenated.
 | 
						|
It matches a match for the first, followed by a match for the second, etc.
 | 
						|
.Pp
 | 
						|
A piece is an
 | 
						|
.Em atom
 | 
						|
possibly followed by `*', `+', or `?'.
 | 
						|
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
 | 
						|
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
 | 
						|
An atom followed by `?' matches a match of the atom, or the null string.
 | 
						|
.Pp
 | 
						|
An atom is a regular expression in parentheses (matching a match for the
 | 
						|
regular expression), a
 | 
						|
.Em range
 | 
						|
(see below), `.'
 | 
						|
(matching any single character), `^' (matching the null string at the
 | 
						|
beginning of the input string), `$' (matching the null string at the
 | 
						|
end of the input string), a `\e' followed by a single character (matching
 | 
						|
that character), or a single character with no other significance
 | 
						|
(matching that character).
 | 
						|
.Pp
 | 
						|
A
 | 
						|
.Em range
 | 
						|
is a sequence of characters enclosed in `[]'.
 | 
						|
It normally matches any single character from the sequence.
 | 
						|
If the sequence begins with `^',
 | 
						|
it matches any single character
 | 
						|
.Em not
 | 
						|
from the rest of the sequence.
 | 
						|
If two characters in the sequence are separated by `\-', this is shorthand
 | 
						|
for the full list of
 | 
						|
.Tn ASCII
 | 
						|
characters between them
 | 
						|
(e.g. `[0-9]' matches any decimal digit).
 | 
						|
To include a literal `]' in the sequence, make it the first character
 | 
						|
(following a possible `^').
 | 
						|
To include a literal `\-', make it the first or last character.
 | 
						|
.Sh AMBIGUITY
 | 
						|
If a regular expression could match two different parts of the input string,
 | 
						|
it will match the one which begins earliest.
 | 
						|
If both begin in the same place but match different lengths, or match
 | 
						|
the same length in different ways, life gets messier, as follows.
 | 
						|
.Pp
 | 
						|
In general, the possibilities in a list of branches are considered in
 | 
						|
left-to-right order, the possibilities for `*', `+', and `?' are
 | 
						|
considered longest-first, nested constructs are considered from the
 | 
						|
outermost in, and concatenated constructs are considered leftmost-first.
 | 
						|
The match that will be chosen is the one that uses the earliest
 | 
						|
possibility in the first choice that has to be made.
 | 
						|
If there is more than one choice, the next will be made in the same manner
 | 
						|
(earliest possibility) subject to the decision on the first choice.
 | 
						|
And so forth.
 | 
						|
.Pp
 | 
						|
For example,
 | 
						|
.Sq Li (ab|a)b*c
 | 
						|
could match
 | 
						|
`abc' in one of two ways.
 | 
						|
The first choice is between `ab' and `a'; since `ab' is earlier, and does
 | 
						|
lead to a successful overall match, it is chosen.
 | 
						|
Since the `b' is already spoken for,
 | 
						|
the `b*' must match its last possibility\(emthe empty string\(emsince
 | 
						|
it must respect the earlier choice.
 | 
						|
.Pp
 | 
						|
In the particular case where no `|'s are present and there is only one
 | 
						|
`*', `+', or `?', the net effect is that the longest possible
 | 
						|
match will be chosen.
 | 
						|
So
 | 
						|
.Sq Li ab* ,
 | 
						|
presented with `xabbbby', will match `abbbb'.
 | 
						|
Note that if
 | 
						|
.Sq Li ab* ,
 | 
						|
is tried against `xabyabbbz', it
 | 
						|
will match `ab' just after `x', due to the begins-earliest rule.
 | 
						|
(In effect, the decision on where to start the match is the first choice
 | 
						|
to be made, hence subsequent choices must respect it even if this leads them
 | 
						|
to less-preferred alternatives.)
 | 
						|
.Sh RETURN VALUES
 | 
						|
The
 | 
						|
.Fn regcomp
 | 
						|
function
 | 
						|
returns
 | 
						|
.Dv NULL
 | 
						|
for a failure
 | 
						|
.Pf ( Fn regerror
 | 
						|
permitting),
 | 
						|
where failures are syntax errors, exceeding implementation limits,
 | 
						|
or applying `+' or `*' to a possibly-null operand.
 | 
						|
.Sh SEE ALSO
 | 
						|
.Xr ed 1 ,
 | 
						|
.Xr ex 1 ,
 | 
						|
.Xr expr 1 ,
 | 
						|
.Xr egrep 1 ,
 | 
						|
.Xr fgrep 1 ,
 | 
						|
.Xr grep 1 ,
 | 
						|
.Xr regex 3
 | 
						|
.Sh HISTORY
 | 
						|
Both code and manual page for
 | 
						|
.Fn regcomp ,
 | 
						|
.Fn regexec ,
 | 
						|
.Fn regsub ,
 | 
						|
and
 | 
						|
.Fn regerror
 | 
						|
were written at the University of Toronto
 | 
						|
and appeared in
 | 
						|
.Bx 4.3 tahoe .
 | 
						|
They are intended to be compatible with the Bell V8
 | 
						|
.Xr regexp 3 ,
 | 
						|
but are not derived from Bell code.
 | 
						|
.Sh BUGS
 | 
						|
Empty branches and empty regular expressions are not portable to V8.
 | 
						|
.Pp
 | 
						|
The restriction against
 | 
						|
applying `*' or `+' to a possibly-null operand is an artifact of the
 | 
						|
simplistic implementation.
 | 
						|
.Pp
 | 
						|
Does not support
 | 
						|
.Xr egrep Ns 's
 | 
						|
newline-separated branches;
 | 
						|
neither does the V8
 | 
						|
.Xr regexp 3 ,
 | 
						|
though.
 | 
						|
.Pp
 | 
						|
Due to emphasis on
 | 
						|
compactness and simplicity,
 | 
						|
it's not strikingly fast.
 | 
						|
It does give special attention to handling simple cases quickly.
 |