Character sets are the fundamental elements in a regular expression. A character set is a pattern that matches a single character. The syntax of character sets is as follows:
set := set '#' set0 | set0 set0 := @char [ '-' @char ] | '.' | @smac | '[' [^] { set } ']' | '~' set0
The various character set constructions are:
char
The simplest character set is a single Unicode character.
Note that special characters such as [
and .
must be escaped by prefixing them
with \
(see the lexical syntax, Section 3.1, “Lexical syntax”, for the list of special
characters).
Certain non-printable characters have special escape
sequences. These are: \a
,
\b
, \f
,
\n
, \r
,
\t
, and \v
. Other
characters can be represented by using their numerical
character values (although this may be non-portable):
\x0A
is equivalent to
\n
, for example.
Whitespace characters are ignored; to represent a
literal space, escape it with \
.
char
-char
A range of characters can be expressed by separating
the characters with a ‘-
’,
all the characters with codes in the given range are
included in the set. Character ranges can also be
non-portable.
.
The built-in set ‘.
’
matches all characters except newline
(\n
).
Equivalent to the set
[\x00-\x10ffff] # \n
.
set0
# set1
Matches all the characters in
set0
that are not in
set1
.
[sets
]
The union of sets
.
[^sets
]
The complement of the union of the
sets
. Equivalent to
‘. # [
’.sets
]
~set
The complement of set
.
Equivalent to ‘. #
’set
A set macro is written as $
followed by
an identifier. There are some builtin character set
macros:
$white
Matches all whitespace characters, including newline.
Equivalent to the set
[\ \t\n\f\v\r]
.
$printable
Matches all "printable characters". Currently this
corresponds to Unicode code points 32 to 0x10ffff,
although strictly speaking there are many non-printable
code points in this region. In the future Alex may use a
more precise definition of $printable
.
Character set macros can be defined at the top of the file at the same time as regular expression macros (see Chapter 4, Regular Expression). Here are some example character set macros:
$lls = a-z -- little letters $not_lls = ~a-z -- anything but little letters $ls_ds = [a-zA-Z0-9] -- letters and digits $sym = [ \! \@ \# \$ ] -- the symbols !, @, #, and $ $sym_q_nl = [ \' \! \@ \# \$ \n ] -- the above symbols with ' and newline $quotable = $printable # \' -- any graphic character except ' $del = \127 -- ASCII DEL