Matching Text

A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).

Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. For example, the command:

$ grep [A-Z]* chap[12]

could be transformed by the shell into:

$ grep Array.c Bug.c Comp.c chap1 chap2

and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes as follows:

$ grep "[A-Z]*" chap[12]

Double quotes suffice in most cases, but single quotes are the safest bet.
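
The effect of quoting can be checked with echo, which prints the arguments a command would receive. The following is a minimal sketch, assuming a POSIX shell with mktemp available; the scratch directory and the file contents are made up for the illustration, and LC_ALL=C is set so that [A-Z] matches only uppercase letters.

```shell
export LC_ALL=C

# Work in a scratch directory so the globs have something to match.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch Array.c Bug.c Comp.c
printf 'ALPHA\n' > chap1
printf 'beta\n' > chap2

# Unquoted: the shell expands both globs before grep would ever run.
unquoted=$(echo grep [A-Z]* chap[12])
echo "$unquoted"    # grep Array.c Bug.c Comp.c chap1 chap2

# Quoted: the pattern survives intact; only chap[12] is expanded.
quoted=$(echo grep '[A-Z]*' chap[12])
echo "$quoted"      # grep [A-Z]* chap1 chap2
```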

Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
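
The two meanings of ? can be compared side by side. This sketch assumes a POSIX shell and uses grep -E as the equivalent of egrep; the filenames are hypothetical.

```shell
export LC_ALL=C
dir=$(mktemp -d)
cd "$dir" || exit 1
touch chap chap1 chap12

# Filename expansion: ? stands for exactly one character,
# so only chap1 matches (chap is too short, chap12 too long).
glob_matches=$(echo chap?)
echo "$glob_matches"     # chap1

# Regular expression: ? makes the preceding 1 optional,
# so the whole-line match (-x) accepts both chap and chap1.
regex_matches=$(printf 'chap\nchap1\nchap12\n' | grep -E -x 'chap1?')
echo "$regex_matches"    # chap, then chap1
```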

Different metacharacters have different meanings, depending upon where they are used. In particular, regular expressions used for searching through text (matching) have one set of metacharacters, while replacement text uses a different set. Both sets also vary somewhat from program to program. This section covers the metacharacters used for searching and replacing, with descriptions of the variants in the different utilities.

The characters in the following table have special meaning only in search patterns:

Character    Pattern
.            Match any single character except newline. Can match newline in awk.
*            Match any number (or none) of the single character that immediately precedes it. The preceding character can also be a regular expression. For example, since . (dot) means any character, .* means "match any number of any character."
^            Match the following regular expression at the beginning of the line or string.
$            Match the preceding regular expression at the end of the line or string.
\            Turn off the special meaning of the following character.
[ ]          Match any one of the enclosed characters. A hyphen (-) indicates a range of consecutive characters. A circumflex (^) as the first character in the brackets reverses the sense: it matches any one character not in the list. A hyphen or close bracket (]) as the first character is treated as a member of the list. All other metacharacters are treated as members of the list (i.e., literally).
{n,m}        Match a range of occurrences of the single character that immediately precedes it. The preceding character can also be a metacharacter. {n} matches exactly n occurrences; {n,} matches at least n occurrences; and {n,m} matches any number of occurrences between n and m. n and m must be between 0 and 255, inclusive.
\{n,m\}      Just like {n,m}, but with backslashes in front of the braces.
\( \)        Save the pattern enclosed between \( and \) into a special holding space. Up to nine patterns can be saved on a single line. The text matched by the subpatterns can be "replayed" in substitutions by the escape sequences \1 to \9.
\n           Replay the nth subpattern enclosed in \( and \) into the pattern at this point. n is a number from 1 to 9, with 1 starting on the left.
\< \>        Match characters at beginning (\<) or end (\>) of a word.
+            Match one or more instances of preceding regular expression.
?            Match zero or one instances of preceding regular expression.
|            Match the regular expression specified before or after.
( )          Apply a match to the enclosed group of regular expressions.
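
Several of these search metacharacters can be tried against a small sample. The sketch below assumes a POSIX grep, with grep -E standing in for egrep; the sample words and variable names are invented for the illustration.

```shell
export LC_ALL=C
sample=$(printf 'cat\ncaat\nct\nconcat\ncats\nabab\n')

# . matches any single character: c.t is found in cat, concat, and cats.
dot=$(printf '%s\n' "$sample" | grep -c 'c.t')                 # 3

# * allows zero or more of the preceding character (-x: whole line).
star=$(printf '%s\n' "$sample" | grep -x 'ca*t')               # cat, caat, ct

# ^ and $ anchor the pattern to the start and end of the line.
anchored=$(printf '%s\n' "$sample" | grep '^cat$')             # cat

# [^a] matches any one character other than a.
bracket=$(printf '%s\n' "$sample" | grep -c '^c[^a]')          # 2 (ct, concat)

# \{2\} requires exactly two of the preceding character.
interval=$(printf '%s\n' "$sample" | grep 'ca\{2\}t')          # caat

# \( \) saves a subpattern and \1 replays it.
backref=$(printf '%s\n' "$sample" | grep '\(ab\)\1')           # abab

# Extended syntax: + needs at least one a, so ct is excluded.
plus=$(printf '%s\n' "$sample" | grep -E -x 'ca+t')            # cat, caat

# Extended syntax: ( ) groups, and ? makes the whole group optional.
group=$(printf '%s\n' "$sample" | grep -E -x '(con)?cat')      # cat, concat

# Extended syntax: | matches either alternative.
alt=$(printf '%s\n' "$sample" | grep -E -x 'ct|cats')          # ct, cats
```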

Many Unix systems allow the use of POSIX character classes within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
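
As a brief illustration of a character class inside a bracket expression, the following sketch assumes a grep that supports POSIX classes; the input lines are made up.

```shell
export LC_ALL=C
input=$(printf 'abc123\nfoo bar\n---\nA1\n')

# [[:alnum:]] matches one letter or digit; with grep -E, + repeats it.
# -x keeps only lines made up entirely of alphanumeric characters.
words=$(printf '%s\n' "$input" | grep -E -x '[[:alnum:]]+')
echo "$words"       # abc123, then A1

# A leading ^ inside the outer brackets negates the class:
# count the lines containing at least one non-alphanumeric character.
nonalnum=$(printf '%s\n' "$input" | grep -c '[^[:alnum:]]')
echo "$nonalnum"    # 2
```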

Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double-check your system's version. Full descriptions were provided in the previous section.

Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters listed in this table are meaningful only in a search pattern.

In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:

When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by /, although any delimiter works (except in awk). Here are some example patterns:
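
The quoting and delimiter rules can be tried out with a short sketch. It assumes a POSIX shell, grep, and sed; the sample text and the variable named end are hypothetical.

```shell
export LC_ALL=C
text=$(printf 'total cost\ncost total\n')

# Single quotes keep $ away from the shell, so grep sees an
# end-of-line anchor and matches only the line ending in cost.
single=$(printf '%s\n' "$text" | grep 'cost$')
echo "$single"    # total cost

# Inside double quotes the shell expands $end before grep runs,
# so grep receives the pattern "total cost" rather than "total $end".
end=cost
double=$(printf '%s\n' "$text" | grep "total $end")
echo "$double"    # total cost

# sed accepts a delimiter other than /, which is convenient
# when the pattern itself contains slashes.
path=$(echo '/usr/local/bin' | sed 's|/local||')
echo "$path"      # /usr/bin
```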

Examples of searching and replacing

The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon. A space and a TAB are each marked by a special symbol.

Finally, here are some sed examples for transposing words. A simple transposition of two words might look like this:

s/die or do/do or die/

The real trick is to use hold buffers to transpose variable patterns. For example:

s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/
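
Both substitutions can be run from the shell by piping a line through sed. This is a minimal sketch, assuming a POSIX sed; the input strings are chosen for the illustration.

```shell
export LC_ALL=C

# Fixed transposition: works only for this exact wording.
fixed=$(echo 'die or do' | sed 's/die or do/do or die/')
echo "$fixed"      # do or die

# Hold buffers: \1 and \2 replay whatever each \( \) group matched,
# so the original capitalization travels with each word.
swapped=$(echo 'Die or do' | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/')
echo "$swapped"    # do or Die
```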