INTRODUCING REGULAR EXPRESSIONS

Regular expressions provide a means for matching strings and characters in order to obtain only the desired information (for example, from a large input list). Also called regexps, regular expressions can be very powerful and complex; however, they don't need to be.

For the most part, regexps are universal, but there are differences between implementations. For instance, regexps used on the command line or with grep may differ slightly from those used by Perl. The following are some of the character sequences used in regexps:

. Matches any single character

^ Matches the empty string at the beginning of a line

$ Matches the empty string at the end of a line

\< Matches the empty string at the beginning of a word

\> Matches the empty string at the end of a word

? Matches the preceding item at most once

* Matches the preceding item zero or more times

+ Matches the preceding item at least once

[x] Matches any single character listed in the brackets
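
As a rough sketch of how a few of these sequences behave, the commands below run grep -E over a small list of words. The words are made up for illustration, and the example assumes a grep that supports \< and \> (GNU grep does).

```shell
# Hypothetical sample input: one made-up word per line.
sample=$(printf '%s\n' 'cat' 'cot' 'coat' 'ct' 'concat')

# .  any single character: "c", exactly one character, then "t"
dot=$(printf '%s\n' "$sample" | grep -E '^c.t$')

# ?  preceding item at most once: "ct" or "cot"
opt=$(printf '%s\n' "$sample" | grep -E '^co?t$')

# +  preceding item at least once, applied to a bracket list
plus=$(printf '%s\n' "$sample" | grep -E '^c[oa]+t$')

# *  preceding item zero or more times, so "ct" also qualifies
star=$(printf '%s\n' "$sample" | grep -E '^c[oa]*t$')

# \< and \>  beginning and end of a word
word_start=$(printf '%s\n' "$sample" | grep -E '\<cat')
word_end=$(printf '%s\n' "$sample" | grep -E 'cat\>')

printf 'dot:\n%s\n' "$dot"
printf 'opt:\n%s\n' "$opt"
printf 'plus:\n%s\n' "$plus"
printf 'star:\n%s\n' "$star"
printf 'word_start:\n%s\n' "$word_start"
printf 'word_end:\n%s\n' "$word_end"
```

In this sketch, ^ and $ pin each pattern to the whole line, which is why "concat" is rejected by the first four patterns but still accepted by 'cat\>', where "cat" only has to sit at the end of a word.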

If, for example, you had a list of files and wanted to find only those files starting with an alphabetical character and ending in ".bz2" or ".gz", you might use:

# grep -E '\<[a-z].*\.(bz2|gz)\>' myfile

This runs grep against the file "myfile" with extended regexp syntax enabled (the -E option). The regexp here is:

\<[a-z].*\.(bz2|gz)\>

This tells grep to look for a word that starts (\<) with a lowercase letter ([a-z]), followed by any characters repeated any number of times (.*), then a literal period and the string "bz2" OR "gz" (\.(bz2|gz)) at the end of a word (\>). Notice that, because the period is itself a regexp metacharacter that matches any single character, it is 'escaped' with a backslash so that it is treated literally.
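
To try the command without touching real data, one option is to build a throwaway "myfile" first. The following is only a sketch: the file names in the list are invented for illustration, the temporary directory is an assumption of this example, and GNU grep is assumed for \< and \>.

```shell
# Hypothetical sample list; these file names are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' 'archive.tar.gz' 'backup.bz2' '9notes.gz' 'notes.txt' 'data.gzip' > "$workdir/myfile"

# Same regexp as above, run against the sample list.
matches=$(grep -E '\<[a-z].*\.(bz2|gz)\>' "$workdir/myfile")
printf '%s\n' "$matches"

# Clean up the temporary directory.
rm -r "$workdir"
```

With this sample list, only "archive.tar.gz" and "backup.bz2" are printed. "9notes.gz" begins with a digit, "notes.txt" has neither suffix, and in "data.gzip" the "gz" is followed by more word characters, so \> does not match there.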