26. August 2006, 18:04 | by WD Milner

The power and usefulness of good notation reaches beyond traditional programming languages. Regular expressions are one of the most broadly applicable specialized “languages”: a compact and expressive notation for describing patterns of text. Besides being interesting from an algorithmic viewpoint, they are quite useful and, in their simpler forms, easy to implement.

Regular expressions come in several varieties. Perhaps the most familiar is the so-called “wildcards” used on the command line to match patterns of filenames and similar data. Typically, a “*” is taken to mean “any string of characters”. Similarly, “?” is sometimes used to indicate “any single character”.
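As a small illustration of the wildcard behaviour described above, here is a sketch using Python's standard fnmatch module, which implements shell-style patterns. The file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["notes.txt", "notes.bak", "a.txt", "ab.txt"]

# "*" stands for any string of characters, including the empty string.
txt_files = [n for n in names if fnmatchcase(n, "*.txt")]

# "?" stands for exactly one character.
one_char_txt = [n for n in names if fnmatchcase(n, "?.txt")]

print(txt_files)     # ['notes.txt', 'a.txt', 'ab.txt']
print(one_char_txt)  # ['a.txt']
```

fnmatchcase is used rather than fnmatch so that the comparison is case-sensitive on every platform.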

Regular expressions are pervasive in the UNIX world (including Linux) and can be found in editors, tools, and scripting languages. While the variations between programs might suggest that they are an ad hoc mechanism, they are a language in the technical sense, with a formal grammar that specifies the structure and a precise meaning that can be attached to each formulation. A good implementation can run extremely fast as well. Perhaps the most widely used program that makes use of regular expressions is grep.

A regular expression (often just called “regex”) is a sequence of characters that defines a pattern. Most characters in the pattern match themselves in a target string. A few characters are used in patterns as meta-characters to indicate grouping, positioning or repetition.

In POSIX regular expressions, “^” represents the start of a string and “$” the end. Thus “^x” matches x only at the beginning of a string, “x$” matches x only at the end of a string, and “^x$” matches x only if it is the only character in the string. “^$” represents an empty string.
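The anchors can be tried out with a short sketch in Python. Note the assumptions: Python's re module is not a POSIX engine (its “$” also matches just before a trailing newline), but for these simple strings it behaves the same way. The end anchor is written after the character, as in “x$”. The helper name matches is made up for the example.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

print(matches("^x", "xyz"))   # True:  x at the start
print(matches("^x", "yxz"))   # False: x is not at the start
print(matches("x$", "yzx"))   # True:  x at the end
print(matches("^x$", "x"))    # True:  x is the whole string
print(matches("^x$", "xx"))   # False: more than one character
print(matches("^$", ""))      # True:  the empty string
```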

The meta-character “.” matches any single character. Thus “x.y” matches “xay”, “x1y”, etc. but not, for example, “xy” or “xyxyxy”. Combined with the two meta-characters from the previous paragraph, “^.$” represents any single character string.
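The examples in this paragraph can be checked with a brief Python sketch (again using the re module as a stand-in for a POSIX engine; the variable names are illustrative only).

```python
import re

# Strings taken from the text above.
candidates = ["xay", "x1y", "xy", "xyxyxy"]

# "x.y" needs exactly one character between the x and the y.
hits = [s for s in candidates if re.search("x.y", s)]
print(hits)    # ['xay', 'x1y']

# "^.$" accepts only strings that are exactly one character long.
single = [s for s in ["a", "", "ab"] if re.search("^.$", s)]
print(single)  # ['a']
```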

A set of characters placed inside “[]” matches any single character of the enclosed group. For example, “[0123456789]” matches a single digit. This could also be abbreviated “[0-9]”.
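A quick way to convince yourself that the two spellings are equivalent is to run both over the same inputs. This is a minimal Python sketch with invented sample strings.

```python
import re

long_form = re.compile("[0123456789]")
short_form = re.compile("[0-9]")

# Hypothetical sample strings.
samples = ["7", "a", "room 42", ""]

# search() succeeds if a digit appears anywhere in the string.
long_results = [bool(long_form.search(s)) for s in samples]
short_results = [bool(short_form.search(s)) for s in samples]

print(long_results)                   # [True, False, True, False]
print(long_results == short_results)  # True
```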

All these can be built upon using parentheses for grouping, “|” for alternatives, “*” for zero or more occurrences, “+” for one or more occurrences and “?” for zero or one occurrence. Additionally, “\” can be used as a prefix to quote a meta-character and turn off its special meaning.
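Each of these operators can be exercised in isolation. The sketch below uses Python's re.fullmatch, so a pattern must account for the entire string; the helper name full and the sample patterns are assumptions made for the example.

```python
import re

def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"(ab)*", ""))         # True:  zero occurrences of the group
print(full(r"(ab)*", "abab"))     # True:  two occurrences
print(full(r"(ab)+", ""))         # False: "+" needs at least one
print(full(r"colou?r", "color"))  # True:  the "u" is optional
print(full(r"colou?r", "colour")) # True
print(full(r"cat|dog", "dog"))    # True:  either alternative
print(full(r"a\.b", "a.b"))       # True:  "\." is a literal period
print(full(r"a\.b", "axb"))       # False: no longer "any character"
```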

These can be combined into quite specific and complex patterns. As an example, “\.[0-9]+” matches a period followed by one or more digits; “[0-9]+\.?[0-9]*” matches one or more digits followed by an optional period and zero or more further digits; “(\+|-)” matches a plus or minus sign; “[eE](\+|-)?[0-9]+” matches an “e” or “E” followed by an optional plus or minus sign and one or more digits. All of these can be combined into a single pattern expression that matches floating point numbers:

(\+|-)?([0-9]+\.?[0-9]*|\.[0-9]+)([eE](\+|-)?[0-9]+)?
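One way to assemble the four fragments is sketched below in Python. The way the pieces are joined (an optional sign, then either form of the mantissa, then an optional exponent) is an assumption based on the descriptions above, and the variable names are illustrative.

```python
import re

# The four fragments described in the text.
sign = r"(\+|-)"
digits_first = r"[0-9]+\.?[0-9]*"
dot_first = r"\.[0-9]+"
exponent = r"[eE](\+|-)?[0-9]+"

# Assumed combination: optional sign, a mantissa in either form,
# and an optional exponent.
float_pattern = f"{sign}?({digits_first}|{dot_first})({exponent})?"

def is_float(text):
    """Return True if the whole of text looks like a floating point number."""
    return re.fullmatch(float_pattern, text) is not None

accepted = [s for s in ["3.14", "-.5", "6.02e23", "1E-9", "+7."] if is_float(s)]
rejected = [s for s in [".", "e5", "1.2.3", ""] if not is_float(s)]

print(accepted)  # ['3.14', '-.5', '6.02e23', '1E-9', '+7.']
print(rejected)  # ['.', 'e5', '1.2.3', '']
```

Using fullmatch rather than search matters here: with an unanchored search, a string such as “1.2.3” would be accepted because it contains a valid number.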


Full regular expression usage would also include character classes such as “[a-zA-Z]” to match a single alphabetic character.

Most systems include a regular expression library, usually called regex or regexp, for use in programs, but if one is not available it is relatively easy to implement your own for a moderate subset of the full regular expression vocabulary.
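To give a feel for how small such a subset matcher can be, here is a sketch in Python. It is modelled on the well-known compact recursive matcher presented in Kernighan and Pike's The Practice of Programming, and supports only literal characters, “.”, “^”, “$” and “*” applied to a single character. It is not how any particular library is implemented, and the function names are illustrative.

```python
def match(regexp, text):
    """Return True if regexp matches anywhere in text."""
    if regexp.startswith("^"):
        # Anchored: only try the very start of the text.
        return match_here(regexp[1:], text)
    # Unanchored: try every starting position, including the empty tail.
    for start in range(len(text) + 1):
        if match_here(regexp, text[start:]):
            return True
    return False

def match_here(regexp, text):
    """Return True if regexp matches at the beginning of text."""
    if not regexp:
        return True                      # pattern used up: success
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""                # "$" must sit at the end
    if text and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(char, regexp, text):
    """Match zero or more of char, followed by the rest of regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1                       # consume one more char and retry
        else:
            return False

print(match("x.y", "xay"))        # True
print(match("x.y", "xy"))         # False
print(match("ab*c", "abbbc"))     # True
print(match("^a.*z$", "abcz"))    # True
```

Grouping, alternation, “+”, “?”, character classes and escaping are deliberately left out; adding them is where a hand-written matcher starts to grow.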

- 30 -



