Thursday, September 22, 2011

Regular expressions table



The Perl You Need to Know: Basic Regular Expression Syntax
.
any single character The dot (.) can be used as a placeholder for any character. Examples:

"do." would match "dog", "dot", "doe", etc.
"d..r" would match "door" and "deer".
*
zero or more of the previous character The asterisk (*) specifies that the previous character may appear zero or more times in sequence. Examples:

"do.*" would match "dog", "done", "doppelganger", etc.
(why? "d-o- followed by zero or more of any character")

"to*" would match "to" and "too"
(why? "t-o- followed by zero or more o's")

"fre*.." would match "frat", "free", "from"
(why? "f-r- followed by zero or more e's followed by any two characters")
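
A minimal sketch of the "zero or more" behaviour, written in Python as an assumption (the page is about Perl, but the asterisk behaves the same way in Python's re module). The word lists are illustrative only.

```python
import re

# "to*" : t followed by zero or more o's, so a bare "t" also qualifies
to_hits = [w for w in ["t", "to", "too", "top"] if re.fullmatch(r"to*", w)]
print(to_hits)  # ['t', 'to', 'too']

# "fre*.." : in "frat" and "from" the e* matches zero e's; in "free"
# the two dots can consume the e's instead
fre_hits = [w for w in ["frat", "free", "from", "fr"]
            if re.fullmatch(r"fre*..", w)]
print(fre_hits)  # ['frat', 'free', 'from']
```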
+
one or more of the previous character The plus sign (+) demands that there be at least one of the previous character in sequence; similar to (*) but slightly more strict. Examples:

"fre+.." would match "freak", "freeze", "fresh"
(why? "f-r- followed by one or more e's followed by any two characters")
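
To see how (+) differs from (*), the following sketch runs both patterns over the same made-up word list. Python's re module is assumed here in place of Perl; the two agree on these quantifiers.

```python
import re

words = ["frat", "from", "freak", "fresh", "freeze"]

# "fre+.." needs at least one e after "fr"
plus_hits = [w for w in words if re.fullmatch(r"fre+..", w)]
print(plus_hits)  # ['freak', 'fresh', 'freeze']

# "fre*.." also accepts words with no e at all
star_hits = [w for w in words if re.fullmatch(r"fre*..", w)]
print(star_hits)  # ['frat', 'from', 'freak', 'fresh', 'freeze']
```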
?
zero or one of the previous character The question mark (?) says that there should be zero or one of the previous character but not more than one. Unlike (*) and (+), it never allows the character to repeat. Examples:

"ton?e" would match "toe" and "tone"
(why? "t-o- followed by zero or one n followed by e")
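
A small sketch of the optional character, assuming Python's re module as a stand-in for Perl. The extra words "tonne" and "to" are added only to show what is rejected.

```python
import re

# "ton?e" : t, o, an optional single n, then e
optional_hits = [w for w in ["toe", "tone", "tonne", "to"]
                 if re.fullmatch(r"ton?e", w)]
print(optional_hits)  # ['toe', 'tone']
```

"tonne" is rejected because (?) allows at most one n, and "to" is rejected because the final e is still required.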
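( )
grouping The parentheses ( ) group patterns together so that they are treated as a single unit, for instance, to logically combine two or more patterns. Inside a group, the vertical bar (|) separates alternatives and reads as "or". Example:

"(dog|cat)" would match "dog" and "cat"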
(why? "dog or cat")
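
The sketch below illustrates why the grouping matters: the parentheses limit how far the "or" reaches. It is written in Python as an assumption (Perl treats these patterns the same way), and the trailing "s" is added purely for illustration.

```python
import re

words = ["dogs", "cats", "dog", "cat"]

# "(dog|cat)s" : either dog or cat, then an s
grouped = [w for w in words if re.fullmatch(r"(dog|cat)s", w)]
print(grouped)  # ['dogs', 'cats']

# Without the parentheses the alternatives are "dog" and "cats"
ungrouped = [w for w in words if re.fullmatch(r"dog|cats", w)]
print(ungrouped)  # ['cats', 'dog']
```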
[]
any character from the set The square brackets ([]) can be used as a placeholder for a single character which matches any of a set of characters. Confusing, at first, but some examples should clarify:

"ta[pb]" would match "tap" and "tab"
(why? "t-a- followed by one character from the set of pb")

"r[aeiou]t" would match "rat", "ret", "rot", "rut"
(why? "r- followed by one character from the set of vowels followed by t")

"r[aeiou]+t" would match "rat" (plus all of the above), "riot", "root", etc.
(why? "r- followed by one or more vowels followed by t")
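
The set examples above can be tried out with a short sketch. Python's re module is assumed here in place of Perl, and the word list (including the non-matching "rt" and "rest") is made up for illustration.

```python
import re

words = ["rat", "ret", "rot", "rut", "riot", "root", "rt", "rest"]

# "r[aeiou]t" : exactly one vowel between r and t
one_vowel = [w for w in words if re.fullmatch(r"r[aeiou]t", w)]
print(one_vowel)  # ['rat', 'ret', 'rot', 'rut']

# "r[aeiou]+t" : one or more vowels between r and t
many_vowels = [w for w in words if re.fullmatch(r"r[aeiou]+t", w)]
print(many_vowels)  # ['rat', 'ret', 'rot', 'rut', 'riot', 'root']
```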
[^]
any character not from the set Placing a caret (^) immediately after the opening square bracket negates the set, meaning the character must be one that is not within the set. This is a useful way of specifying a large set of characters; for instance, consonants are roughly "not vowels", although [^aeiou] also matches digits, punctuation, and uppercase letters. Examples:

"t[^aeiou]+.*s" matches "thanks", "this", "trappings", etc.
(why? "t- followed by one or more of any character which is not a vowel followed by zero or more of any character followed by an s")
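
A minimal sketch of the negated set, assuming Python's re module rather than Perl. The inputs "tops" and "t9s" are invented to show one rejection and one perhaps surprising match.

```python
import re

words = ["thanks", "this", "trappings", "tops", "t9s"]

# "t[^aeiou]+.*s" : t, at least one non-vowel, anything, then s
negated_hits = [w for w in words if re.fullmatch(r"t[^aeiou]+.*s", w)]
print(negated_hits)  # ['thanks', 'this', 'trappings', 't9s']
```

"tops" fails because the character after t is a vowel, while "t9s" passes because a digit is also "not a vowel".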
{min,max}
range of occurrences The curly braces ({}) are used to require that the preceding character or set of characters occur a certain number of times: {n} means exactly n times, {n,} means n or more times, and {n,m} means between n and m times. Examples:

"[a-z]{3}" would match 3 consecutive lowercase letters (not necessarily the same letter).

"[0-9]{3,}" would match 3 or more consecutive digits.

"[A-Z]{2,5}" would match between 2 and 5 consecutive uppercase letters.


Character Classes
\d Any digit [0-9]
\D Any non-digit [^0-9]
\w Any alphanumeric character or underscore [a-zA-Z0-9_]
\W Any character other than an alphanumeric or underscore [^a-zA-Z0-9_]
\s Any whitespace character [ \t\n\r\f]
\S Any non-whitespace character [^ \t\n\r\f]

Anchor Sequences
^ Beginning of data string
$ End of data string
\b A word boundary
\B Any place except a word boundary
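
A short sketch of the character classes and anchors in use, written in Python as an assumption (the page is about Perl). In Python 3, \d, \w and \s also match non-ASCII Unicode characters unless re.ASCII is passed, so that flag is used here to keep the bracket equivalents from the table exact. The sample strings are invented.

```python
import re

text = "Order 66 shipped to room 101"

# \d+ : runs of digits
numbers = re.findall(r"\d+", text, flags=re.ASCII)
print(numbers)  # ['66', '101']

# \b\w+\b : whole words, delimited by word boundaries
tokens = re.findall(r"\b\w+\b", text, flags=re.ASCII)
print(tokens)  # ['Order', '66', 'shipped', 'to', 'room', '101']

# ^ and $ : the digits must span the entire string
all_digits = bool(re.search(r"^\d+$", "12345", flags=re.ASCII))
not_all_digits = bool(re.search(r"^\d+$", "123a5", flags=re.ASCII))
print(all_digits, not_all_digits)  # True False

# \b keeps "cat" from matching inside "concat" or "cats"
whole_word = re.findall(r"\bcat\b", "cat concat cats")
print(whole_word)  # ['cat']
```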


Escape Sequences
\n Newline character, aka linefeed. This is the typical end-of-line character.
\r Carriage return character.
\t Tab character.
\e Escape character.
\xFF The character whose hexadecimal code is written in place of "FF" (for example, \x41 is "A").
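
The escape sequences can be sketched in Python as well, with one caveat: this is an assumption-laden translation from Perl, and Python's re module does not accept \e, so the escape character is written as \x1b below. The sample strings are made up.

```python
import re

# \t : split a tab-separated record into fields
fields = re.split(r"\t", "name\tage\tcity")
print(fields)  # ['name', 'age', 'city']

# \r and \n : split on either Windows (\r\n) or Unix (\n) line endings
lines = re.split(r"\r?\n", "one\r\ntwo\nthree")
print(lines)  # ['one', 'two', 'three']

# \x41 : the character with hexadecimal code 41, which is "A"
hex_hit = re.search(r"\x41", "xyzA").group()
print(hex_hit)  # A

# \x1b : the escape character (Perl's \e), here at the start of the string
esc_pos = re.search(r"\x1b", "\x1b[0m").start()
print(esc_pos)  # 0
```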