Roy Osherove


Introduction To Regular Expressions


Note: This is part 1 of a series of articles.

What will be covered

In this article I’ll demonstrate the use of the following:

·        Using the System.Text.RegularExpressions Namespace classes for text parsing

·        Using System.IO Namespace classes to read text files

 

Tools Used

·        Expresso – A tool for testing Regular Expressions

 

Introduction to regular expressions

Regular expressions (a.k.a. “Regex”) are a very powerful text-parsing language used widely across the technology industry.

Their main use is to find, within a given string, text that matches a pattern whose rules are expressed in this language.

In software applications, we usually test whether the Regex found a match for the pattern we defined and, based on the success of the search, perform certain actions, such as validating an email address or finding HTTP links in a web page.

The Regex language can be used for simple searches as well as for more complicated patterns, such as finding all words that appear more than once within a given text. Another use for Regex is to replace the string matching the pattern with another string (for example, making all lowercase words uppercase, or replacing all occurrences of “Five” with “5”).
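As a rough illustration of the replacement use just described, here is a minimal sketch using Regex.Replace. The class name, method names and sample sentences are my own, chosen only for this example, and the “lowercase word” pattern assumes plain ASCII letters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces every whole-word occurrence of "Five" with "5".
    public static string FiveToDigit(string input)
    {
        return Regex.Replace(input, @"\bFive\b", "5");
    }

    // Upper-cases every word made up only of lowercase ASCII letters
    // (an assumption made for this sketch; other alphabets are ignored).
    public static string UppercaseLowercaseWords(string input)
    {
        return Regex.Replace(input, @"\b[a-z]+\b", m => m.Value.ToUpperInvariant());
    }

    public static void Main()
    {
        Console.WriteLine(FiveToDigit("Five apples and Five pears"));
        Console.WriteLine(UppercaseLowercaseWords("Hello big world"));
    }
}
```

The second method passes a delegate instead of a fixed replacement string, so each match can be transformed individually.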

 

A simple example would be to find a match for a pattern fitting an email address.

Let’s take a look at how an email address is built. Taking Royo@SomeWhere.com as an example:

We know there is a pattern here, since we know there will always be at least 5 parts:

-         A string

-         A “@” character

-         A string

-         A “.”

-         Another String

 

If you received a string and had to determine whether it is a legitimate email address, you would have several options:

·        Parse the string with cumbersome code that checks for the specific characters mentioned above

o       You need to check for hardcoded characters in the given string using String.IndexOf() and similar methods

o       You need to write many lines of code to perform complicated pattern searching (the email pattern is very simple, yet it would take a fairly complicated “if” structure to apply correctly)

·        Use the Regex class found in the System.Text.RegularExpressions namespace to check for an email pattern match

o       2 lines of code
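To make the comparison concrete, here is a small sketch of both options side by side. The class and method names are hypothetical, the hand-written check is deliberately rough, and the pattern is anchored with ^ and $ (my addition) so that the whole string has to fit the email shape rather than just a part of it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailCheck
{
    // Option 1: hand-written check using IndexOf.
    // It only verifies that some text precedes '@', that some text sits
    // between '@' and the next '.', and that something follows that dot.
    public static bool LooksLikeEmailManual(string s)
    {
        int at = s.IndexOf('@');
        if (at <= 0) return false;            // no '@', or nothing before it
        int dot = s.IndexOf('.', at + 1);
        if (dot <= at + 1) return false;      // no dot, or nothing between '@' and '.'
        return dot < s.Length - 1;            // something must follow the dot
    }

    // Option 2: the Regex class does the same job in a single call.
    public static bool LooksLikeEmailRegex(string s)
    {
        return Regex.IsMatch(s, @"^\w+@\w+\.\w+(\.\w+)*$");
    }

    public static void Main()
    {
        string[] samples = { "royo@somewhere.com", "@somewhere.com", "something@.com" };
        foreach (string sample in samples)
        {
            Console.WriteLine(sample + " manual=" + LooksLikeEmailManual(sample)
                + " regex=" + LooksLikeEmailRegex(sample));
        }
    }
}
```

Note that the manual version still does not check which characters appear in each part, which is where the “if” structure would keep growing.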

 

 

A practical example to try

Installing and using Expresso

In order to practice the following example, you will need to download and install Expresso – a regular expression testing tool and much more. After you have installed it, fire it up.

The main screen is divided into 3 main sections:

- The regular expression input pane (upper part)

            This is where we try out different expressions and test the parsing outcome in the output pane.

- The data input pane (lower left)

            The data to be parsed and tested. This is where we enter our sample data.

- The output pane (lower right)

            The outcome of the Regex parsing engine.

Expresso starts up with a predefined sample expression already inside its expression pane.

You can tell by the "# Dates" inside that this expression deals with parsing for dates inside text.

It already has some sample data filled in as well. To see what the parsing results look like, just click the "Find Matches" button and see what happens.

The output pane is filled with all the strings that matched the particular pattern in the expression pane.

Notice that the results are also hierarchical and can be expanded. This is another feature of Regex, which allows you to define groups to be captured and referred to by name within an expression. But enough with this sample, since it is a bit too complicated for starters.

Let's do an easier sample, and I'll explain from the start.

 

Say we have some lines of text that we need to parse and retrieve email addresses from.

[Paste the following into the data pane in expresso]

This is a piece of text. It represents some email addresses.

For example royo@somewhere.com is an email address. Another one could be

Perhaps just junk@address.co.au or maybe it's @234.34.23.com.

For conclusion, here are some more junk email addresses. @. something@.

@somewhere. @.com something@.com @somwehre.com

This is the end of the text to be pasted into expresso.

 

 

 

Here’s a pattern to find a match for an email address using regular expressions:

\w*@\w*\.\w*((\.\w*)*)?

 

Like almost anything in languages, this can be expressed in more than one way, since the language itself offers many possibilities, but let’s dissect this expression:

 

“\w” – An escape sequence meaning “characters usually found in words”.

This means alphanumeric characters and the underscore; it excludes spaces, punctuation marks and tabs.

Basically, anything that could be considered a non-word character is excluded.

“*” – The asterisk means “0 or more instances of” and operates on the expression to the left of it.

Thus, “\w*” means “match zero or more occurrences of any word-like character”.

“@” – Match one instance of the character “@”.

“\w*” – “match zero or more occurrences of any word-like character”.

“\.” – Match one instance of the dot character (this requires an escaping backslash, since “.” is a reserved operator in the regex language).

“\w*” – “match zero or more occurrences of any word-like character”.

“()” – Groups what is inside the round brackets into a logical group.

“(\.\w*)*” – Match zero or more instances of a dot followed by zero or more instances of any word-like character.

“()?” – Optional group: match this if it appears in the text.

According to the last 3 explanations, here’s an overall explanation of the last part of the expression:

“((\.\w*)*)?” – means there might be zero or more instances of a dot followed by a word (“.something.anotherthing.bla” would be matched).

To conclude, here’s the description of the expression as a whole:

Match any word followed by an “@” sign, followed by any word, followed by a dot, followed by any word, followed by zero or more instances of a dot and a word after it.

 

Paste the pattern mentioned above into the expression pane of Expresso. Once you've done that, click the "Find Matches" button.

You can see in the output pane that the Regex engine retrieved only part of the data that appears in the data pane. It returned only strings which match the expression that was given.

Can you see a problem yet?

Can you find a logical problem with this expression?

These strings were matched as well, although they are not legitimate email addresses.

 

@.

something@.

@somewhere.

@.com

something@.com

@somwehre.com

 

 

The problem is that we tested for zero or more instances of a character, so these strings were accepted as well. We needed to specify “1 or more instances of” instead.

This is easily expressed using the “+” operator instead of the "*" operator.

Try to fix the expression yourself and hit "Find Matches" again.

 

Here’s the fixed expression to catch only valid instances:

\w+@\w+\.\w+((\.\w+)*)?

This will catch only the two legitimate email addresses.
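The difference between the two quantifiers can also be checked outside Expresso. The following is only a sketch: the class name, the helper method and the shortened sample text are my own, and only the two patterns come from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StarVersusPlus
{
    // The first attempt: "*" also accepts empty parts.
    public const string Loose = @"\w*@\w*\.\w*((\.\w*)*)?";

    // The fixed expression: "+" requires at least one character per part.
    public const string Strict = @"\w+@\w+\.\w+((\.\w+)*)?";

    // Returns the text of every match of the pattern, in order of appearance.
    public static List<string> FindAll(string pattern, string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // A shortened stand-in for the sample data used above.
        string text = "royo@somewhere.com @. something@. @.com junk@address.co.au";

        Console.WriteLine("Loose:  " + string.Join(" | ", FindAll(Loose, text).ToArray()));
        Console.WriteLine("Strict: " + string.Join(" | ", FindAll(Strict, text).ToArray()));
    }
}
```

On this short sample, the loose pattern returns all five strings that contain “@”, while the strict one returns only the two well-formed addresses.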

 

As you can see, it's pretty easy to work with regular expressions once you know the syntax.

A table of Regex syntax can be found in Appendix A.

 

Code Demo

Here's the source code listing for a simple form, which reads a text file named "data.txt" found in its directory.

Data.txt contains the same text as mentioned above, which you put in the data pane of Expresso.

The form reads the file, then uses the Regex class to get a MatchCollection object, which is a collection of Match objects.

Each Match object contains the value of the string that was matched.


    
      
    

using System;
using System.Drawing;
using System.Collections;
using System.ComponentModel;
using System.Windows.Forms;
using System.Data;
//we add the following using directives
using System.Text.RegularExpressions;
using System.IO;
using System.Text;
.
.
.
            //This is the method we will call on form load
            private void ParseData()
            {
                  listBox1.Items.Clear();

                  //Read the data.txt file to parse for emails
                  string data = GetDataFromFile();

                  //the pattern used to parse for emails
                  //notice the '@' before the string literal: it makes this a verbatim
                  //string, so the backslashes are not treated as C# escape sequences
                  string expression = @"\w+@\w+\.\w+((\.\w+)*)?";

                  //get a collection of successful matches
                  //from Regex using the specified pattern on the input data
                  MatchCollection matches = Regex.Matches(data, expression);

                  //Iterate through the matches, adding them to the list box
                  foreach (Match singleMatch in matches)
                  {
                        listBox1.Items.Add(singleMatch.Value);
                  }
            }

            private string GetDataFromFile()
            {
                  //Use the Path class to build a file path from a given folder and a file name
                  string filename = Path.Combine(Application.StartupPath, "data.txt");

                  //open the text file for reading; the using block releases
                  //the file handle even if reading throws an exception
                  using (StreamReader reader = File.OpenText(filename))
                  {
                        //return all the text in the file
                        return reader.ReadToEnd();
                  }
            }
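The hierarchical results seen earlier in Expresso come from capture groups, which the Match object exposes through its Groups property. As a rough sketch outside the form, here is how named groups could split an address into its parts. The class name, method name, the group names “user” and “domain” and the sample sentence are my own illustrative choices, not part of the listing above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Finds the first email-like string in the text and returns
    // { user, domain }, or null when nothing matches.
    public static string[] SplitAddress(string text)
    {
        // (?<name>...) captures the enclosed part under the given name.
        Match m = Regex.Match(text, @"(?<user>\w+)@(?<domain>\w+(\.\w+)+)");
        if (!m.Success)
        {
            return null;
        }
        return new string[] { m.Groups["user"].Value, m.Groups["domain"].Value };
    }

    public static void Main()
    {
        string[] parts = SplitAddress("Write to junk@address.co.au today");
        if (parts != null)
        {
            Console.WriteLine("user:   " + parts[0]);
            Console.WriteLine("domain: " + parts[1]);
        }
    }
}
```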

Conclusion

Using regular expressions and the Regex object is very simple and very powerful. Once you learn it, you'll use it all over the place: in search dialogs, in code and sometimes even in search engines. In future articles I will demonstrate more elaborate uses of regular expressions, and how they fit into an overall practical solution.

 

 

 

Appendix A

Here's a simple table displaying the syntax of a regular expression and any escape characters that can be used.

(You can find more on the syntax of Regex here.)

A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern to the string being searched.

The following table contains the complete list of metacharacters and their behavior in the context of regular expressions:

Character

Description

\

Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".

^

Matches the position at the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position following '\n' or '\r'.

$

Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'.

*

Matches the preceding character or subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.

+

Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches the "do" in "do" or "does". ? is equivalent to {0,1}.

{n}

n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food".

{n,}

n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}

m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.

?

When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all 'o's.

.

Matches any single character except "\n". To match any character including the '\n', use a pattern such as '[\s\S]'.

(pattern)

Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0 through $9 properties in JScript. To match parentheses characters ( ), use '\(' or '\)'.

(?:pattern)

Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.

(?=pattern)

Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.

(?!pattern)

Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.

x|y

Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".

[xyz]

A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".

[^xyz]

A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".

[a-z]

A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'.

[^a-z]

A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.

\b

Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B

Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".

\cx

Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character.

\d

Matches a digit character. Equivalent to [0-9].

\D

Matches a nondigit character. Equivalent to [^0-9].

\f

Matches a form-feed character. Equivalent to \x0c and \cL.

\n

Matches a newline character. Equivalent to \x0a and \cJ.

\r

Matches a carriage return character. Equivalent to \x0d and \cM.

\s

Matches any white space character including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v].

\S

Matches any non-white space character. Equivalent to [^ \f\n\r\t\v].

\t

Matches a tab character. Equivalent to \x09 and \cI.

\v

Matches a vertical tab character. Equivalent to \x0b and \cK.

\w

Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'.

\W

Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'.

\xn

Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". Allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters.

\n

Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).

\nm

Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exist, \nm matches octal escape value nm when n and m are octal digits (0-7).

\nml

Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).

\un

Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
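A few of the entries above are easier to grasp when run. The sketch below tries the table's own examples for greedy and non-greedy quantifiers, positive lookahead, backreferences and word boundaries against the .NET Regex class. The class and helper names are hypothetical, and the sample strings "hello" and "never verb" are my own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // Returns the text of the first match, or null when there is none.
    public static string First(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    // Returns the position of the first match, or -1 when there is none.
    public static int FirstIndex(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        // Greedy versus non-greedy: as much as possible versus as little as possible.
        Console.WriteLine(First("oooo", "o+"));
        Console.WriteLine(First("oooo", "o+?"));

        // Positive lookahead: the version number is checked but not consumed.
        // The matched text still includes the space written in the pattern.
        string lookahead = "Windows (?=95|98|NT|2000)";
        Console.WriteLine(First("Windows 2000", lookahead) != null);
        Console.WriteLine(First("Windows 3.1", lookahead) != null);

        // Backreference: \1 repeats whatever the first group captured.
        Console.WriteLine(First("hello", @"(.)\1"));

        // Word boundary: finds the "er" that ends "never".
        Console.WriteLine(FirstIndex("never verb", @"er\b"));
    }
}
```

Verbatim strings (the '@' prefix) are used wherever the pattern contains a backslash, for the same reason as in the form listing.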