Roy Osherove

Practical Parsing Using Groups in Regular Expressions

 

Note: This article is part 2 in a series of articles.

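What we’ll cover:

  • What regular expression groups are, and why you need them
  • Naming groups and reading them from code
  • Handling multiple matches in a given string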

 

What you’ll need:

  • If you don’t know what regular expressions are, then read this article.
  • eXpresso (if you are not familiar with eXpresso, see this article)
  • .Net Framework SDK (Visual Studio .Net preferred)

 

What are regular expression groups and why do I need them?

In order to explain that, let’s take a look at a simple example.

Open eXpresso. Clear both the data pane, and the expression pane.

Next, put the following string in the data pane:

 

My birthday is 17/05/1975. Thank you.

 

We would like to find and extract any dates that appear in this string.

Try coming up with an expression that matches this by yourself.

If you want it the easy way, here’s an expression that’ll work:

\d{2}/\d{2}/\d{4}

 

This expression expects 2 digits, then a slash, then 2 more digits followed by a slash followed by 4 digits.

Granted, you could make this expression a bit more flexible and efficient but let’s keep it simple for the purpose of the article.
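
If you’d rather try the expression from code than from eXpresso, here is a minimal sketch. The class and method names (DateFinder, TryFindFirstDate) are made up for this illustration; only Regex.Match, Match.Success and Match.Value come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateFinder
{
    // Looks for the first "dd/dd/dddd" substring in the input.
    // Returns true and the matched text if one is found;
    // otherwise returns false and an empty string.
    public static bool TryFindFirstDate(string input, out string date)
    {
        Match match = Regex.Match(input, @"\d{2}/\d{2}/\d{4}");

        // Match.Value is an empty string when the match failed
        date = match.Value;
        return match.Success;
    }
}
```

Calling DateFinder.TryFindFirstDate with the sample sentence above should hand back just the date portion, which is the same text eXpresso highlights.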

 

Now, when you press the “Find Matches” button, you’ll get the string that matches the specified date. Good.

However, in the real world, we would use this date inside our code. Let’s say we have a function that receives this date and needs the day, month and year as separate values in order to do various tasks with them.

One solution would obviously be to use the DateTime Class found in the framework to parse this string. However, let’s try doing it another way.
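
For comparison, the DateTime route might look something like the sketch below. It assumes the input is the bare date in day/month/year order, not the whole sentence; the class and method names are only illustrative.

```csharp
using System;
using System.Globalization;

public static class DateTimeSketch
{
    // Parses a bare "dd/MM/yyyy" string such as "17/05/1975".
    // Throws FormatException if the text is not in that exact format.
    public static DateTime ParseDayMonthYear(string text)
    {
        // InvariantCulture keeps "/" as the date separator on every machine
        return DateTime.ParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture);
    }
}
```

The Day, Month and Year properties of the returned DateTime then give you the separate values. Note that this only helps once you have already isolated the date from the surrounding text.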

 

Assuming we don’t have the ability to use regular expressions, we would have to use standard text parsing functionality in order to determine the location of the first and second slashes, and then retrieve the strings that exist between them.
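
To get a feel for that manual approach, here is a rough sketch using nothing but String.IndexOf and String.Substring. It assumes the input is already the bare date (finding the date inside a longer sentence would take even more code), and the names ManualDateSplitter and SplitDate are hypothetical.

```csharp
using System;

public static class ManualDateSplitter
{
    // Splits a bare "dd/MM/yyyy" string into { day, month, year }
    // by locating the two slashes by hand.
    public static string[] SplitDate(string date)
    {
        int firstSlash = date.IndexOf('/');
        int secondSlash = date.IndexOf('/', firstSlash + 1);

        if (firstSlash < 0 || secondSlash < 0)
        {
            throw new FormatException("Expected two '/' separators.");
        }

        string day = date.Substring(0, firstSlash);
        string month = date.Substring(firstSlash + 1, secondSlash - firstSlash - 1);
        string year = date.Substring(secondSlash + 1);

        return new[] { day, month, year };
    }
}
```

Even this small version has index arithmetic that is easy to get wrong, and it does not check that the pieces are actually digits.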

This mundane task can be avoided easily. The solution is simple, and very powerful.

The Regex Object model allows us to define Named groups within the specified regular expression.

These groups are exposed using the “Match.Groups” property.

We can get each group by name, and get its value using the Group.Value property.

 

Let’s modify the expression so that it divides the parsed date into sub-groups.

To add a group to an expression, simply enclose the part of the expression you want to capture in parentheses.

For example, to capture the day section of the date, we would do this:

(\d{2})/\d{2}/\d{4}

 

Let’s take a look at the expression after we have divided all 3 desired sub-groups:

(\d{2})/(\d{2})/(\d{4})

 

Paste the last expression into eXpresso and press “Find Matches”.

You’ll see that you get the same output as before, but something is different: there’s a little “+” mark on the left.

Click on the “+” to expand the result. You’ll see that there are 3 sub groups below this global result.

These sub groups each represent a “Group” object, which is part of the received “Match” Object.

This tree view represents perfectly the hierarchical relationship of the object model.

 

Notice, though, that the groups are not named yet. By default, they are indexed by numbers starting from 1 (index 0 holds the entire match).

This means that you can call each group in code by specifying its index. Let’s name those groups to make them easier to call in code.
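
Before we name them, here is a small sketch of what calling the groups by index looks like in code. The wrapper class and method (NumberedGroups, GetParts) are invented for the example; Match.Groups and Group.Value are the real framework members.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberedGroups
{
    // Returns { day, month, year } from the first date found,
    // or an empty array if the input contains no date.
    public static string[] GetParts(string input)
    {
        Match match = Regex.Match(input, @"(\d{2})/(\d{2})/(\d{4})");
        if (!match.Success)
        {
            return new string[0];
        }

        // Groups[0] is the whole match; the parenthesized groups start at 1
        return new[]
        {
            match.Groups[1].Value,
            match.Groups[2].Value,
            match.Groups[3].Value
        };
    }
}
```

This works, but the numbers say nothing about what each group holds, and they shift if you add another pair of parentheses earlier in the expression.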

The syntax to name a group is simply to add the following after the opening parenthesis of the sub-group:

?<GroupName>

 

Note: The name you provide is case-sensitive!

Let’s see how the final expression looks after naming the groups:

(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})

 

Pretty easy, right?

Paste this expression into eXpresso and click “Find Matches”.

You’ll get an expandable result again, but this time you’ll have names instead of index numbers.

Now you’ll be able to call each group’s value by just using the group’s name.

 

Simple Code Demo

    // Requires: using System.Text.RegularExpressions; and using System.Windows.Forms;

    // This function receives a string containing a date. It parses the date
    // and prints the day, month and year of that date.
    private void ParseDate(string date)
    {
        // This is the pattern we'll use to match the date
        // and then divide it into sub-groups
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve the parsed Match object using the Regex class
        Match DateMatch = Regex.Match(date, pattern);

        // Make sure there's actually a date in the string.
        // We get a Match object either way,
        // so we have to test its 'Success' property.
        if (!DateMatch.Success)
        {
            MessageBox.Show("Could not find a date inside the string");
            return;
        }

        // Print the value of the whole match
        listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

        // Get each sub-group by name and print its value.
        // Notice that each group is a sub-member of the match we received.
        // Notice that the names are case-sensitive!
        listBox1.Items.Add("Day: " + DateMatch.Groups["Day"].Value);
        listBox1.Items.Add("Month: " + DateMatch.Groups["Month"].Value);
        listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);
    }

 

 

As you can see, the functionality is pretty straightforward.

To read a sub-group of the match result, we simply call “Match.Groups[GroupName].Value”.
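
Since the names are case-sensitive, it’s worth seeing what happens when you get the case wrong. In the sketch below (the class and method names are made up), asking for a name that isn’t defined in the expression doesn’t throw; you get back a Group whose Success is false and whose Value is an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNameCase
{
    // Returns true only if the named group exists in the pattern
    // and captured something in the input.
    public static bool HasGroupValue(string input, string groupName)
    {
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";
        Match match = Regex.Match(input, pattern);

        // An unknown name (for example "day" instead of "Day")
        // yields a Group with Success == false
        return match.Groups[groupName].Success;
    }
}
```

So a misspelled group name fails silently with an empty value, which makes checking Group.Success a cheap safeguard when a result looks suspiciously blank.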

 

Using Multiple Matches from a given string

To let you reach all the matches found in a string, the Match object has a “NextMatch()” function, which returns a new Match object for the next match.

You’ll need to test its Success value again. All you need to do is keep going until Match.Success is false.

 

The Hard Way

Here’s the same method from before, implemented to go through all the matches:

 

    // This function receives a string containing dates. It parses each date
    // and prints the day, month and year of that date.
    private void ParseDate(string date)
    {
        // The pattern we'll use to match the date
        // and divide it into sub-groups
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve the parsed Match object using the Regex class
        Match DateMatch = Regex.Match(date, pattern);

        // Make sure there's actually a date in the string.
        // We get a Match object either way,
        // so we have to test its 'Success' property.
        if (!DateMatch.Success)
        {
            MessageBox.Show("Could not find a date inside the string");
            return;
        }

        // Iterate through all the matches and print them
        while (DateMatch.Success)
        {
            // Print the value of the whole match
            listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

            // Get each sub-group by name and print its value.
            // Notice that each group is a sub-member of the match we received.
            // Notice that the names are case-sensitive!
            listBox1.Items.Add("Day: " + DateMatch.Groups["Day"].Value);
            listBox1.Items.Add("Month: " + DateMatch.Groups["Month"].Value);
            listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);

            // Move on to the next match (Success is false when there are no more)
            DateMatch = DateMatch.NextMatch();
        }
    }

 

Handling multiple matches – the simple way

The last example was a bit cumbersome. There’s another way to go through all the matches:

Simply use the Regex.Matches() function instead of the Regex.Match() function.

Then iterate over each match. You don’t even have to check for success, since you’ll only receive successful matches.

 

    // This function receives a string containing dates. It parses each date
    // and prints the day, month and year of that date.
    private void ParseDate(string date)
    {
        // The pattern we'll use to match the date
        // and divide it into sub-groups
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve all the matches using the Regex class
        MatchCollection DateMatches = Regex.Matches(date, pattern);

        // Notice there is no need to check for success here.
        // We only get successful matches from this function.

        // Iterate through all the matches and print them
        foreach (Match DateMatch in DateMatches)
        {
            // Print the value of the current match
            listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

            // Get each sub-group by name and print its value.
            // Notice that each group is a sub-member of the match we received.
            // Notice that the names are case-sensitive!
            listBox1.Items.Add("Day: " + DateMatch.Groups["Day"].Value);
            listBox1.Items.Add("Month: " + DateMatch.Groups["Month"].Value);
            listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);
        }
    }
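
If you want values to work with instead of list box entries, the same loop can build DateTime objects from the named groups. This is only a sketch under a few assumptions: the dates are in day/month/year order, impossible dates such as 31/02/2000 should simply be skipped, and the names DateExtractor and ExtractDates are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateExtractor
{
    // Collects every valid dd/MM/yyyy date found in the text.
    public static List<DateTime> ExtractDates(string text)
    {
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";
        List<DateTime> dates = new List<DateTime>();

        foreach (Match match in Regex.Matches(text, pattern))
        {
            int day, month, year;

            // The pattern only guarantees digits, not a real calendar date,
            // so validate the numbers before building a DateTime
            bool valid =
                int.TryParse(match.Groups["Day"].Value, out day) &&
                int.TryParse(match.Groups["Month"].Value, out month) &&
                int.TryParse(match.Groups["Year"].Value, out year) &&
                year >= 1 &&
                month >= 1 && month <= 12 &&
                day >= 1 && day <= DateTime.DaysInMonth(year, month);

            if (valid)
            {
                dates.Add(new DateTime(year, month, day));
            }
        }

        return dates;
    }
}
```

The regular expression finds the candidates and splits them into parts; deciding whether a candidate is a real date is left to ordinary code.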

 

Conclusion

Using groups with regular expressions is a powerful tool to add to your parsing arsenal. Regex has many other abilities, but this one is probably the most important, since most of its more advanced abilities rely on this functionality.

In future articles, perhaps, I will explain more in-depth possibilities of regular expressions in the .Net Framework.