Roy Osherove


Q&A - Greedy matching in regular expressions

This came in the mail, thought other folks might be interested.

Hi Roy. I need to check a line of HTML and make the value of the style attribute lowercase. I've tried to come up with a regex that will work, but I keep making the entire line of HTML lowercase instead of just the stuff in the style value. I can't get the match to end with the correct quote; instead it goes to the last quote on the line. So something like this:

[Tag style="WIDTH:20px; color:blue;" href="blah.com/PageTWO"]
I want to change to this:
[Tag style="width:20px; color:blue;" href="blah.com/PageTWO"]

But instead I get this:
[Tag style="width:20px; color:blue;" href="blah.com/pagetwo"]

Because the match ends with the end quote of the href.


If you can point me in the right direction (or have something like this lying around), I would GREATLY appreciate it.

Answer:

It's called "greedy matching": by default a quantifier grabs as much text as it can, so the match runs to the *last* quote on the line.
Try adding a "?" after the quantifier (probably '*'). That makes it lazy, so the match ends at the *first* closing quote.
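
Applied to the line from the question, that might look like the sketch below. The question doesn't say which regex engine is being used, so Python's re module is an assumption here (the "*?" syntax is the same in .NET), and the pattern and the lowercase_style helper are made up for illustration:

```python
import re

line = '[Tag style="WIDTH:20px; color:blue;" href="blah.com/PageTWO"]'

# Lazy: .*? stops at the first closing quote after style="
STYLE_LAZY = re.compile(r'(style=")(.*?)(")')

# Greedy, shown only for contrast: .* runs to the last quote on the line
STYLE_GREEDY = re.compile(r'(style=")(.*)(")')


def lowercase_style(html_line, pattern=STYLE_LAZY):
    """Lowercase only the text captured between the quotes (hypothetical helper)."""
    return pattern.sub(
        lambda m: m.group(1) + m.group(2).lower() + m.group(3),
        html_line,
    )


fixed = lowercase_style(line)
too_much = lowercase_style(line, STYLE_GREEDY)

print(fixed)     # only the style value is lowercased
print(too_much)  # the href is lowercased as well
```

Another common way to get the same effect is to replace ".*?" with [^"]* so the match cannot cross a quote at all.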

For example, given the following string as input:
"abcdfgdrbdtargd"
The following regex (greedy by default) will match everything up to the last 'd':
(.*d)

However, this regex will find several matches, the first of which is "abcd":
(.*?d)
(You can do without the parentheses if you want.)
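
To see the difference on that sample string, here is a quick check. It uses Python's re module, which is an assumption on my part since any engine with lazy quantifiers should behave the same way for these two patterns:

```python
import re

text = "abcdfgdrbdtargd"

# Greedy: .* takes as much as it can, then backs up to the last 'd'
greedy = re.findall(r"(.*d)", text)

# Lazy: .*? takes as little as it can, so each match stops at the next 'd'
lazy = re.findall(r"(.*?d)", text)

print(greedy)  # ['abcdfgdrbdtargd']
print(lazy)    # ['abcd', 'fgd', 'rbd', 'targd']
```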

I'd also suggest adding two good regex mailing lists to your arsenal instead of sending help messages to various people:
http://groups.yahoo.com/group/dotnetregex/
http://lists.aspadvice.com/SignUp/list.aspx?l=68&c=16

There are people there who know a whole lot more than I do about regular expressions.