Strip HTML tags from a string using regular expressions

Paschal asked me to find a simple way to strip HTML tags from a given string using regular expressions.

The solution is quite simple:

1. Retrieve all the HTML tags using this pattern: <(.|\n)*?>

2. Replace them with an empty string and return the result

Here's a C# function that does this:

// Requires: using System.Text.RegularExpressions;
private string StripHTML(string htmlString)
{
    // This pattern matches every HTML tag, from '<' to the nearest '>'.
    // (.|\n) -> any character or a newline (a plain '.' does not match '\n' by default)
    // *?     -> zero or more occurrences, non-greedy: the match stops at the
    //           first available '>' it sees, not at the last one.
    // (A greedy match would run from the first '<' to the last '>' and
    // swallow the text and tags in between as one big match.)
    // Thanks to Oisin and Hugh Brown for helping on this one...
    string pattern = @"<(.|\n)*?>";

    return Regex.Replace(htmlString, pattern, string.Empty);
}

Or with just one line of code:

string stripped = Regex.Replace(textBox1.Text,@"<(.|\n)*?>",string.Empty);
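To see why the non-greedy `*?` matters, here is a small sketch that runs the same pattern next to a greedy variant. The class name `StripHtmlDemo`, the helper names `StripLazy` and `StripGreedy`, and the sample input are made up for illustration; only the lazy pattern comes from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Same non-greedy pattern as above: each match ends at the nearest '>'.
    public static string StripLazy(string html) =>
        Regex.Replace(html, @"<(.|\n)*?>", string.Empty);

    // Greedy variant, for contrast only: one match runs from the
    // first '<' all the way to the last '>' in the input.
    public static string StripGreedy(string html) =>
        Regex.Replace(html, @"<(.|\n)*>", string.Empty);

    public static void Main()
    {
        // Hypothetical sample input; note the tag that spans a line break.
        string html = "<p>Hello <b>world</b>,\n<a\nhref=\"#\">link</a></p>";

        // Tags are removed one by one, the text between them survives.
        Console.WriteLine(StripLazy(html));

        // The whole input sits between the first '<' and the last '>',
        // so the greedy version removes the text as well.
        Console.WriteLine(StripGreedy(html).Length);
    }
}
```

With this input the lazy version leaves "Hello world," and "link" on two lines, while the greedy version leaves nothing at all, which is the behavior the comments in the function warn about.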


May 13 Strip HTML tags from a string using regular expressions