Paschal asked me to find a simple way to strip HTML tags from a given string using regular expressions.
The solution is quite simple:
1. Retrieve all the HTML tags using this pattern: <(.|\n)*?>
2. Replace them with an empty string and return the result
Here's a C# function that does this:
// Requires: using System.Text.RegularExpressions;
private string StripHTML(string htmlString)
{
    // This pattern matches everything found inside HTML tags:
    // (.|\n) -> any character, or a newline (which '.' does not match by default)
    // *?     -> 0 or more occurrences, non-greedy, meaning that the match
    //           stops at the first available '>' it sees, and not at the last one
    //           (if it stopped at the last one, a single match would run from the
    //           first '<' to the final '>' and swallow the text and tags in between)
    // Thanks to Oisin and Hugh Brown for helping on this one...
    string pattern = @"<(.|\n)*?>";
    return Regex.Replace(htmlString, pattern, string.Empty);
}
Or with just one line of code:
string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);
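
To try the pattern outside of a form, here is a minimal console sketch. The class name StripHtmlDemo and the sample strings are my own assumptions for illustration. The pattern and the Regex.Replace call are the same as above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name and the sample inputs are illustrative only.
public static class StripHtmlDemo
{
    // Same pattern as above: '<', then any characters (including newlines),
    // matched lazily up to the first '>'.
    private const string TagPattern = @"<(.|\n)*?>";

    public static string StripHTML(string htmlString)
    {
        return Regex.Replace(htmlString, TagPattern, string.Empty);
    }

    public static void Main()
    {
        // Tags nested inside other tags are each removed separately.
        Console.WriteLine(StripHTML("<p>Hello <b>world</b>!</p>"));   // Hello world!

        // A tag that spans two lines is still matched, thanks to the \n alternative.
        Console.WriteLine(StripHTML("<a\nhref=\"#\">link</a> text")); // link text
    }
}
```

The second call is the case that (.|\n) is there for: with a plain . and default options, a tag broken across lines would not match and would be left in the output.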