
[New article]: Creating a generic Site-To-RSS tool (site scraping)

It's been a long and very interesting week for me. I know that's weird to hear in the week I start looking for a new job, but it's true. I've been getting lots of interesting offers from people, especially from the weblog scene. I'm still looking, though, so keep 'em coming. If you want me, tell me! (Hint, Microsoft!)

The past week has given me some time to relax and start doing the things I really love: late-night coding and discoveries. That's exactly the case with the article I'm presenting here. I was sitting at my machine, browsing .NetWire, when I suddenly thought, “Gee, I wish they had an RSS feed!” Then I started playing around with some regular expressions using eXpresso, and it dawned on me that it would be really easy to engineer a generic solution for such a problem. So I did. And here's the product of this work: the longest and most detailed article I've ever done (about 5,000 words, a personal record). Yeah, length doesn't matter, content does, but there's a lot of content in there. This really is something I plan to pursue further. Thanks to Mike Gunderloy for helping me out with a nasty bug I couldn't figure out. You saved me, buddy!

Anyway, I'll stop rambling and give you the juice:

Creating a generic Site-To-RSS tool


I’ll show how to use regular expressions to parse a web page’s HTML text into manageable chunks of data. That data will be converted and written as an RSS feed for the whole world to consume. Finally, I’ll show how to create a generic tool that allows you to automatically generate an RSS feed from any given website, given a small group of parameters. At the end of the day we will have a working RSS feed for .
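To give a rough feel for the idea before you click through, here is a minimal sketch in Python (not the code from the article). It assumes the page's items can be matched by one regular expression with named groups `link`, `title` and `description`; the function name `site_to_rss`, the sample HTML and the pattern are all made up for illustration.

```python
import html
import re
from xml.sax.saxutils import escape


def clean(text):
    """Decode HTML entities, trim whitespace, then escape for XML output."""
    return escape(html.unescape(text or "").strip())


def site_to_rss(page_html, item_pattern, feed_title, feed_link):
    """Turn a page's HTML into an RSS 2.0 string.

    item_pattern is a regex with the named groups 'link' and 'title',
    plus an optional 'description' group. These group names are an
    assumption of this sketch, not a standard.
    """
    regex = re.compile(item_pattern, re.IGNORECASE | re.DOTALL)
    items = []
    for match in regex.finditer(page_html):
        fields = match.groupdict()
        title = clean(fields.get("title"))
        link = clean(fields.get("link"))
        description = clean(fields.get("description"))
        items.append(
            "<item>"
            f"<title>{title}</title>"
            f"<link>{link}</link>"
            f"<description>{description}</description>"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{clean(feed_title)}</title>"
        f"<link>{clean(feed_link)}</link>"
        f"<description>{clean(feed_title)}</description>"
        + "".join(items)
        + "</channel></rss>"
    )


# Hypothetical page fragment and pattern, standing in for a real site.
SAMPLE_HTML = """
<div class="story"><a href="http://example.com/a">First story</a><p>Summary one</p></div>
<div class="story"><a href="http://example.com/b">Second story</a><p>Summary two</p></div>
"""
ITEM_PATTERN = (
    r'<a href="(?P<link>[^"]+)">(?P<title>.*?)</a>'
    r'\s*<p>(?P<description>.*?)</p>'
)

feed = site_to_rss(SAMPLE_HTML, ITEM_PATTERN, "Example feed", "http://example.com/")
print(feed)
```

The "small group of parameters" here is just the pattern plus the feed title and link, so pointing the same function at a different site means writing a new pattern rather than new code. A regex tied to a page's markup breaks as soon as the site changes its layout, so treat the pattern as something you will need to maintain.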


Read it. Have fun. Be productive.

Disclaimer: Site scraping is for personal use only! They already have an RSS feed here (which was mentioned to me only after the article was finished).


Just realized this is the 200th article on .NetWeblogs! Is there a prize? :)



Well, ScottW posted that:

1. They already have a feed (so the joke is on me!)

2. Scraping a site is stealing page views, so do it for personal pleasure only!

