@RoyOsherove

Sunday, Sep 28, 2003

[New article]: Creating a generic Site-To-RSS tool (site scraping)

It's been a long and very interesting week for me. I know it's weird hearing that about the week I start looking for a new job, but it's true. I've been getting lots of interesting offers from people, especially from the weblog scene. I'm still looking, though, so keep 'em coming. If you want me - tell me! (hint, Microsoft!)

The past week has given me some time to relax and start doing the things I really love - late night coding and discoveries. That's exactly the case with the article I'm presenting here. I was sitting at my machine, browsing .NetWire, when I suddenly thought, “Gee, I wish they had an RSS feed!” Then I started playing around with some regular expressions using eXpresso, and it dawned on me that it would be really easy to engineer a generic solution for such a problem. So I did. And here's the product of this work: the longest and most detailed article I've ever done (about 5,000 words - a personal record). Yeah, length doesn't matter, content does, but there's a lot of content in there. This really is something I plan to pursue further. Thanks to Mike Gunderloy for helping me out with a nasty bug I couldn't figure out. You saved me, buddy!

Anyway, I'll stop rambling and give you the juice:

Creating a generic Site-To-RSS tool

Summary:

I’ll show how to use regular expressions to parse a web page’s HTML text into manageable chunks of data. That data will be converted and written as an RSS feed for the whole world to consume. Finally, I’ll show how to create a generic tool that allows you to automatically generate an RSS feed from any given website, given a small group of parameters. At the end of the day we will have a working RSS feed for www.DotNetWire.com.
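To give a rough feel for the idea before you click through, here is a minimal sketch of the scrape-then-publish flow. It is written in Python rather than the .NET code from the article, and it is not the article's implementation: the sample markup, the regular expression, and the names `SAMPLE_HTML`, `ITEM_PATTERN`, `scrape_items` and `build_rss` are all made up for illustration. The only real assumption about the approach is the one the summary states: a regular expression cuts the page into chunks, and those chunks are written out as RSS items.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample markup; a real tool would download the page first.
SAMPLE_HTML = """
<div class="item"><a href="http://example.com/a">First headline</a><p>Summary one.</p></div>
<div class="item"><a href="http://example.com/b">Second headline</a><p>Summary two.</p></div>
"""

# The per-site "parameter": a pattern whose named groups map onto RSS item fields.
# This pattern is an assumption that only matches the sample markup above.
ITEM_PATTERN = re.compile(
    r'<div class="item">\s*<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>'
    r'\s*<p>(?P<description>[^<]*)</p>',
    re.IGNORECASE,
)


def scrape_items(html, pattern):
    """Return one dict per regex match, keyed by the pattern's group names."""
    return [match.groupdict() for match in pattern.finditer(html)]


def build_rss(title, link, description, items):
    """Serialize the channel info and scraped items as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item.get(tag, "")
    return ET.tostring(rss, encoding="unicode")


items = scrape_items(SAMPLE_HTML, ITEM_PATTERN)
feed = build_rss("Example feed", "http://example.com/", "Scraped headlines", items)
print(feed)
```

Keeping the pattern separate from the two functions is what makes the sketch "generic": pointing it at another site would, in principle, only mean swapping in a different pattern and channel details.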

 

Read it. Have fun. Be productive.

Disclaimer: Site scraping is for personal use only! They already have an RSS feed here (which was mentioned to me only after the article was finished).

P.S.

Just realized this is the 200th article on .NetWeblogs! Is there a prize? :)

Update:

Well, ScottW posted that:

1. They already have a feed (so the joke is on me!)

2. Scraping a site is stealing page views, so do it for personal pleasure only!
