I’m not much for thinking about the various RSS standards, XML schemas and so on, so usually all the stuff that Sam Ruby writes about those matters blurs on me real fast. Maybe it’s because I’ve been so caught up with Feedable and trying to find new and exciting ways to scrape a site’s contents, but one of his latest posts struck a chord with me. A chord which says “Hey, I have an idea and it’s time someone laughed at it!” So here I am, writing it down.
So basically, what Sam talks about in there is how he and some of the guys were thinking about providing some sort of a “Feed Discovery Markup Language” (FDML? Ugh, sounds awful, but what do I know?) for sites that wish to describe their various syndication channels (RSS feeds, for now).
I wanted to take a look at it from a different angle. Let’s turn this whole thing around.
Looking at the subject matter from afar, we can see several basic “entities” in this space:
· Data Provider: The site or organization that would like to expose data to the outside world.
· Data: This is the data that the site wants to expose. Currently it is represented as HTML.
· View: This is a view of the data. A browser provides a view of the provider’s data. An RSS aggregator is yet another view of that same data.
Now, you could say that HTML is a sort of a “view” on the data, and that RSS is a different view on that data. What Sam is saying is “Let’s create a method for describing the RSS view that the provider exposes”. Looking at how Feedable works:
- Feedable treats the site’s HTML, one of the views, as actual data.
- It looks at the “data” with one simple variable in place: a regular expression that describes how to extract a different “view” from that “data”.
- It then creates that new view using that description, and what we get at the end is:
- HTML is “data” with two unique “views” on it: the HTML itself, and the RSS feed that was generated from it.
Notice that at no point is there a need for the site to generate yet another “view” of its underlying data. The site creates no RSS feed. In fact, the task of creating an RSS feed is left solely to the client: the Feedable site.
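To make the “HTML as data” step concrete, here’s a rough sketch of what that regular-expression extraction might look like. This is not Feedable’s actual code; the sample markup, the pattern and the `extract_items` helper are all made up for illustration.

```python
import re

# Hypothetical markup standing in for a site's HTML "view".
HTML = """
<div class="post"><h2>First post</h2><span class="date">2004-01-05</span><p>Hello world.</p></div>
<div class="post"><h2>Second post</h2><span class="date">2004-01-06</span><p>More text.</p></div>
"""

# Illustrative pattern: each named group is one field we want to pull out.
ITEM_PATTERN = re.compile(
    r'<div class="post"><h2>(?P<title>.*?)</h2>'
    r'<span class="date">(?P<date>.*?)</span>'
    r'<p>(?P<data>.*?)</p></div>',
    re.DOTALL,
)


def extract_items(html, pattern=ITEM_PATTERN):
    """Return one dict per match, keyed by the pattern's named groups."""
    return [match.groupdict() for match in pattern.finditer(html)]


items = extract_items(HTML)
```

Each entry in `items` is a plain dictionary with `title`, `date` and `data` keys, which is all a client needs to build whatever second “view” it likes.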
So, what would happen if we decided the following?
· Instead of generating an RSS feed, each site would actually generate a small, static description file
o The file will hold the regular expression that is required to parse its “contents page”
§ The file (regular expression) will expose a known set of “fields” and how to “scrape” them from that page. Possible fields: “Date, data, title..” and so on, mainly based on what an RSS feed shows now. Comments could also be described, but I think it might get a little complicated at this stage, and I’ll admit I hadn’t thought much past the first stage.
§ The file will also expose the address of the page which is needed for scraping
§ The file will be the only source that a reader will need in order to start getting that information
o The contents page will be the page on the site that holds the relevant material that the site wants to expose both as HTML and as any other protocol
o Each aggregator will be responsible for generating the RSS feed given the site’s data description file
o In fact – if we know how to get the raw data – RSS is only one option. We can easily transform this raw data into anything else.
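Here’s a minimal sketch of how the pieces above could fit together, under a pile of assumptions: the description file is shown as JSON, and its key names (`title`, `page`, `pattern`), the sample page and the `scrape` and `to_rss` helpers are all invented for illustration. The RSS output is deliberately simplified; for one thing, dates are passed through as-is rather than converted to the RFC 822 format that RSS 2.0 expects.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical description file a site might publish. Key names are invented.
DESCRIPTION = json.loads(r'''
{
  "title": "Example site",
  "page": "http://example.com/index.html",
  "pattern": "<h2>(?P<title>.*?)</h2>\\s*<em>(?P<date>.*?)</em>\\s*<p>(?P<data>.*?)</p>"
}
''')


def scrape(description, html):
    """Apply the site's published pattern to the downloaded page text."""
    pattern = re.compile(description["pattern"], re.DOTALL)
    return [match.groupdict() for match in pattern.finditer(html)]


def to_rss(description, items):
    """Build a simplified RSS 2.0 document from the scraped fields."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = description["title"]
    ET.SubElement(channel, "link").text = description["page"]
    ET.SubElement(channel, "description").text = "Generated from " + description["page"]
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        # A real feed would convert this to an RFC 822 date.
        ET.SubElement(node, "pubDate").text = item["date"]
        ET.SubElement(node, "description").text = item["data"]
    return ET.tostring(rss, encoding="unicode")


# Stand-in for the text an aggregator would download from DESCRIPTION["page"].
SAMPLE_HTML = "<h2>Hello</h2> <em>2004-01-05</em> <p>First entry</p>"

scraped = scrape(DESCRIPTION, SAMPLE_HTML)
feed = to_rss(DESCRIPTION, scraped)
```

The site publishes only the small description; the `scrape` step hands back raw fields, and `to_rss` is just one of many things a client could do with them.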
- A site will only have to create a small description file (low expense and investment), and rely on clients to parse its data correctly according to the rules specified in that file.
- Protocol independent: we don’t need to know RSS, XML or whatnot to get the information we need. We get it raw using regular expressions on simple downloaded text. The KISS principle at work.
- That description file can grow to include whatever information we need – such as multiple data sources and so on.
- Describing comments, or anything that is “not right there”, might be difficult. But there may be some elegant solution to this that I just can’t think of at 4:06 AM.
- You are limited to what’s in your HTML.
- If your HTML “patterns” change, so must the regular expression. Considering various CSS styles and dynamic HTML, this might be a problem, but like everything else with standards, it relies on things being done “a certain way” to work.
- You are coupled to one view of your data. This is a bit of a rehash of the previous point, but I think it shows just how much this breaks object-oriented rules. RSS can be made as just another “view” on data, and a “view” should not depend on other views. Here, it does exactly that.
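The point about changing HTML patterns is easy to demonstrate. In this small sketch (the markup and the pattern are both hypothetical), the same content restyled with different tags silently yields nothing:

```python
import re

# Pattern written against the site's original markup (illustrative only).
PATTERN = re.compile(r'<h2 class="title">(?P<title>.*?)</h2>')

OLD_HTML = '<h2 class="title">Hello</h2>'
NEW_HTML = '<h3 class="entry-title">Hello</h3>'  # same content, restyled markup


def titles(html):
    """Return every title the pattern can find in the given page text."""
    return [match.group("title") for match in PATTERN.finditer(html)]


old_titles = titles(OLD_HTML)  # the pattern matches the original markup
new_titles = titles(NEW_HTML)  # no match: the description file is now stale
```

Nothing fails loudly here. The client just gets an empty list, so the site has to remember to update its description file whenever the page layout changes.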
I’m not sure just how viable this whole idea is, but it certainly works for Feedable. As such, I’d like to throw it “out there”, so that it can at least stimulate some sort of discussion. Consider it a “brain stormer” idea: not necessarily viable, but it gets your mind going in different directions.