How NetNewsWire Avoids Parsing Feeds

NetNewsWire's code for reading feeds directly (not via a syncing system like Feedbin or Feedly) does its best to avoid parsing feeds.

Here's the thing: parsing a feed means not just parsing the feed; it also means comparing the parsed version to what's in the database. This is all work, and part of performing well is avoiding work.

Here's what it does:

Conditional GET

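I can't stress this strongly enough. When downloading a feed, NetNewsWire sends the appropriate headers (If-None-Match and If-Modified-Since, filled in from the ETag and Last-Modified values the server returned last time) to give the server a chance to respond with 304 Not Modified.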

It's quite simple: read The Fishbowl's "HTTP Conditional Get for RSS Hackers" (from way back in 2002!) for how this works.

This is such a great thing! It means less bandwidth used, less energy consumed, etc.
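
The client side of Conditional GET can be sketched in a few lines of Python. This is an illustration only, not NetNewsWire's actual code: `FeedCacheInfo`, `conditional_headers`, and `needs_processing` are hypothetical names, and the sketch assumes the ETag and Last-Modified values from the previous response were saved with the feed's metadata.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeedCacheInfo:
    """Validators saved from the previous download (hypothetical type)."""
    etag: Optional[str] = None           # value of the ETag response header
    last_modified: Optional[str] = None  # value of the Last-Modified response header


def conditional_headers(cache: FeedCacheInfo) -> Dict[str, str]:
    """Build the request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if cache.etag:
        headers["If-None-Match"] = cache.etag
    if cache.last_modified:
        headers["If-Modified-Since"] = cache.last_modified
    return headers


def needs_processing(status_code: int) -> bool:
    """A 304 means the feed is unchanged, so there is nothing to parse."""
    return status_code != 304
```

The returned dictionary can be handed to whatever HTTP client is in use. If the server has never sent either validator, the dictionary is empty and the request is an ordinary GET.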

Unfortunately, not every server implements the server side of Conditional GET. (Boo.) So we have a second method of avoiding work.

Feed Content Hashing

When NetNewsWire parses a feed, it creates a hash (MD5 is fine for this sort of thing) of the content of the feed.

It stores that hash along with other feed metadata.

The next time it downloads the feed, it generates a hash of the just-downloaded copy. If the new hash matches the old hash, then the feed hasn't changed, and we skip parsing it. Yay!
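
As a rough sketch of the hash check described above (the function names and the idea of passing the stored hash in as a string are assumptions for illustration, not NetNewsWire's real API):

```python
import hashlib
from typing import Optional, Tuple


def content_hash(feed_data: bytes) -> str:
    """MD5 of the raw downloaded bytes; used for change detection, not security."""
    return hashlib.md5(feed_data).hexdigest()


def check_feed(feed_data: bytes, stored_hash: Optional[str]) -> Tuple[bool, str]:
    """Return (should_parse, new_hash).

    should_parse is False only when the new hash equals the stored one.
    The caller is expected to save new_hash with the feed metadata after parsing.
    """
    new_hash = content_hash(feed_data)
    return new_hash != stored_hash, new_hash
```

A feed that has never been parsed has no stored hash, so passing `None` always results in a parse.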

Additional Fallbacks

NetNewsWire also looks at the content of the feed. If it's definitely an image and not an RSS feed, for instance, it doesn't attempt to parse it.

Yes, this kind of thing happens in the real world: I've seen it. (Once I even saw a feed URL return a movie file.)

We could do more here, but it's not often an issue, so it's not a high priority. Just a good-to-have.
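
One way to recognize "definitely an image" is to look at the first few bytes of the download. The technote doesn't say how NetNewsWire does this, so the sketch below is an assumption: it checks the well-known file signatures for PNG, JPEG, and GIF, and `is_definitely_image` is a hypothetical name.

```python
# Leading bytes ("magic numbers") of some common image formats.
_IMAGE_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF (1987 version)
    b"GIF89a",             # GIF (1989 version)
)


def is_definitely_image(data: bytes) -> bool:
    """True when the data starts with a known image signature."""
    return data.startswith(_IMAGE_SIGNATURES)
```

The check is deliberately conservative: anything that isn't recognized as an image still goes to the parser, which can reject it on its own.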

Thing It Never Does

Feeds sometimes contain dates for modification times. NetNewsWire doesn't trust these at all. In-feed dates are never used for making any decisions about parsing or not.

When an article has a modification date, that date is stored in the database. But it's there only in case it should be shown to the user. (Sometimes articles in a feed have a modification date but not a publication date. Why oh why? In that case we display the modification date.)
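
The display fallback might look something like this. The function name is hypothetical, and preferring the publication date when both dates exist is an assumption; the note above only covers the case where the publication date is missing.

```python
from datetime import datetime
from typing import Optional


def date_to_display(published: Optional[datetime],
                    modified: Optional[datetime]) -> Optional[datetime]:
    """Pick a date to show the user; this never influences whether a feed is parsed."""
    # Assumption: the publication date wins when present.
    return published if published is not None else modified
```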