Kuro5hin.org: technology and culture, from the trenches

Making RSS Scale

By Tod Friendly in Internet
Mon Sep 06, 2004 at 09:14:42 AM EST
Tags: Internet

I've been meaning to write about this since I saw a one-page article on the subject of RSS clogging websites' bandwidth in New Scientist back in June. But it wasn't until I saw this Slashdot story on the same subject that I knew I had to get my thoughts down. A lot of sites are suffering from a big problem: they are being systematically hammered by RSS newsreaders checking for new material at regular intervals.

When I first came across RSS around two years ago and did some reading up on it, it sounded like another interesting protocol for a nonproblem. I wondered what was wrong with just refreshing a "What's new?" page. (I have a natural scepticism towards new and exotic protocols.) Nonetheless, after RSS established a niche in the realm of informing us, rolling news style, of new network content, I recognised its usefulness in this situation. I did still feel that its design was suboptimal, no matter how many thousands of bloggers now relied on it.


The problem wasn't with the competing versions and the replacements spawned to compete in turn; I've never had a problem with a newsreader failing to read RSS. The problem is simply that RSS doesn't scale, due to a bad design choice.

When it comes to distributing regularly updated content in the form of articles or snippets, as RSS is used for, there are two fundamental methods that a content provider can choose from. They can either leave the resource out in the open and let people come now and again to see if it's been updated (the pull method, since clients pull the data for themselves), or they can allow people to subscribe to be sent the new data once it has been created (the push method, since the data is pushed to the client).

Now, it's often hard to say in advance which method a distribution protocol will follow, because designers don't often explicitly think to themselves, "Hmm, I think I'll implement this as a pull protocol." They start with an idea and maybe a mental approximation of an implementation, polishing it until it becomes sufficiently easy to use and handy. And most of the time, the developers get it right. But in this case, they got it wrong.

Pull works best when there are larger updates and changes and documents need to be stored in one accessible location for people to access when they want them. Push works best for smaller updates which are time-dependent and do not necessarily need to be kept in one central location. If pull is used for an application where push would be more appropriate, clients need to poll the content provider at random intervals. There isn't any simple way to decentralise the polling and so it can become a fatal drain on resources once the number of clients reaches the thousands.
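
To put rough numbers on that claim (a back-of-envelope sketch with made-up but plausible figures, not measurements):

    # Illustrative cost of naive polling: 10,000 subscribers, a 30 KB feed,
    # one poll every 30 minutes, and no conditional GET.
    subscribers = 10000
    feed_kb = 30
    polls_per_day = 48  # one poll every 30 minutes

    kb_per_day = subscribers * polls_per_day * feed_kb
    print("%.1f GB/day" % (kb_per_day / 1e6))  # ~14.4 GB/day

Nearly all of that traffic re-sends a file that hasn't changed.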

RSS is not a good mechanism for getting new content to a large number of people; it's just too hit and miss. Still, one would expect it to be usable even if it is pretty suboptimal. After all, it doesn't use up that much bandwidth and it doesn't have that much overhead. It'd take complete idiocy on the part of RSS designers, newsreader coders and clients to make it unworkable in terms of bandwidth.

So it's obvious to me that in its current state, RSS is doomed.

RSS newsreaders harvest RSS in terrible ways. Many hammer the server by checking for updates at intervals of only a few minutes. Other popular ones often cause what is effectively a DDoS attack, since they are programmed to always check at the same time past the hour, causing an hourly surge of bandwidth use. Still others don't use gzip compression, and ignore the HTTP headers in the server's response saying there are no new updates to download, so they continually download stale RSS files. Even when they don't, they redownload the whole file when so much as a single character changes.
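
For contrast, here is a minimal sketch of what a polite client would do (modern Python standard library; the function name is illustrative). It sends the conditional-GET headers and accepts gzip, so an unchanged feed costs a 304 response and a few hundred bytes instead of the whole file:

    import gzip
    import urllib.error
    import urllib.request

    def fetch_feed(url, etag=None, last_modified=None):
        req = urllib.request.Request(url)
        req.add_header("Accept-Encoding", "gzip")
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None, etag, last_modified  # nothing new; keep the old copy
            raise
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")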

These errors of implementation aren't too bad alone, but combined with each other and the sheer mass of naive users, they are deadly. RSS needs to be moved to a new distribution network which can handle bursts like this. A decentralised peer-to-peer network is probably the most appropriate solution. Unfortunately, most existing P2P software is based on a pull mechanism: you search for what you want and download it. There's no way to push updates to subscribers with the conventional software we're used to. The closest we've come is maybe BitTorrent, but it's still based on the pull paradigm.

(I can understand the scepticism regarding push. Push content was supposed to be the saviour of the Internet back in the last century, and it flopped miserably for exactly this kind of application. But now that it's plain pull just isn't going to work in this instance, we may as well switch to push.)

I started cruising around for a peer-to-peer system that followed push instead of pull. The closest I've found is konspire2b (aka konspire/k2b/kast); it's designed for distributing files to a series of subscribers. The key difference is that the bandwidth used by the content provider is not directly proportional to the number of subscribers; bandwidth limitations are mitigated by sharing the load out amongst channel subscribers via retransmission. There's a convenient web-based user interface, and the source of files cannot be faked by others. It fits perfectly:

konspire2b was designed specifically for "zero day" distribution of high-demand files that will be in low demand shortly after their release.

For a random client, all that's needed is to leave a daemon running in the background to pick up updates. Subscription can be done just by following a normal link embedded in a web page and confirming. When the updated RSS file is sent out, it's downloaded automatically. A normal newsreader can then be used to read it.

That's essentially the solution. It can be done with readily-available software. The only problem now is getting it going; the process hasn't been streamlined because it hasn't occurred to enough people before. It is slightly more difficult to run a dedicated broadcast daemon in addition to a web server to send out blog entries, and it might be very difficult for shared bloghosting services to use k2b for individual users. Something more specialised for this purpose would help, and might be what's needed first.

Still, I have at least given a justification for changing the current state of affairs and pointed to a stopgap solution. Yes, there are probably other ways of doing it. What's needed for things to change is for a series of popular bloggers to adopt it, or at least to push for something different. This could take a while.

Poll
How to implement an RSS replacement?
o NNTP 46%
o konspire2b 10%
o Apache module (HTTP) 10%
o some other way via HTTP 21%
o write-in 10%

Votes: 28

Related Links
o Slashdot
o this Slashdot story
o konspire2b
o a normal link embedded in a web page
o other ways
o doing it


Making RSS Scale | 87 comments (83 topical, 4 editorial, 3 hidden)
You were right the first time. (1.56 / 16) (#1)
by kitten on Sun Sep 05, 2004 at 12:55:21 PM EST

RSS, much like Bluetooth, is a solution looking for a problem.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
Yeah sure (2.50 / 6) (#2)
by spasticfraggle on Sun Sep 05, 2004 at 01:06:18 PM EST

And in other news, BSD is dying.

--
I'm the straw that broke the camel's back!
[ Parent ]
Indeed... (3.00 / 3) (#10)
by araym on Sun Sep 05, 2004 at 04:47:33 PM EST

Since Windows XP Service Pack 2 has proven to be the most secure OS ever I don't think any Linux distribution has much chance anymore.

-=-
SSM

[ Parent ]
Damn (2.66 / 3) (#33)
by truth versus death on Mon Sep 06, 2004 at 03:35:23 AM EST

Sometimes you really wish K5 had a funny moderation option.

On a related note, K5 needs to get 5 ratings back. This place is too boring with all the 3's.

"any erection implies consent"-fae
[ Trim your Bush ]
[ Parent ]
Lose numerical moderation altogether, say I. (2.50 / 2) (#36)
by pwhysall on Mon Sep 06, 2004 at 07:21:40 AM EST

Just use bananas/op-ed hats/ramen noodles/etc to indicate what the shuffling zombie hordes^W^W^Wkuro5hin.org readers think.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]
Good enough for me (none / 0) (#47)
by truth versus death on Mon Sep 06, 2004 at 12:48:55 PM EST

Just so long as it is not 3.

"any erection implies consent"-fae
[ Trim your Bush ]
[ Parent ]
We've had this debate before, I think. (2.80 / 5) (#17)
by ubernostrum on Sun Sep 05, 2004 at 06:50:02 PM EST

Your position is based on only having ever seen RSS used on weblogs. There are wider and better applications where RSS becomes a highly useful tool.




--
You cooin' with my bird?
[ Parent ]
Bluetooth? (3.00 / 5) (#18)
by pwhysall on Sun Sep 05, 2004 at 06:53:38 PM EST

I use it to sync my phone and to transfer data between phone and computer.

It's a dandy technology.

Just because you don't have (a) any devices that use it or (b) any use for it doesn't mean that other people also don't.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

That's neat, Peter. (1.20 / 5) (#31)
by kitten on Mon Sep 06, 2004 at 01:34:24 AM EST

So it was invented in the absence of any real problem, and people found things to do with it -- things that were being handled adequately through other means already. This proves nothing about the usefulness of the BT technology itself.

Someone will always find a use for something, no matter how inherently useless it was to begin with.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
[ Parent ]
As usual (2.40 / 5) (#35)
by pwhysall on Mon Sep 06, 2004 at 07:05:47 AM EST

kitten displays his ignorance of a technology by declaring it useless.

Look. I know you are congenitally incapable of admitting you're wrong. That's OK. You're not alone.

Bluetooth is a useful technology to some subset of the set of people that is !kitten.

And you're just going to have to accept that, because it's true.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

when... (none / 1) (#46)
by shokk on Mon Sep 06, 2004 at 11:26:00 AM EST

are people going to learn that just because they find something useless, it may still be useful to others? People like that think they are everyman, and anyone unlike them is just... wrong.

I find Bluetooth useful just because it works wirelessly between my headset and phone, and between the phone and my PC for data sync. If someone else finds it useful for some other reason, well, they've adapted the tech for their need, not for what a million other people think it is for. If it is not useful to you, move on.

I find RSS useful so that I don't have to surf across dozens of pages looking for updates and news.  One glance at my Feed On Feeds reader tells me whether there is an update at Penny Arcade, various Sourceforge projects, Dilbert, SANS newsbites, Slashdot, Mozilla, PHP, Apache, etc.  If there is no update, then I have more time to do other things.  If there are updates, 90% of those need to be looked into right away since they directly impact my job.  Plainly this is an efficiency benefit if I am letting my system comb the web for me rather than doing this myself one page at a time.  If you can say that you have plenty of time to comb all those web pages on your own every day, then you should already be in someone's sights for being replaced.  If it is not useful to you, move on.

Anyway, I believe that caching services like NewsIsFree and MyRSS are the future for RSS. That is basically a hierarchy of readers and can be extended further. Sites like Slashdot will enforce this, because if you check too often you get IP-banned for days. What do they do for NAT'd addresses? Well, they probably ban an IP if there are too many clients behind that firewall checking the address, so right there you have a need for another level of caching. Since once an hour is an optimal interval for checking sites, the cache itself can actually do the work of requesting no more frequently than that. The cache can pull once an hour itself, or wait for another client to request the pull, as long as it is more than one hour from the last request.
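
A rough sketch of that cache layer (Python; all names invented for illustration): clients can hit the cache as often as they like, but the origin feed is fetched at most once an hour.

    import time

    REFRESH_SECONDS = 3600  # "once an hour is an optimal interval"
    _cache = {}             # url -> (fetched_at, body)

    def cached_feed(url, fetch):
        # fetch is whatever function actually downloads the feed
        now = time.time()
        entry = _cache.get(url)
        if entry and now - entry[0] < REFRESH_SECONDS:
            return entry[1]  # served from cache; the origin server is not touched
        body = fetch(url)
        _cache[url] = (now, body)
        return body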

My beef is in the shoehorning of RSS to do other work, like email messages to replace IMAP, though that may be useful to someone else, in which case I move on. Clearly the idea is spreading like wildfire, so I doubt that it is a solution looking for a problem. Deep linking and RSS readers have changed web navigation these days, so that every page is a potential frontpage and must be presented with nearly the same emphasis.

"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
[ Parent ]

Hi. Hope you don't mind me calling you on this. (2.75 / 4) (#37)
by Ta bu shi da yu on Mon Sep 06, 2004 at 08:04:51 AM EST

What technology was doing the same things that Bluetooth does now?

---
AdTI - "the think tank that didn't".
[ Parent ]
I don't mind at all. (none / 1) (#59)
by kitten on Mon Sep 06, 2004 at 08:27:48 PM EST

My arguments outlined here.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
[ Parent ]
Thank you. (none / 1) (#60)
by Ta bu shi da yu on Mon Sep 06, 2004 at 11:45:19 PM EST

That's all I ask for.

---
AdTI - "the think tank that didn't".
[ Parent ]
Who's asserting? (none / 1) (#57)
by pwhysall on Mon Sep 06, 2004 at 06:40:49 PM EST

You're the one who's decided it's "useless" and that it's "highly unlikely" that anyone would have any real use for it, but hey.

Your mind is made up, and, a la Jean Luc Picard, it is so.
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

Should be reply to #56. (none / 0) (#58)
by pwhysall on Mon Sep 06, 2004 at 06:41:30 PM EST


--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]
Not just me, big shifter. (none / 1) (#69)
by kitten on Tue Sep 07, 2004 at 08:24:57 AM EST

But keep ignoring the thousands of others in IT who have the same complaints I do. It's really useful because you use it for silly things. I, of course, still use my old 8 track player, which must of course mean that it's useful too.
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
[ Parent ]
Yeah! It's practically a democracy. (none / 1) (#73)
by Zerotime on Wed Sep 08, 2004 at 12:17:58 AM EST



---
"You don't even have to drink it. You just rub it on your hips and it eats its way through to your liver."
[ Parent ]
Doesn't matter. (none / 1) (#74)
by kitten on Wed Sep 08, 2004 at 12:23:18 AM EST

Peter's argument basically rested on two things:

1. That he personally finds it useful -- a dubious argument at best and utterly irrelevant at worst.
2. That I personally don't know what I'm talking about. By providing him with links to dozens of other people more knowledgeable than either him or me who are also voicing complaints, that ad hominem strategy is taken away from him.

mirrorshades radio - darkwave, synthpop, industrial, futurepop.
[ Parent ]
I concede. (none / 1) (#76)
by pwhysall on Wed Sep 08, 2004 at 01:56:38 AM EST

Yes, you're perfectly correct, kitten.

Bluetooth is useless. You have spoken! Look! Some people don't like it/can't get it working/etc! Therefore everything I say is true!

I can find literally millions of complaints on the Web about Windows - does this make it useless?
--
Peter
K5 Editors
I'm going to wager that the story keeps getting dumped because it is a steaming pile of badly formatted fool-meme.
CheeseBurgerBrown
[ Parent ]

Kinda like weblogs, eh spunky? (2.00 / 3) (#19)
by buck on Sun Sep 05, 2004 at 07:42:43 PM EST


-----
“You, on the other hand, just spew forth your mental phlegmwads all over the place and don't have the goddamned courtesy to throw us a tissue afterwards.” -- kitten
[ Parent ]
Caching (2.66 / 9) (#3)
by mcc on Sun Sep 05, 2004 at 01:36:24 PM EST

Couldn't many of the problems caused by RSS be alleviated by the widespread use of RSS caches of some sort at the router or ISP level? As far as I'm aware, this is the standard solution for bandwidth waste caused by pull-based transmission methods being used to communicate an essentially push-based information channel; and it can be done without convincing people to change protocols, which as we all know is pretty much an impossible thing to do.

Great article btw.

---
Aside from that, the absurd meta-wankery of k5er-quoting sigs probably takes the cake. Especially when the quote itself is about k5. -- tsubame

Yeah, cache application data at the network level (1.05 / 18) (#6)
by Reiko the Hello Kitty Fetishist on Sun Sep 05, 2004 at 03:41:39 PM EST

You fucking moron. You'd better stick to whiney socialist screeds, it suits you better than technical material.

But what do I know? I just buy worthless plastic crap because it's cute.
[ Parent ]
Hello (none / 1) (#9)
by mcc on Sun Sep 05, 2004 at 04:23:37 PM EST

This is mister "world wide web", and he would like to be your friend! I apologize for the lack of illustrations.

[ Parent ]
Helllllooooo douchebag. (1.00 / 5) (#13)
by Reiko the Hello Kitty Fetishist on Sun Sep 05, 2004 at 05:14:23 PM EST

Clearly you did not take my advice the first time around.

But what do I know? I just buy worthless plastic crap because it's cute.
[ Parent ]
konspire (3.00 / 5) (#5)
by WorkingEmail on Sun Sep 05, 2004 at 03:24:00 PM EST

If you actually read any of the details on konspire, you would have noticed that it is overkill for small files. The protocol overhead scales linearly with the number of subscribers, and it IS a significant overhead.


Unless your RSS feeds are more than 64 kB, you should give konspire a miss and just push directly to the subscribers.


-1, Blogs (2.25 / 12) (#7)
by Reiko the Hello Kitty Fetishist on Sun Sep 05, 2004 at 03:42:56 PM EST

This is a problem that doesn't need a solution. It's a solution to a problem: bloggers.

But what do I know? I just buy worthless plastic crap because it's cute.
Push doesn't scale either (3.00 / 11) (#8)
by TheBobby on Sun Sep 05, 2004 at 04:01:11 PM EST

A few issues with Push that your article doesn't cover.
  1. People don't stay on a static address. With Pull they come to you.
  2. You have to push regardless of whether they're connected or not - you don't know their status.
  3. You're still sending out a large volume of data, and no site caches will help with push, unlike pull.

-- Gimmie the future with a modern girl!
Hence the use of a peer-to-peer framework. (2.33 / 3) (#24)
by Tod Friendly on Sun Sep 05, 2004 at 08:23:07 PM EST

Which happens to quite tidily solve those problems. Oops!

echo ${BASH_VERSINFO[$[$RANDOM%${#BASH_VERSINFO}]]}
[ Parent ]
Push and Peer to Peer? No thanks. (none / 0) (#77)
by TheBobby on Wed Sep 08, 2004 at 04:21:59 AM EST

Peer to Peer has some good uses. Bulk delivery of data where it isn't time-critical or reliability-critical is one of them.

Reliable, timely delivery of information is not one of them. Depending on the right network of people to be on at the right time to deliver information to me is not practical. That's worse than push from a server. This is completely beside issues such as authenticity of information.

Given a peer-to-peer network for RSS I'll hack Perl and build a screenscraper for myself, so I know that the information I get is real and timely.
-- Gimmie the future with a modern girl!
[ Parent ]

Actually... (none / 1) (#61)
by guinsu on Tue Sep 07, 2004 at 01:42:23 AM EST

Actually, you forgot the most important reason why people will insist on Pull for RSS - it makes it impossible to receive spam through it.

[ Parent ]
What kind of overhead does RSS have? (1.20 / 5) (#11)
by gizzlon on Sun Sep 05, 2004 at 04:57:20 PM EST

Shouldn't be a big overhead for such a simple operation... or?

ø.s
g

feh (2.60 / 10) (#12)
by reklaw on Sun Sep 05, 2004 at 05:03:42 PM EST

The internet is never going to switch to push. Just get the fuck over it. I thought we saw the death of this "omg we're in the wrong paradigm" shit at the end of the dotcom boom.
-
I call troll. (2.50 / 2) (#23)
by Tod Friendly on Sun Sep 05, 2004 at 08:21:18 PM EST

This isn't 1997. I'm not trying to push push for purposes of making cash by forcing adverts on users via low-quality applications. It's about making P2P work for automatically distributing small updates to a large number of people, Usenet style. Usenet was founded on what's essentially a push principle, and it's chugged along fine since the 1970s. You, sir, are trolling, and smoking the cheap $3 crack.

echo ${BASH_VERSINFO[$[$RANDOM%${#BASH_VERSINFO}]]}
[ Parent ]
Uh? (none / 1) (#30)
by reklaw on Sun Sep 05, 2004 at 09:44:53 PM EST

How is Usenet push? You mean the same way email is push?

In that case, you could just email people updates instead of using RSS.
-
[ Parent ]

Why mailing lists don't cut it (none / 0) (#51)
by reftel on Mon Sep 06, 2004 at 04:14:07 PM EST

How is Usenet push? You mean the same way email is push?

Usenet isn't push, but email sort of is. NNTP works by two-way synchs, so the whole push/pull thing doesn't really apply. Email, on the other hand, is two-level: push among the servers, and pull between client and server.

In that case, you could just email people updates instead of using RSS.

One of the big advantages to RSS is that you don't need to give out any email addresses. You don't have to send subscribe and unsubscribe messages. There's just a client that feels like a specialized web browser (which it really is).

RSS also simplifies the server, which probably was a major factor in getting the whole thing started - just uploading a flat file to a web server somewhere was enough. No server-side software to install.

What this all comes down to is that this is Really Simple Syndication. While email would seem to be able to solve the problem, it didn't, mainly for usability reasons. RSS is easier to set up and use than mailing lists, so people use it instead. There really doesn't have to be more to it than that.



[ Parent ]
-1, outdated. (2.28 / 7) (#15)
by ubernostrum on Sun Sep 05, 2004 at 06:45:47 PM EST

A year and a half ago this would have been a relevant article. Nowadays people support HTTP 301 and aggregators (and RSS itself) allow you to sanely throttle the number of requests.

Also, "pull" vs. "push" is a stupid debate. -1 just for mentioning it.




--
You cooin' with my bird?
Make that HTTP 304... (2.50 / 2) (#16)
by ubernostrum on Sun Sep 05, 2004 at 06:46:33 PM EST

Damned typo.




--
You cooin' with my bird?
[ Parent ]
Question: why is it a stupid debate? (none / 1) (#21)
by Tod Friendly on Sun Sep 05, 2004 at 08:15:37 PM EST

Pull is failing; take a look at the Slashdot article I linked. Why is it suddenly so wrong to discuss push? Just because it was misapplied back in 1997 doesn't make it automatically bad now.

echo ${BASH_VERSINFO[$[$RANDOM%${#BASH_VERSINFO}]]}
[ Parent ]
Yes, look at Slashdot. (3.00 / 5) (#25)
by ubernostrum on Sun Sep 05, 2004 at 08:32:34 PM EST

Pull is failing and *BSD is dying. What's your point again?

But seriously, it's a stupid debate because it's Yet Another Holy War. Arguing about it accomplishes nothing and solves no problem, and that's not good for the push side because push has problems which need to be solved before it's applicable to anything (caching, assumptions of connectivity, etc.).

And push really has no relation whatsoever to this problem; you go from "Aggregators don't respect HTTP 304" to "pull means RSS is doomed" with really no supporting argument in between.

Then you throw in another buzzword, peer-to-peer, without justifying how that would solve the problem; so we're going to have P2P content syndication to avoid hammering the source, eh? Who grabs it from the source originally to be the first peer on the network, then? You'd need to invent a hierarchical protocol like NTP, with first-tier, second-tier, and so on syndicating, which is frankly far too complicated to be called Really Simple Syndication, now isn't it?

As I said originally, the problem is in the client software; teach it to behave nicely -- respect the 304, obey the RSS syndication module's specs on frequency, etc. (you do know about the syndication module and updateFrequency, right? And if you don't have that there's always skipHours as a last resort) -- and the problems you're describing go away.
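
To make the frequency hints concrete, here is a sketch of the arithmetic a compliant aggregator would do (the period/frequency semantics are those of the RSS 1.0 syndication module; the function itself is illustrative):

    # sy:updateFrequency counts updates per sy:updatePeriod, so the minimum
    # sensible poll interval is period / frequency.
    PERIOD_SECONDS = {
        "hourly": 3600,
        "daily": 86400,
        "weekly": 604800,
        "monthly": 2629800,
        "yearly": 31557600,
    }

    def min_poll_interval(update_period="daily", update_frequency=1):
        # e.g. updatePeriod=hourly, updateFrequency=2 -> poll at most every 30 min
        return PERIOD_SECONDS[update_period] / max(1, update_frequency)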

And for the most part, these problems are going away; Slashdot did a lot for the update DDoS problem by banning IPs which polled too often, and the major aggregators and aggregator libraries are coming into line (for example, the Universal Feed Parser and Magpie RSS, two of the most popular base libraries for writing aggregators), hence my criticism of your article as outdated.




--
You cooin' with my bird?
[ Parent ]
Failure to anticipate the future. (none / 1) (#20)
by outis on Sun Sep 05, 2004 at 08:01:58 PM EST

Surely the amount of bandwidth we have available will increase as time passes? I can't imagine that the bandwidth used by RSS-hungry clients could ever choke a website. Do you have numbers I can look at?

Alas, bandwidth is still expensive. [nt] (none / 1) (#22)
by Tod Friendly on Sun Sep 05, 2004 at 08:18:34 PM EST



echo ${BASH_VERSINFO[$[$RANDOM%${#BASH_VERSINFO}]]}
[ Parent ]
Solutions at the wrong ISO layer... (2.50 / 2) (#28)
by NoMoreNicksLeft on Sun Sep 05, 2004 at 08:56:16 PM EST

Always have these problems (some less than others, bittorrent is pretty damn cool for an L7 solution).

You need some sort of multicast, but I don't like traditional multicast protocols either. My own silly little protocol is pure layer 4, but the problem it's meant to solve isn't the same one this article is about (does seem to help, though).

Bittorrentesque solutions seem ripe for DoS attacks. When you start seeing your k5 feed full of goatse stories...

--
Do not look directly into laser with remaining good eye.

"When you start..." (1.50 / 2) (#34)
by warrax on Mon Sep 06, 2004 at 06:25:08 AM EST

When you start seeing your k5 feed full of goatse stories...
You just unsubscribe from the channel. Problem solved.

-- "Guns don't kill people. I kill people."
[ Parent ]
Filtering (none / 0) (#67)
by ffrinch on Tue Sep 07, 2004 at 06:26:31 AM EST

Well, ideally the next generation of feed readers will all have decent filtering and grouping capabilities. I wouldn't want to drop K5 just because of a few goatse links, just like I wouldn't want to drop BoingBoing just to get rid of the Disney, SARS art, flashmob (etc) stories.

-◊-
"I learned the hard way that rock music ... is a powerful demonic force controlled by Satan." — Jack Chick
[ Parent ]
Easy solution... (2.33 / 6) (#32)
by Skywise on Mon Sep 06, 2004 at 02:02:01 AM EST

Torrent the sucker.  When everyone makes a run for an RSS feed, everybody distributes it.  Problem solved.

This made it past moderation?!? (2.20 / 5) (#38)
by trezor on Mon Sep 06, 2004 at 09:27:02 AM EST

Oh, what the hell. I might be trolling the subject like everyone else, I guess.

A -better- solution, if bandwidth is the problem here, is an RSS2 spec (which would be more or less RSS1-compatible). Better in my eyes, since all this P2P/push-nonsense is evading the point.

Let ordinary, intelligent CGI handle the shit. Let the RSS-reader fetch the feed from http://$GivenAdress?LastUpdate=$LastFetchWithAnyNews, and if there's nothing new, a polite disconnect will inform the client that there is nothing new here. If there is something new, well, make sure that's all you are sending.

Voila! Bandwidth saved.

And this is a really simple thing to implement. As far as all the "but this will require new clients" objections go, they apply to any of the other propositions as well. And this is really just a minor change.

It would anyway sure as hell be simpler and less bandwidth hogging than some P2P-solution with tons of handshakes and even a tracking(!)-server for a <10KB feed.

This has gotta be the worst "technology" article I've seen here since localroger's undatabased search engine. No offense, but this is really just stupid.


--
Richard Dean Anderson porn? - Now spread the news

Missing the point (2.40 / 5) (#39)
by UnConeD on Mon Sep 06, 2004 at 10:00:23 AM EST

You don't need CGI for that: HTTP already has a mechanism to check for changes.

The problem is that RSS readers are crap and don't respect them.

[ Parent ]

As far as missing the point goes... (none / 1) (#40)
by trezor on Mon Sep 06, 2004 at 10:06:51 AM EST

That might very well be true. I won't bother checking that out, so I'll just assume you are right.

But when I said "evading the point" in my original post, it was more that this isn't really a very interesting discussion.

My point, and I'll admit I didn't even air it, is that if a website is having trouble delivering tiny RSS feeds to crappy clients, then the problem isn't really "scaling RSS".

At this point there is no chance in hell the site will be able to deliver actual content when the user clicks his RSS-delivered HTTP link to the full article with full content.

Said simply: when a site has so many visitors it can't deliver RSS, it sure as hell can't deliver full content. Scaling the RSS will not really be of great help at this point.

So, I may have missed a point here or there, ok, no biggie, but the concept of scaling RSS without taking the load of the actual content into the picture is just stupid.

At least, that is my opinion.


--
Richard Dean Anderson porn? - Now spread the news

[ Parent ]
Lessee. (none / 1) (#42)
by i on Mon Sep 06, 2004 at 10:44:01 AM EST

At present there are more RSS-requests than content-requests. How much more? My guesstimate is 10 to 1000 times. Worse, their timing is not random, so let's say at peak load there are 100 to 10000 RSS-requests for each content-request. However lightweight, those RSS-requests are potentially a problem.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
In that case... (1.50 / 2) (#43)
by trezor on Mon Sep 06, 2004 at 10:47:28 AM EST

See my excellent trolling in the first post.

It should sort most of these problems out. How much bandwidth does a TCP_DISCONNECT() consume?


--
Richard Dean Anderson porn? - Now spread the news

[ Parent ]
AFAICT (none / 1) (#44)
by i on Mon Sep 06, 2004 at 11:08:47 AM EST

4 packets of 6 bytes each. Note that the original unnecessary request can be several times larger than that. How large a dent can it poke in server performance? I dunno, but my gut feeling is that it's not negligible.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
not really the same thing (none / 1) (#41)
by boxed on Mon Sep 06, 2004 at 10:29:26 AM EST

"changed since last?" is quite different from "give me ONLY the things new since last time"

[ Parent ]
Hi. (none / 1) (#54)
by ubernostrum on Mon Sep 06, 2004 at 05:13:35 PM EST

You don't need CGI for that: HTTP already has a mechanism to check for changes.

The problem is that RSS readers are crap and don't respect them.

It's OK to admit that you know even less about the subject than this article's author. Really, it is.




--
You cooin' with my bird?
[ Parent ]
Really? (none / 0) (#65)
by UnConeD on Tue Sep 07, 2004 at 05:45:53 AM EST

Conditional GET includes an 'If-Modified-Since' header. Passing such a value through a CGI query would be redundant and non-standard.

Now if the RSS is generated dynamically (think RSS feeds of search results/queries), you could send only the new items since the last check.

Most RSS readers/aggregators keep old entries anyway (and have to figure out which items are really new every time they check a feed).

None of which would matter one bit, because most RSS software is of abysmal quality. Most RSS auto-discovery will not respect the <base> tag in HTML, so you have to waste space by making all your RSS <link>s absolute. RSS readers will rarely handle character encodings correctly either, completely denying their existence or just converting everything into a non-Unicode encoding, which results in lovely question marks if you try to read more than one language. And as pointed out, most RSS readers have the HTTP skills of a 4-year-old.


[ Parent ]

no fucker (none / 1) (#68)
by boxed on Tue Sep 07, 2004 at 06:27:23 AM EST

There's a huge difference between conditional get and getting a list of everything that is new since a given time. The latter will give MORE information than current ugly RSS feeds in some cases, almost always less, and many times NO data. It will give the data that is required, not more and not less. RSS with conditional GET will get too much in many cases and too little in others.

[ Parent ]
Again... (none / 0) (#80)
by ubernostrum on Wed Sep 08, 2004 at 10:21:41 PM EST

It's OK to admit you don't know anything about the current market. Keep right on showing your solidarity with the article's author by complaining about problems in aggregators which were huge a year and a half ago. I know this is tough on your worldview, but aggregators and feed-parsing libraries have changed since then. The ones that haven't are enjoying rapidly-declining market share.




--
You cooin' with my bird?
[ Parent ]
Bad idea (none / 1) (#55)
by Jim Dabell on Mon Sep 06, 2004 at 06:01:03 PM EST

Let ordinary, intelligent CGI handle the shit. Let the RSS-reader fetch the feed from http://$GivenAdress?LastUpdate=$LastFetchWithAnyNews, and if there's nothing new, a polite disconnect will inform the client that there is nothing new here. If there is something new, well, make sure that's all you are sending.

Voila! Bandwidth saved.

Actually, it's "Voila! Even more traffic." Your "solution" is actually worse.

The current method clients use is this:

  1. Request the resource, providing If-None-Match and If-Modified-Since headers.
  2. If an intermediate proxy has a cached copy, it can serve that.
  3. If the resource hasn't changed, the server can respond with a 304 Not Modified response.
  4. Only if the resource has changed will it be downloaded.

An important thing to note is that, if you configure decent caching headers, and receive enough traffic to be worried about bandwidth, then your RSS feed will often come from an intermediate proxy and not from your server.

Your suggestion: http://$GivenAdress?LastUpdate=$LastFetchWithAnyNews indicates a unique resource for every client, every single request. It's uncacheable, meaning your bandwidth bill shoots up.

There's also no need for it - it's a replica of the If-Modified-Since header, except it doesn't work anywhere near as well.

As the original article stated, there are a few badly-behaved clients out there that do not transmit If-Modified-Since and don't understand 304 responses. The majority of them do, however, and it makes far more sense to fix them than it does to come up with a new, non-standard, broken way of doing conditional GETs and implement that instead.
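
The server side of that exchange is also simple. A minimal sketch (Python standard library; the feed path and the md5-based ETag scheme are made up for illustration, and the If-Modified-Since check is omitted for brevity):

    import email.utils
    import hashlib
    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    FEED = "feed.xml"  # hypothetical static feed file

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = open(FEED, "rb").read()
            etag = '"%s"' % hashlib.md5(body).hexdigest()
            if self.headers.get("If-None-Match") == etag:
                self.send_response(304)  # client's copy is current; send no body
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("ETag", etag)
            self.send_header("Last-Modified",
                             email.utils.formatdate(os.path.getmtime(FEED), usegmt=True))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8080), FeedHandler).serve_forever()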



[ Parent ]
Well, that's a good point. (none / 0) (#63)
by trezor on Tue Sep 07, 2004 at 05:00:19 AM EST

However I also mentioned sending only what was new, which should balance the equation somewhat.

And this is somewhat different from a pure HTTP 304-based system.


--
Richard Dean Anderson porn? - Now spread the news

[ Parent ]
A possible solution? (none / 0) (#66)
by trezor on Tue Sep 07, 2004 at 06:19:47 AM EST

A "compromise" of the 304-solution and my CGI-solution would be a CGI-based HTTP-redirect.

It would work like this. The client would request as I originally proposed:

    http://$GivenAdress?LastUpdate=$LastFetchWithAnyNews

The CGI would do a header-based redirect to a new location with static content, based on last fetch, or simply disconnect if there's nothing new. In case of redirection it would be an address somewhat like:

    http://$GivenAdress-NewSince$WhateverCGIfindsOut

This should be cacheable for proxies, and would save enormous bandwidth.

If the site owner doesn't want dozens of RSS files hanging around, he can of course dynamically generate an RSS feed for really "old" requests, instead of redirecting, since "old" requests probably won't have much of a caching benefit anyway.

Or he can use the 404-handler to dynamically create RSS feeds for the kinds of addresses mentioned above, and make them seem static to any source. There are tons of solutions for making this work.

All in all it should be dead simple to implement.

This should satisfy your criteria for cacheability as far as I can see, and it also involves huge bandwidth savings by not sending more data than necessary.

Ok. If this RSS thing is a problem, this is my solution for RSS2, and it should shave the bandwidth issues down pretty well.
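
A loose sketch of that redirect step (Python; every name here is invented): the dynamic endpoint never serves feed data itself, it only redirects to a static, proxy-cacheable file named after the window in which the feed last changed.

    import time

    def redirect_target(base_url, feed_last_changed):
        # Round the change time down to the hour so that every client asking
        # within the same window is redirected to the same cacheable URL.
        window = int(feed_last_changed // 3600) * 3600
        return "%s-NewSince%d.xml" % (base_url, window)

    # redirect_target("http://example.org/feed", time.time())
    #   -> something like "http://example.org/feed-NewSince1094500800.xml"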


--
Richard Dean Anderson porn? - Now spread the news

[ Parent ]
Proposed solution: RSS proxy (3.00 / 2) (#45)
by Bluelive on Mon Sep 06, 2004 at 11:09:34 AM EST

Uses a central server that polls the feeds used by the clients in a friendly manner (reducing load by the number of clients). The changes are pushed to the clients via an IRC-style network, and each client runs a loopback HTTP server where his own RSS tool requests the feeds at as high a frequency as you like (typically four times that of the server) to reduce message lag. This way the client greatly reduces his internet traffic, and the hosting sites have much less load, which might then allow the server to poll more often.

Simplicity of implementation (none / 0) (#49)
by ChaseBase on Mon Sep 06, 2004 at 03:01:51 PM EST

Even with the various problems of push mentioned above, it seems intuitive that a push could be written more efficiently than a pull on a massive scale. That's why everyone tried push first. The model of broadcasting made sense.

Everyone wanted the simplest model, but there are two different values at play. From a subscriber's perspective the simplest interface is push. But it is far simpler to implement pull, especially in the case of RSS feeds. Implementation simplicity wins again. The very thing that makes a feed so simple, that it is just a static file sitting on a web server, also makes it somewhat inefficient, as Tod points out.

All that being said, push is not going to swamp the Internet... that is the same thing people said about voice over IP, and video on demand, and VPNs. You could almost substitute any one of those things into the initial post.

It would seem the Internet has mechanisms to self-regulate greedy behavior. A certain level of efficiency will be required from feed readers if they don't want to be treated as Denial of Service attacks. This level will escalate as bandwidth needs rise, until only good-citizen feed readers operate successfully (perhaps with glaring exceptions).

Conditional GET (3.00 / 2) (#52)
by sv on Mon Sep 06, 2004 at 04:20:44 PM EST

Conditional GET is the way to go.

It's just good manners (none / 0) (#53)
by mariox19 on Mon Sep 06, 2004 at 05:12:34 PM EST

I built a little Java application that transforms RSS into HTML. It allows people to subscribe to an RSS feed and incorporate the info into their Web page:

http://synserv.sf.net

It of course checks the modification time from the Web server before downloading the feed. Some servers, however, are not set up to return modification time, unfortunately. Still, my program will not send a regular GET until a specified amount of time has elapsed (default: 15 minutes).

A conditional GET (as you call it) is just good manners.



[ Parent ]
Yup. (none / 0) (#84)
by Freaky on Sat Sep 11, 2004 at 10:13:09 AM EST

I'm tempted to ban RSS access to clients which don't support conditional GET. Content compression negotiation should really be a requirement too; if you don't support these things, you're costing me more bandwidth than you need to, and that's just rude. It's not as if these things are hard or expensive to implement :/

[ Parent ]
No (none / 1) (#62)
by Xtapolapocetl on Tue Sep 07, 2004 at 02:33:04 AM EST

RSS is fine (although Atom is better...) The problem is broken readers - those that pull at a given time instead of x minutes since the application was opened, those that don't respect HTTP headers, those that don't support compression, and so on.

~Xtapolapocetl

--
zen and the art of procrastination

An extension proposal for RSS or Atom syndication (none / 1) (#64)
by gusnz on Tue Sep 07, 2004 at 05:40:47 AM EST

As many other commenters have noted before me, pure "Push" failed for many reasons, including dynamic IP addresses, restrictive firewalls, etc. Regular "Pull" techniques Just Work (tm), and as such got RSS off the ground.

However, one issue that must be taken into consideration is the fact that any proposal really should be an extension, rather than a replacement, for the current popular syndication formats; we have enough in the way of RSS variants plus Atom drafts already. (This is not intended as a slight against the Konspire2B project, by the way!)

With well behaved readers using Conditional GETs and HTTP 304s, the overload problem is somewhat diminished, but I think that we could meld together two more ideas in a backwards-compatible way...

Dynamic Proxying: Popular feed readers should include an HTTP server component that can be activated by altruistic users, or is perhaps active by default. Requests sent from these clients should contain amongst the requesting HTTP headers:

X-FeedCache-URI: http://12.34.56.78:9012/www.remotesite.org/index.xml
X-FeedCache-MaxPerMinute: 30
X-FeedCache-ValidSeconds: 180

to indicate that they are willing to serve up 30 requests a minute for a period of three minutes, mirroring the original feed at www.remotesite.org.

More redirect status codes: The syndication formats should mandate that all compliant clients must understand a full, or near-full, list of HTTP status codes. In conjunction with the above proposal, clients in particular must parse a "307 Temporary Redirect" response along with a "Cache-Control" field to redirect to a temporary P2P-mirrored copy of the file for the duration that the mirror is willing to serve the feed. It is assumed that the specification would allow further requests to the original URL if the chosen mirror was offline.

These would require minor changes and would be completely backwards compatible with existing clients and the RSS/Atom syndication format, yet give most of the benefits of a P2P approach, allowing the original server to redirect requests out to a network of peers. Workable, or not?


[ JavaScript / DHTML menu, popup tooltip, scrollbar scripts... ]

It doesn't address authenticity... (none / 0) (#72)
by piranha jpl on Tue Sep 07, 2004 at 10:18:10 PM EST

It doesn't address authenticity, so it would be trivial for a malicious user to, eg, offer to "mirror" feeds it is downloading, and serve those feeds to peers with goatse articles interspersed. Perhaps an X- header could be added to the server response to give the SHA1 hash of the most recent feed file?

Also, addressing the delta updates problem would be nice.

- J.P. Larocque
- Life's not fair, but the root password helps. -- BOFH

[ Parent ]

It's a simple fix as you point out. (none / 0) (#75)
by gusnz on Wed Sep 08, 2004 at 01:47:12 AM EST

I realised I left that out just after hitting "Post" :). Yeah, the original server should just include a hash of the contents, as most P2P systems do these days.

Delta updates would be cool too, but could really be layered on top of a distribution protocol (i.e. include a standardised ?lastdate=foo query param or header; compliant servers can just trim down their output). Mirrors could even be later extended to keep their own copies up to date that way.


[ JavaScript / DHTML menu, popup tooltip, scrollbar scripts... ]

[ Parent ]

One problem (3.00 / 3) (#70)
by m50d on Tue Sep 07, 2004 at 12:26:25 PM EST

We are part of an increasingly firewalled internet. Some ISPs will only offer you a NAT-based service. This is only going to increase with the IP address shortage, at least until the internet moves to IPv6. In this environment having a push system for content aimed mostly at average home users is simply not going to work.

RSS is successful *because* of polling (3.00 / 2) (#71)
by geof on Tue Sep 07, 2004 at 10:07:49 PM EST

Like HTML, the technology is so simple that anyone can create an RSS file and upload it to any old web server. If RSS had required more than this, it never would have succeeded.

I wrote a slightly longer version of this argument when I read the original Slashdot article.



Alas (none / 0) (#82)
by MrLaminar on Fri Sep 10, 2004 at 03:16:17 AM EST

konspire2b is pretty much dead. I had used it one year ago, back when there were still 10-25 channels to be found on k2b-announce. I installed it again yesterday, only to find out that probably the only person on the planet who was logged in was me.

And I don't think the developer can blame the users. I am not arguing about Push vs. Pull, but the whole idea seems a little too non-interactive for the intarweb. What's the point of having a machine online 24/7 waiting for "file broadcasts"? Sure, if it were used for RSS distribution, it might be nice, but since many users out there still use dial-up (or some other connection which does not allow them to be constantly online), the konspire2b approach, although cool in its implementation, fails to find acceptance.

OTOH, now that the network is empty, it might be a good idea to try it out in a small circle of filesharing participants.



"Travel & Education. They will make you less happy. They will make you more tolerable to good people and less tolerable to bad people." - bobzibub
Trivial hack/solution (none / 0) (#83)
by VE3MTM on Fri Sep 10, 2004 at 04:26:24 PM EST

Since a large part of the problem is the surge of connections on common intervals (half hour, hour, etc.), wouldn't a simple solution be to use intervals such as 31 minutes, or 61 minutes? That way, clients would tend to distribute themselves and smooth out the surges: each poll "shifts" by a minute, and the clients weren't all started at the same point in the hour to begin with.

Sure, it doesn't solve the problem (though, as other people have stated, the problem really is broken clients, not RSS itself), but it can ease the load on the servers.
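
A sketch of the idea (Python; the numbers are illustrative): with a 61-minute interval, a client that starts on the hour drifts by a minute per poll, and a little random jitter spreads clients out even further.

    import random

    INTERVAL = 61 * 60  # 61 minutes rather than 60

    def next_poll_time(last_poll):
        jitter = random.uniform(0, 300)  # an extra 0-5 minutes of spread
        return last_poll + INTERVAL + jitter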

Use Coral (none / 1) (#85)
by l3nz on Sun Sep 12, 2004 at 07:39:53 AM EST

The Coral P2P cache could be a serious answer to the question of RSS scalability. This URL - http://www.pangea.va.it.nyud.net:8090/backend.php - points to Pangea's RSS feed accessed through Coral; as you can see, you simply have to add .nyud.net:8090 to the server name. In my logs I now only find the RSS feed loaded from "CoralWebPrx/0.1 (See http://www.scs.cs.nyu.edu/coral/)" at 129.137.253.252 (102458 bytes generated in 170 ms) every once in a while. :-)

Popk ToDo lists - yet another web-based ToDo list manager. 100% AJAX free :-)

What about push? (none / 0) (#87)
by werner on Thu Sep 23, 2004 at 06:06:37 PM EST

A push-type system would seem to be the obvious answer: like a mailing list. And, just like a mailing list, e-mail would seem to be the obvious solution; leveraging existing protocols and standards.

The RSS feed could be contained in/attached to (zipped) the email or sent as a link to an RSS file (dynamically generated) containing all the new items since the last update. The RSS reader could access the mailbox directly or receive e-mails piped on the way to/from your regular client, and act accordingly.

Okay, it's surely not the most efficient solution, and it does require modifications to RSS readers, but at least it leverages existing technology and minimises the changes necessary to existing clients and servers.

I couldn't say for definite that it would require less bandwidth than if all clients observed the conditional GET, but I suspect, in the case of larger feeds, it would. Additionally, a push-based system puts the control back in the hands of the service provider: the provider can throttle the sending of the updates and, therefore, not have to worry so much about every client grabbing the feed on the hour and maxing out the server's capacity.

For years, we've all been receiving notification of updates to our favourite websites via e-mail. Wouldn't it be ideal, if these e-mails all conformed to a certain standard which could be parsed, accumulated and organised by a program dedicated to showing you the latest news?

Thoughts?


