Imagine, if you will, a 10K page being sent to the client. (I'll pick 10K as a nice, round number, since most things in my Netscape cache are right around 10K.)
The Way Things Work Now: Client makes a connection to the server and tells it what it wants; the server spits a 10K file down the wire. End of line.
With HTTP Delta: Client generates some sort of checksum against what it wants, makes a connection to the server, and tells it what it wants and what it already has. The server looks at the file and generates some kind of ID/hash of its own, which it compares to what the client sent. If they're different, the server has to figure out how to generate its diff, do so, and then finally send a 9K diff back down the wire.
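To make the two flows concrete, here's a minimal Python sketch of the server side. Every name in it (handle_delta_get, old_versions, the choice of a SHA-1 hash and a plain unified diff) is my own illustration, not anything the proposal actually specifies:

    import hashlib
    import difflib

    def handle_plain_get(current_body):
        # The way things work now: just spit the whole file down the wire.
        return 200, current_body

    def handle_delta_get(client_id, old_versions, current_body):
        # old_versions: a hypothetical store mapping a version ID to the
        # full body the client claims to have cached. Where this store
        # comes from is exactly the problem discussed below.
        current_id = hashlib.sha1(current_body.encode()).hexdigest()
        if client_id == current_id:
            return 304, ""                     # client's copy is current
        base = old_versions.get(client_id)
        if base is None:
            return 200, current_body           # no old version: send it all
        delta = "".join(difflib.unified_diff(
            base.splitlines(keepends=True),
            current_body.splitlines(keepends=True)))
        return 200, delta                      # the "9K diff"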
Ok, so this is an extreme example, but you get the idea.
And I'm assuming anything larger than 10K is probably going to be either an image or a completely dynamically-generated page, in which case, if it's different from what the cache already holds, there's a very good chance the whole thing is going to have to be sent over the wire anyway, making this whole scheme utterly useless. (What are the odds that a 100K image is still 90% the same bits after it gets changed?)
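If that sounds hand-wavy of *me*, here's a toy Python experiment (mine, not the paper's) showing what a one-byte change does to compressed data, which is more or less what an image is:

    import zlib

    a = b"x" * 1000 + b"hello world" + b"y" * 1000
    b = b"x" * 1000 + b"jello world" + b"y" * 1000   # one byte changed

    ca, cb = zlib.compress(a), zlib.compress(b)
    matching = sum(x == y for x, y in zip(ca, cb))
    # The inputs are 99.95% identical, but typically only a few bytes of
    # the compressed streams line up, because the change perturbs the
    # encoding of everything around it.
    print(matching, "of", min(len(ca), len(cb)), "compressed bytes match")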
The author points to some studies that show that some percent of some transactions have some percent of things different and does a bunch of hand-waving to make it look like those values actually mean anything and have any bearing on delta encoding a response. At least he mentions that those percent are going to be different based on Content-type.
What exactly is the server generating that diff against? Does everyone have to keep their site in CVS now? Or better yet, some proprietary HTTP Delta format? If the server doesn't keep "old" versions of its files around, then the only way to generate that diff is for the client to send its entire cached copy all the way back to the server so the server can compare the two and then send the diff!
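Just tally up the bytes on the wire for that "solution" (using the hypothetical 10K page and 9K diff from above):

    # Bytes moved over the wire, using the made-up numbers from above.
    plain_get        = 10 * 1024                # just send the current file
    round_trip_delta = 10 * 1024 + 9 * 1024     # cached copy up, diff down
    # 19K moved in order to avoid sending 10K. Some savings.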
Again, the author barely even does any hand-waving to explain where exactly the server is going to find its older version of the file to make this diff, which in my opinion is possibly the most important part of making this whole scheme work!
Also, keep in mind that large CVS servers have to hold up under a pretty severe load when lots of people are accessing them. Imagine what kind of hardware would have to drive a site like Slashdot, for example, if it had to not only generate a new page for every client request but then also generate a diff against an arbitrary older version of that page.
How exactly would a site like Slashdot generate a diff against, for example, a particular comment page as of yesterday at 5pm? Does that mean any time someone accesses a dynamically-generated page, that generated page has to be given an ID number and stored statically on the server? A site like Slashdot, which generates tens or hundreds of thousands (or millions?) of hits, and therefore pages, a day, would pile up a tremendous amount of data. Who's going to buy them their terabyte array to store all that?
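Back-of-envelope, with a made-up (but not crazy) hit count:

    hits_per_day = 1_000_000          # assumed: dynamic page views per day
    page_bytes   = 10 * 1024          # the 10K page from above
    per_day  = hits_per_day * page_bytes
    per_year = per_day * 365
    print(per_day  / 2**30, "GB/day")   # ~9.5 GB/day of stored snapshots
    print(per_year / 2**40, "TB/year")  # ~3.4 TB/year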
I didn't see the author take this into consideration at all. He mentions right away that today's dynamic sites generate a LOT of their content ENTIRELY dynamically, but never bothers explaining how that fits in with keeping diffs around. Keeping diffs for a site like Slashdot or Kuro5hin would require a detailed transaction log of every REQUEST made to the site, something no sane system architect would ever consider implementing.
I'd like to see some concrete studies on whether this would *really* be more efficient than, say, just paying attention to the "Last-Modified" header in a proxy request. i.e.:
The proxy requests a piece of data and includes an "If-Modified-Since" header carrying the "Last-Modified" timestamp of the copy currently in its cache; the server checks that timestamp against the file it has locally, and if they match it sends a "no cache update necessary" (304 Not Modified) response; if they're different, it spits its 10K over the wire.
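The entire mechanism, sketched in Python with the timestamps simplified to plain seconds-since-epoch:

    def handle_conditional_get(if_modified_since, file_mtime, body):
        # if_modified_since: timestamp from the proxy's If-Modified-Since
        # header; file_mtime: last-modified time of the file on disk.
        if if_modified_since is not None and file_mtime <= if_modified_since:
            return 304, ""         # "no cache update necessary"
        return 200, body           # changed: spit the 10K over the wire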
I'm going to guess that the amount by which HTTP Delta encoding beats just using the "Last-Modified" header is: none at all.
This guy might know what he's talking about, but in this case, he sure doesn't seem to know what he's talking about.