Improving HTTP caches with delta encoding

By Anonymous Commando in Technology
Tue Nov 14, 2000 at 04:15:46 PM EST
Tags: Internet

Look at the most popular sites these days - database-driven, with dynamic stock quotes / poll results / news stories on the front page... but in reality, very little data on the page changes. Look at the K5 front page - every so often a new article is posted, or the poll totals change, but for the most part, the actual data that changes is pretty small.

I've always thought that networks of caching proxy servers are a great system for reducing bandwidth use and increasing (at least perceived) network performance. However, the biggest problem with proxy servers (and even browser caches) is the trade-off between cache performance and "stale" data. I just came across an article about HTTP Delta Encoding that presents an interesting system to improve how caches update stale data - only send the parts of the file that have changed.


Here's the basic idea:
  • For each document served, the server includes in the HTTP response headers some sort of ID tag that identifies the "version" of that particular file
  • When a client requests a file, it sends the ID tag of the copy it has in its cache in the HTTP request headers
  • If the ID tag in the request is different from the ID tag of the current file, the server returns a "patch" against the file in the client's cache, as well as the updated ID tag (a rough sketch of this exchange follows)
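Purely as an illustration - this is not the paper's wire format, and the header names, status codes, and use of a plain unified diff are my own assumptions - the server-side decision might look something like this in Python:

    # A minimal sketch, assuming an MD5 digest of the page body as the ID tag
    # and a plain unified diff as the "patch". Header names, status codes and
    # the in-memory history are illustrative, not the paper's actual protocol.
    import difflib
    import hashlib

    history = {}  # tag -> page body previously served (retention policy not shown)

    def make_tag(body):
        return hashlib.md5(body.encode("utf-8")).hexdigest()

    def respond(current_body, client_tag):
        tag = make_tag(current_body)
        history[tag] = current_body
        if client_tag == tag:
            return 304, {"ETag": tag}, ""              # client copy is current
        base = history.get(client_tag)
        if base is None:
            return 200, {"ETag": tag}, current_body    # no base to diff against
        patch = "".join(difflib.unified_diff(
            base.splitlines(keepends=True),
            current_body.splitlines(keepends=True)))
        return 200, {"ETag": tag, "X-Delta-Base": client_tag}, patch

A real deployment would also need some marker in the response telling the client that the body is a delta rather than a full page, but that detail is omitted here.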
The author claims that for "delta-eligible" documents (i.e. a document that is already cached, and that has changed on the source - about 10% of all requests, about 25% of all HTML requests), bandwidth savings can be as high as 60%. For all traffic, he estimates 8.5% bandwidth savings and a 6% improvement in latency. For HTML traffic, he estimates up to 24% savings in bandwidth. I haven't bothered to check the math, but intuitively it feels like it's probably close to the truth.

A few issues came to mind while writing this story for K5:

  • What would be the best way of generating the ID tag on the server side? (One possibility is sketched just after this list.)
  • For database-driven sites, how much more server-side processor load would this require?
  • Would the web server have to keep a "history" cache of every page served over the last X hours/minutes/days to be able to provide a reliable "patch"?
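On the first question, one cheap possibility - sketched below under the assumption that a database-driven page is built from a known set of rows, with made-up table and column names - is to derive the tag from the rows' own version counters or timestamps instead of hashing the rendered HTML, so an unchanged page costs one small query rather than a full render:

    # Hypothetical helper: the tag is a hash of (id, last_modified) pairs for
    # the rows that feed the page. Assumes a DB-API style connection object.
    import hashlib

    def page_tag(db, story_ids):
        digest = hashlib.md5()
        placeholders = ",".join("?" * len(story_ids))
        rows = db.execute(
            "SELECT id, last_modified FROM stories "
            "WHERE id IN (%s) ORDER BY id" % placeholders, story_ids)
        for story_id, last_modified in rows:
            digest.update(("%s:%s;" % (story_id, last_modified)).encode("utf-8"))
        return digest.hexdigest()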

Let's face it - these days, bandwidth is cheap, and squeezing out an extra few percent doesn't seem all that important right now, but it's good to remember that there wasn't much need for fuel-efficient cars in North America before the "oil crisis" of the 1970s.

By the way, the author of the paper is Jeff C. Mogul of Compaq's Western Research Lab, who is, among other accomplishments, a co-author of the HTTP/1.1 specification.

Poll
Is "delta encoding" a good idea?
o Yep. 36%
o Nope. 10%
o Maybe. 42%
o Huh? 9%

Votes: 73

Related Links
o networks of caching proxy servers
o HTTP Delta Encoding
o Jeff C. Mogul
o Compaq's Western Research Lab


Improving HTTP caches with delta encoding | 22 comments (22 topical, 0 editorial, 0 hidden)
Server load... (3.80 / 5) (#1)
by japhar81 on Tue Nov 14, 2000 at 03:19:48 PM EST

This is a great idea, and something I toyed with a bit in the olden days of CGI-driven DB apps. The only problem I kept running into, and I don't know of any new tech (except maybe the gigahertz chip) that will affect it, is the server load. You're either looking at caching the 'patches' for each pair of versions you might serve (one for a to z, one for s to z, etc., provided a was version 1 and z is the latest copy) or generating them on the fly, which means caching all of the old versions. What this seems to wind up leading to is a lot of server-side caching, processing, and, unless you've got oodles of RAM, disk access. End result: lots of server horsepower needed. It almost seems pointless; it's easier to upgrade the network, I'd imagine. But still, a very cool idea.
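To make that storage trade-off concrete, here's a small sketch (all names invented) of the kind of bounded per-URL history a server would need: keep only the last few versions of each page, and fall back to sending the full page whenever the client's copy is older than anything retained.

    # Keep at most MAX_VERSIONS old bodies per URL; older clients just get a
    # full response instead of a patch. A toy illustration only.
    from collections import deque

    MAX_VERSIONS = 4
    versions = {}  # url -> deque of (tag, body), newest last

    def remember(url, tag, body):
        versions.setdefault(url, deque(maxlen=MAX_VERSIONS)).append((tag, body))

    def find_base(url, client_tag):
        for tag, body in versions.get(url, ()):
            if tag == client_tag:
                return body
        return None  # no usable base: send the whole page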

Rome is always burning, and the younger generation never respects its elders. The time of your second coming, japhar81, is no exception. -- Aphasia
Oil crisis? (3.50 / 10) (#2)
by trhurler on Tue Nov 14, 2000 at 03:19:50 PM EST

There's no comparison here. Bandwidth is not something you consume and then it is gone; as you build more capacity, you are permanently enhancing the infrastructure - and the infrastructure is improving at an ever-increasing pace. Frankly, saving less than 10% at the expense of adding many, many proxy servers, writing complex and therefore difficult new software, and introducing opportunities for all kinds of man-in-the-middle semantic attacks just doesn't seem worth it when you can spend far less time and money and get far more bandwidth by just improving routers, switches, cabling, and so on. Not only is this true today, but it is somewhat silly to posit that it would ever not be true - if nothing else, you can always simply add more cables between points A and B. I could see this being a viable special-purpose technology in a few niche markets, but that's about it.

--
'God dammit, your posts make me hard.' --LilDebbie

Why a version tag? (3.57 / 7) (#3)
by Inoshiro on Tue Nov 14, 2000 at 03:33:42 PM EST

Wouldn't a low-collision hashing algorithm, like MD5 or RIPEMD, provide a more effective way of encoding the message "there has been a change"? It'd also help defeat any man-in-the-middle attacks by doing careful checking of previous checksums and diffs against what is sent (although it'd be more effective if it also had some form of public-key handshake routine beforehand).

Then again, I'm paranoid :-)



--
[ イノシロ ]
Come on now... (2.66 / 3) (#4)
by japhar81 on Tue Nov 14, 2000 at 04:14:35 PM EST

You're talking about database updates to sites like K5... Why in God's name would you need to do that sort of secure checking? What possible motive could someone have for playing man-in-the-middle? "I am computer god script-kiddie, I kept you from seeing that story for a few extra minutes!!!! HAHAHAH!!!"

Rome is always burning, and the younger generation never respects its elders. The time of your second coming, japhar81, is no exception. -- Aphasia
[ Parent ]
Two reasons (3.25 / 4) (#6)
by trhurler on Tue Nov 14, 2000 at 04:41:31 PM EST

One, because no technology is ever "just used" for what it was envisioned as being. Two, because you can. It isn't hard, and it is superior. If you're going to do it, why not do it right?

--
'God dammit, your posts make me hard.' --LilDebbie

[ Parent ]
Stock Quotes (4.00 / 4) (#7)
by kallisti on Tue Nov 14, 2000 at 04:43:31 PM EST

Although I doubt K5 would get spoofed, the ability to spoof stock quotes and news could be disastrous. Here's an article about one guy who made a quarter million that way. Of course, he did get caught but others might be harder to find. This leads directly into Bruce Schneier's "semantic attacks".

That said, I don't see why adding the delta would make a man-in-the-middle attack any easier to implement.

[ Parent ]

Traffic and templates (4.22 / 9) (#5)
by jesterzog on Tue Nov 14, 2000 at 04:41:26 PM EST

Another good way to reduce the amount of web traffic would be to simply notice that 90% of most web pages are templates.

Look at the HTML source of any normal website, k5 included, and the majority of code on every page is identical. Some structures are repeated over and over again (such as divisions around each comment), and only a small amount of information - usually hidden at the end of each document - actually changes between the pages.

There's a lot of technology on the server side that takes this into account, with databases on the back end being run through scripts to generate mostly identical, easily updatable and repetitive pages. IMHO, it would be really beneficial to have client-side technology as well, though. This would mean transmitting the instructions about how to generate the page instead of the entire page. The template only has to be sent once, it could be cached by the browser, and data can be updated much faster.

There are some potentially good XML ways to do this. I was messing around with XSL a couple of months ago, but at least at the moment it's not really feasible because only IE supports it, and not all that well either. Does anyone have any experience with doing anything like this in web pages?


jesterzog Fight the light


Traffic and templates (4.33 / 3) (#10)
by CubeDweller on Tue Nov 14, 2000 at 09:38:54 PM EST

This would mean transmitting the instructions about how to generate the page instead of the entire page.

I agree with the beginnings of your idea, though I have to admit I got a chuckle out of the line above. After all, is not HTML already a set of instructions for generating a page? So in effect, you propose sending instructions for generating a set of instructions.

Anytime I see anything about having instructions for generating other instructions, I take a big step back. That usually indicates the solution is to add more simplicity, not more complexity. If it's really true that 90% of all dynamic content is based on templates, something I could easily believe, then isn't that what we really need? ...templates?

What I'd like is to see the entire site based on one or more templates with all of the dynamic content on that site geared towards those templates. As a client, I make a first request and am given a small document carrying only the dynamic content and a matching template identifier. I check my cache to see if I have the template specified. If not, I request that probably somewhat larger template. Then I keep that template around and re-use it so that I only need to request the small content documents in the future.

As you and others have already pointed out, this is done server-side millions of times a day. On server-side I believe it's done primarily for maintainability. Why don't we do it client side as well?

I've never understood why it's always the server's responsibility to merge the content with the template. Clearly, when consistency and reliability of data are requirements, the server can't have clients using arbitrary methods, and it's good that the server does the merge. On non-critical sites like Kuro5hin, though, why not? Let the client do it and save on the re-transmission of templates.


Or maybe this is too simplistic or already possible. Not exactly my area of expertise...

Seth

[ Parent ]
Traffic and templates (3.00 / 2) (#12)
by CubeDweller on Tue Nov 14, 2000 at 09:51:09 PM EST

Nevermind. This is already what you're suggesting. I agree with you on all counts except the wording of that one sentence. Next time I read one comment at a time =)


Seth

[ Parent ]
XML and XSL can do this.... (4.50 / 2) (#15)
by 11oh8 on Wed Nov 15, 2000 at 04:59:01 PM EST

Every page has a link to the related template, interface or whatever you wanna call it.. This would be an XSL file.. After the first request, the XSL would be in the browser cache and only the XML file (which would only be the data and no other display-related junk) would be transmitted... The browsers would use a Conditional GET to check if the XSL has changed... If you want to change the look of the site, change the XSL, and then when browsers do a Conditional GET, they'll be notified that their copy is stale and get the new XSL....
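As an illustration of that Conditional GET (the URL and the stored validator are placeholders, and an ETag is used here rather than a date):

    # Revalidate a cached stylesheet: a 304 means our copy is still fresh.
    import urllib.error
    import urllib.request

    def fetch_stylesheet(url, cached_body, cached_etag):
        req = urllib.request.Request(url)
        if cached_etag:
            req.add_header("If-None-Match", cached_etag)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("ETag")  # stylesheet changed
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return cached_body, cached_etag               # cached copy still valid
            raise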

One difficulty with the above system is that you have to be much more organized about your content.. There has to be a strict separation between content and layout. All this is good design, but it's not always favored when a big deadline looms over you.

With XML/XSL there's also the difficulty of making it work with all browsers. Website designers are afraid of using even DHTML because of incompatibilities between browsers.. that's why Flash is so popular even when you can achieve most of Flash's features using DHTML (HTML + CSS + Javascript).. With Flash, if it's installed on the client side, then it is consistent regardless of browser, platform, whatever......

At the end of the day, for a manager/developer it is a lot easier to buy more hardware and bandwidth rather than take chances with the client's browser...

11oh8

[ Parent ]
client side templates (3.00 / 2) (#11)
by tonescope on Tue Nov 14, 2000 at 09:40:39 PM EST

I like this idea.... not sure how we could implement it... essentially we are just talking about client-side includes... hmmm.. anyone got any ideas on how this can be easily done...
-0 no sig
[ Parent ]
XML! (4.50 / 2) (#16)
by CaseyB on Wed Nov 15, 2000 at 05:04:36 PM EST

This would mean transmitting the instructions about how to generate the page instead of the entire page. The template only has to be sent once, it could be cached by the browser, and data can be updated much faster.

There are some potentially good XML ways to do this.

This is precisely where XML and XSL fit in! You receive and cache a single "kuro5hin article" XSL template, and the server needs only to fire over the raw content after that. The client manages the assembly of the page. So we get extended functionality, and less load on the server to boot.

Other neat stuff: the entire site can be "reskinned" by replacing the XSL sheets. They use the same XML data, but can radically change the way it's presented.

but at least at the moment it's not really feasible because only IE supports it,

This isn't necessarily a showstopper -- the XML/XSL parser can run either on the client or server side of the system. You can let "smart" (heh) clients like IE do it client side, and fall back to server-side parsing otherwise. The same data is used in either case.
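A sketch of that fallback, assuming the lxml library and a deliberately naive User-Agent check (neither of which the comment itself specifies):

    # Send raw XML to clients believed to handle XSL themselves; otherwise
    # apply the same stylesheet on the server. File names are made up.
    from lxml import etree

    transform = etree.XSLT(etree.parse("article.xsl"))

    def render(xml_path, user_agent):
        if "MSIE 5.5" in user_agent:                  # crude "smart client" check
            return open(xml_path, "rb").read(), "text/xml"
        result = transform(etree.parse(xml_path))
        return str(result).encode("utf-8"), "text/html"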

and not all that well either. Does anyone have any experience with doing anything like this in web pages?

I'm lucky enough (no, really) to be working on the development of an application that can dictate its platform, and it is going to function in precisely this manner. Once an XSL sheet is loaded, nearly all of the traffic between client and server will be accomplished by sending only the required data, in XML format, back and forth. And we aren't even bandwidth-constrained -- this is simply a better way of doing things for us. Having access to the unprocessed data on the client side after the page is built allows for all sorts of useful client-side data manipulation as well.

The XML support currently in IE 5.5 is more than adequate for our needs.



[ Parent ]

Proxy-XSLT engine? (none / 0) (#21)
by costas on Fri Nov 17, 2000 at 08:37:33 PM EST

Why rely on IE? Can't you grab the XSLT engine off, say, Jakarta, and stick it in a web proxy? So instead of asking for http://server/index.html the proxy would ask for http://server/index.xml, cache the XSLT if need be, do the transform, and pipe it down to IE/Netscape/Lynx.

The problem isn't really technical; it's one of standards: if there were a clean way (through HTTP headers perhaps?) to ask for preferably XML content over HTML content, a simple proxying engine could do this today.


memigo is a news weblog run by a robot. It ranks and recommends stories.
[ Parent ]
Accept: text/xml (none / 0) (#22)
by bgdarnel on Fri Nov 17, 2000 at 08:48:59 PM EST

The problem isn't really technical; it's one of standards: if there were a clean way (through HTTP headers perhaps?) to ask for preferably XML content over HTML content, a simple proxying engine could do this today.
There already is a defined HTTP header for this purpose: "Accept:", defined in section 14.1 of the HTTP 1.1 spec. This XSL proxy sounds like a good idea to me; too bad I don't have time to work on such a beast.
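A proxy or origin server could act on that header with something as rough as the following sketch (wildcards and most edge cases are ignored):

    # Very rough content negotiation: does text/xml carry a higher q-value
    # than text/html in the Accept header?
    def prefers_xml(accept_header):
        prefs = {}
        for part in (accept_header or "").split(","):
            fields = part.strip().split(";")
            mtype = fields[0].strip()
            q = 1.0
            for param in fields[1:]:
                param = param.strip()
                if param.startswith("q="):
                    q = float(param[2:])
            prefs[mtype] = q
        return prefs.get("text/xml", 0.0) > prefs.get("text/html", 0.0)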

[ Parent ]
I Strongly Disagree (3.70 / 10) (#8)
by Ledge Kindred on Tue Nov 14, 2000 at 05:47:42 PM EST

Imagine, if you will, a 10K page being sent to the client. (I'll pick 10K as a nice, round number, since most things in my Netscape cache are right around 10K.)

The Way Things Work Now: Client makes a connection to the server and tells it what it wants, server spits a 10K file down the wire. End-of-line.

With HTTP Delta: Client generates some sort of checksum against what it wants, makes a connection to the server and tells it what it wants, the server looks at the file and generates some kind of ID/hash which it compares to what the client sent, if they're different, the server has to figure out how to generate its diff, do so, and then finally send a 9K diff back down the wire.

Ok, so this is an extreme example, but you get the idea.

And I'm assuming anything larger than 10K is probably either going to be an image or a completely dynamically-generated page, in which case, if it's different from what the cache already holds, there's a very good chance the whole thing is going to have to be sent over the wire anyway, thereby making this whole scheme utterly useless. (How much chance do you think there is that a 100K image is going to be 90% the same bits if it gets changed?)

The author points to some studies that show that some percentage of some transactions have some percentage of things different, and does a bunch of hand-waving to make it look like those values actually mean anything and have any bearing on delta-encoding a response. At least he mentions that those percentages are going to be different based on Content-type.

What exactly is the server generating that diff against? Does everyone have to keep their site in CVS now? Or better yet, some proprietary HTTP Delta format? If the server doesn't keep "old" versions of their files around, then the only way to generate that diff is for the client to send the entire file all the way back to the server so it can compare it and then send the diff!

Again, the author barely even does any hand-waving to explain where exactly the server is going to find its older version of the file to make this diff, which in my opinion is possibly the most important part of making this whole scheme work!

Also, keep in mind, large CVS servers actually have to hold up under a pretty severe load when a lot of people are accessing them. Imagine what kind of hardware would have to drive a site like Slashdot, for example, if it had to not only generate a new page for every client request, but then generate a diff against an arbitrary older version of that page.

How exactly would a site like Slashdot generate a diff against, for example, a particular comment page as of yesterday at 5pm? Does that mean any time someone accesses a site that's been dynamically generated, that generated page is going to have to be given an ID number and stored statically on the server? A site like Slashdot that generates tens or hundreds of thousands (or millions?) of hits, and therefore pages, a day would generate a tremendous amount of data. Who's going to buy them their terabyte array to store that stuff?

I didn't see the author take this into consideration at all. He mentions right away that today's dynamic sites generate a LOT of their content ENTIRELY dynamically but never bothers explaining how that fits in with keeping diffs around. Trying to keep diffs against a site like Slashdot or Kuro5hin would require making a detailed transaction log of every REQUEST made to the site - something that no sane system architect would ever consider implementing.

I'd like to see some concrete studies to find out if this would *really* be more efficient than just say, paying attention to the "Last-modified" header in a proxy request. i.e.:

Proxy requests this piece of data and includes a "Last-modified" timestamp on the content it has currently in its cache, the server checks the "Last-modified" timestamp on the request the proxy made and if it's the same as what's on the file it has locally it sends a "no cache update necessary" response, if they're different it spits its 10K over the wire.
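That exchange, sketched here for a static file using stdlib date helpers (with a 304 standing in for "no cache update necessary"):

    # Compare the client's If-Modified-Since against the file's mtime and
    # answer 304 when nothing changed; otherwise send the bytes and a fresh
    # Last-Modified. A toy sketch for a static file only.
    import os
    from email.utils import formatdate, parsedate_to_datetime

    def conditional_response(path, if_modified_since):
        mtime = int(os.path.getmtime(path))
        if if_modified_since:
            client_time = parsedate_to_datetime(if_modified_since).timestamp()
            if mtime <= client_time:
                return 304, {}, b""
        headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
        return 200, headers, open(path, "rb").read()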

I'm going to guess that the improvement HTTP delta encoding gives you over simply honoring a "Last-Modified" header is: none at all.

This guy might know what he's talking about, but in this case, he sure doesn't seem to know what he's talking about.



Maybe too strongly? (none / 0) (#18)
by YellowBook on Wed Nov 15, 2000 at 09:09:07 PM EST

What exactly is the server generating that diff against? Does everyone have to keep their site in CVS now? Or better yet, some proprietary HTTP Delta format? If the server doesn't keep "old" versions of their files around, then the only way to generate that diff is for the client to send the entire file all the way back to the server so it can compare it and then send the diff!

I was going to give you a gentle chiding for this, because it's obviously not true that the only way to send only the changed parts of a file is to either keep the old version around or to have the client send you one -- think rsync. It seems clear to me that a revised HTTP that supported the rsync algorithm would be of clear benefit, and would have none of the problems that you mention.

Then I actually read the article (silly me), and saw that the author specifically dismisses rsync because it is less bandwidth-efficient for very small changes. The system the author is promoting really does have the problems you mention. Huh. NIH syndrome, maybe? Rsync looks like the obvious approach given the problem.

I still don't think this kind of delta encoding is as stupid as you think it is. It's a tradeoff between bandwidth and server resources (disk and CPU). Which has been getting cheaper faster?

As for the problem of dynamically-generated sites, that's going to bedevil you no matter what kind of caching you try to do (looking at the Last-Modified header availeth you not when it's generated along with the rest of the page on each request). Delta encoding doesn't suffer from this any worse than plain caching.



[ Parent ]
Check out the work done with XDELTA here (3.60 / 5) (#9)
by Paul Crowley on Tue Nov 14, 2000 at 07:03:18 PM EST

Josh McDonald's XDELTA was used to this end. There's a PDF report describing it.
--
Paul Crowley aka ciphergoth. Crypto and sex politics. Diary.
XML/XSL (3.33 / 3) (#13)
by Burb on Wed Nov 15, 2000 at 04:08:53 AM EST

Given that so many pages are templates (as mentioned elsewhere in the discussion), then surely client-side XSL should be able to handle this?

I know that XSL-capable browsers are not universal. In theory if you structure your site using XML/XSL, you can do browser detection and expand the template at the server prior to transmission if the browser does not support XSL.

Well, that's the theory. Can anyone point me to a site that does it?

Why XSL? Here's an Easier Way (none / 0) (#19)
by moshez on Thu Nov 16, 2000 at 05:08:14 AM EST

In X time, we've all moved from HTML to XHTML. (Assumption). Now, we can use XLL, aka XLink, with the "include" variant to ease the pain, and here's how: if the pages are structured as static files which contain XLink include links wherever there's dynamic content, then our job is done. Examples from the article: the internet.com stock price would be at /internet.com-price.xml and the static page with all the frills would simply include it. Weblogs such as K5 can have lots of links on the main page instead of lots of text. Each of these links can be separately cached. Because of HTTP/1.1's ability to maintain a connection for several downloads, this would have very little negative impact on bandwidth. It's easier on the servers, since the server doesn't have to keep a log of all previous versions.

[T]he k5 troll HOWTO has been updated ... This update is dedicated to moshez, and other bitter anti-trolls.
[ Parent ]
When? (none / 0) (#20)
by Burb on Thu Nov 16, 2000 at 06:40:25 AM EST

Any idea when this stuff will see the light of day? Or has it already?

[ Parent ]
This is being worked on as we speak ... (3.83 / 6) (#14)
by nekonoir on Wed Nov 15, 2000 at 07:04:41 AM EST

Andrew Tridgell (of Samba fame) has done some very interesting work on this problem. If you are familiar with his other creation rsync or saw his talk on the rsync protocol at CALU or another conference, you'll know what I'm talking about.

Basically, rsync takes a rolling CRC and/or an MD4 hash of each 'block' of data. Note: a rolling checksum, so misaligned blocks are still detected. Now, if you incorporate this protocol into a web server and a client, you can cut bandwidth requirements by 90% on 'template'-based dynamic sites.
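For the curious, the weak rolling checksum at the heart of that scheme is small enough to sketch (this roughly follows the formulas in the tech report; the MD4 strong hashes and the block-matching protocol are left out):

    # The rsync-style weak checksum of a block (a bytes object), plus the O(1)
    # update that slides the window one byte to the right - which is what makes
    # misaligned blocks cheap to find. The combined value is a + (b << 16).
    M = 1 << 16

    def weak_checksum(block):
        a = sum(block) % M
        b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
        return a, b

    def roll(a, b, old_byte, new_byte, block_len):
        a = (a - old_byte + new_byte) % M
        b = (b - block_len * old_byte + a) % M
        return a, b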

All this and more (plus sample code) is online at http://www.linuxcare.com.au/projects/rproxy/ . The guts of the rsync protocol are explained (by Tridge!) at http://rsync.samba.org/rsync/tech_report/ . Hopefully this will make it into Apache and Mozilla. For those of us who live in countries that charge for bandwidth by the megabyte, this has enormous potential. Likewise for countries with third-world infrastructure. Something tells me Zimbabwe won't be rolling out ADSL any time soon.

Gaming (3.00 / 2) (#17)
by BonzoESC on Wed Nov 15, 2000 at 07:43:08 PM EST

Something like this has been done in twitch gaming since the original QuakeWorld. Pre-QW, the client was very closely bound to the server. If there was packet loss or momentary congestion, the player was paralyzed. Once the connection returned, the player usually found themselves a smear on the wall, or fragged. Then somebody figured out that the client could do some work on its own, such as detecting collisions with walls, letting the player make some decisions without waiting for the server.

In the 1010 Half-Life netcode, the client machine and the server are synched up so that the client sends a timestamp of the last server snapshot it was given. If the server has more recent data than the client had when the player made his move, the server looks back in time to see what the situation was like when the player moved. If it sees that the player's sniper shot was valid when he made it, the target is dead. This makes the game much more 56k-friendly. Unfortunately, those of us who fork out the big bucks for broadband get crazy things happening, like blowing the lagger away and running behind a wall, at which point his shot comes in and wipes us out - behind the wall.
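A toy sketch of that "look back in time" check (all names invented, positions one-dimensional to keep it short):

    # Keep a short history of an entity's positions and validate a shot
    # against the snapshot in effect at the time the client says it fired.
    import bisect

    class PositionHistory:
        def __init__(self):
            self.times = []      # snapshot timestamps, ascending
            self.positions = []  # entity position at each snapshot

        def record(self, t, position):
            self.times.append(t)
            self.positions.append(position)

        def position_at(self, t):
            if not self.times:
                raise ValueError("no snapshots recorded")
            i = bisect.bisect_right(self.times, t) - 1
            return self.positions[max(i, 0)]

    def shot_hits(history, client_time, aim_point, tolerance=1.0):
        target = history.position_at(client_time)
        return abs(target - aim_point) <= tolerance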

--

Normally, my sig is an image.
