Kuro5hin.org: technology and culture, from the trenches

What Does 'df' Do?

By rusty in Site News
Sat Oct 12, 2002 at 10:53:15 AM EST
Tags: Kuro5hin.org (all tags)

As always, I go away and something bizarre and unusual happens. This time I went down to Massachusetts for a construction job, and the web servers' disks filled up. What fun.

What Happened?

Scoop has a mechanism by which stories can be cached on disk in a partially-completed state, which comes into play when anonymous visitors come and look at them. Basically, Scoop will grab the page off the disk, fill in some simple templating stuff, and pass it off to the client. It's much faster than generating the page on the fly, which it does for logged-in users due to the customization of user pages.
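A rough sketch of that read path, in Python rather than Scoop's Perl. Every name here (serve_story, render_story, the __USER__ placeholder, the cache file layout) is made up for illustration and is not taken from Scoop:

```python
import os


def render_story(story_id):
    # Stand-in for the expensive page build (hypothetical, not Scoop code).
    return "<h1>Story " + str(story_id) + "</h1><p>Hello, __USER__</p>"


def serve_story(story_id, cache_dir, logged_in_user=None):
    """Return page HTML, using the disk cache only for anonymous visitors."""
    if logged_in_user is not None:
        # Logged-in users get a page generated on the fly.
        return render_story(story_id).replace("__USER__", logged_in_user)

    path = os.path.join(cache_dir, "story-" + str(story_id) + ".html")
    if os.path.exists(path):
        with open(path) as fh:
            partial = fh.read()
    else:
        partial = render_story(story_id)
        # Every anonymous cache miss adds one more file to the disk.
        with open(path, "w") as fh:
            fh.write(partial)

    # Fill in the simple templating and hand the page to the client.
    return partial.replace("__USER__", "Anonymous")
```

The part to notice is the else branch: it writes a new file for every story an anonymous visitor asks for, and nothing in this sketch ever removes one.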

At the beginning of the month, Google's spiders came by and indexed the whole site. Google spiders are, of course, anonymous visitors, so every story they load is cached on disk. And it's Google, so that means every story. The web servers don't have enough disk space to cache every story, so the disks filled up and all kinds of things went to hell. Actually, the fact that the site has run for several days with no disk space at all is kind of impressive, even if it was running poorly.

The caches are now cleared out again, and the web servers are back to a reasonable 60-something percent disk usage. I'm sure something else will go wrong soon enough, but at least it won't be that.

Where was I?

My friend Bill is a CAD designer most of the time, but he does small construction jobs from time to time. He had a deck that needed building in Wareham, and I have rent that needs paying in November. So I went down to work for him for the last week. We didn't quite get the deck done, because we were rained out on Friday, but I'll go down next Thursday before the Boston K5 meet to finish it.

Why am I working for a living? K5 still has money in the bank, but I'm preparing to dissolve the company, and the tax situation is fairly up in the air right now. I think I have finally found an accountant who can do her job, with whom I'm meeting this Friday. Basically, I don't want to spend any of K5's money until I know for sure that all taxes are covered. And in addition to that, I'd really like the CMF to have a little money to start off with. So while I could conceivably draw another couple of months' salary from K5, I'm not going to. When the nonprofit is running, we'll see what the situation is then.

Meanwhile, I'm doing whatever work presents itself. If anyone needs a writer, or any home improvements in the New England area, please let me know.





What Does 'df' Do? | 70 comments (70 topical, editorial, 0 hidden)
Glad it's back (3.75 / 8) (#1)
by tzanger on Sat Oct 12, 2002 at 10:58:49 AM EST

But what the hell is that?!

ah, the geoduck (4.75 / 4) (#3)
by danimal on Sat Oct 12, 2002 at 11:13:42 AM EST

such a lovely muscle.
<tin> we got hosed, tommy
<toy> clapclapclap
<tin> we got hosed

[ Parent ]
Nothing much.. (none / 0) (#67)
by Inoshiro on Wed Oct 16, 2002 at 02:43:16 AM EST

Just one of those things I like to add to the system around the start of every April, or if a mood strikes me.

Sometimes it's even just someone else's mood.

[ イノシロ ]
[ Parent ]
Planning (4.44 / 9) (#2)
by theboz on Sat Oct 12, 2002 at 11:11:10 AM EST

This is meant as constructive criticism, not anything mean, so don't take it personally.

Anyway, I've seen a lot of other regular users complaining lately about the problems, and the stuff this past week has put some over the edge. Some of this has happened on K5, but I think more has happened elsewhere, where people didn't feel weird about bitching about the problems on the very site that was having them.

Anyway, a point that seems to be a common thread that I have noticed is planning. It's obvious that you can't run K5 24/7, nobody could do that. You do have other admins, but hurstdog yesterday didn't even know you were out of town. Perhaps it would be better to come up with some sort of schedule for k5 admins. Even a Yahoo Group would help you out by giving you a K5 admins mailing list, as well as a calendar for when you're going out of town or making upgrades. You also might want to think about finding admins in other time zones. If you knew of someone in the UK to help out as an admin, then you would have the really early morning in the U.S. covered, and have less time to worry about it. You could also put a volunteer support structure in place using something like this that would mean K5 doesn't depend on you anymore to do 99% of the work.

I don't know what you do now, but you could do something like that pretty easily and prevent something like the past week's downtime from happening and pissing off all the users.


The problem (4.66 / 3) (#9)
by rusty on Sat Oct 12, 2002 at 12:34:12 PM EST

How many people do you want to have root on the servers? I'm not real comfortable with many people having that. Right now it's basically me and hurstdog. Where do you find people who are completely trustworthy, highly knowledgeable about Scoop, perl, linux, and mysql, and will work reliably for free?

What I'm trying to do is get the CMF started, which will make those decisions about who works on K5. I don't particularly want the job of supervising a lot of volunteers who have root on the site, myself.

Not the real rusty
[ Parent ]

cmon rusty... (4.60 / 5) (#11)
by willie on Sat Oct 12, 2002 at 12:52:13 PM EST

If it is only you and hurstdog (what about Inoshiro?), then why don't you let him know when you're not available?

I mean, you have admitted that the server needs constant attention to clear the cache when Google comes (which seems pretty sloppy to me; at the least, couldn't a cron job delete the cache when this happens?), yet you leave without telling anyone who can fix any problems. hmm....

Cmon rusty, I think a bit of professionalism is needed, especially if you're going to get this whole CMF thing going.

[ Parent ]

I'm only available for about 6 hours a day too... (5.00 / 1) (#63)
by hurstdog on Tue Oct 15, 2002 at 06:34:58 PM EST

Because I work. I'll take time out of the day to post the odd k5 comment, but I won't stop working long enough to fix a problem with the cluster. I do that at home. That's why the last problem took around 12 hours to fix.

Inoshiro has been too busy lately with school and work to even read k5 much, afaik. Though I'm sure he'll be perusing this thread.

Rusty letting me know when he's not available will help, and I'd be sure to keep a better eye out for stuff like this. I should have noticed the disks were full when I logged on the day before they filled, but, lesson learned...

[ Parent ]
My most recent job is good. (5.00 / 1) (#66)
by Inoshiro on Wed Oct 16, 2002 at 02:38:55 AM EST

It lets me attend night school, and I'm free to fix K5 if need be. A few weeks back I rushed out of the office to fix the DNS issue, and I can always just SSH from work (since I work a desk job with a PC of my own now). I'd like to be back in the saddle a bit more so I can set up my latest batch of monitoring scripts on the servers, so they'll SMS me when something comes up.

Of course, we still need a bit better distributed communication. That's why I set up the mailing lists :) No one calls me or writes me when DNS goes boo-hoo-hoo, or when something else goes Horribly Awry(TM).

[ イノシロ ]
[ Parent ]
True (4.87 / 8) (#15)
by theboz on Sat Oct 12, 2002 at 01:26:30 PM EST

It is difficult to find people to give root access to, but I'm not necessarily convinced that you need to do that anyway. I'll get to that in a minute. The thing is that I'm pretty sure there are people around who probably would be willing to help a little bit. I'd probably guess some of the people who have done some scoop development, or you have met personally that you trust, etc. I am not nominating him because that's not my intention right now, but someone like theantix would be a decent example. He's helped other people that run scoop set up or fix their sites, he's submitted some code to scoop, and seems to know a little of everything. There are a few other people out there like that, I believe. It doesn't necessarily have to be someone from this site, but it does help if they are.

As far as "giving them root access" I don't think that's necessary. First, I would hope you're not running your webserver and DB as root. Also, I would look at organizing it like traditional system administration. You could have someone that may be a MySQL expert look at stuff on the DB, and not need root access. They just need permissions to the MySQL directories and configuration files. You could have another person that does have pseudo-root access, where they can restart the server, restart daemons, and have full access to the scoop directories and do various tasks like that, but can't add users, change passwords, etc. As far as scoop and perl experts go, I think you already have that pretty much covered with those interested in scoop development. You don't even need to give them access to the system, just have them work with whoever does the sysadmin stuff. In my last job, we did the development work and wrote installation scripts, then passed it off to the sysadmin who ran the scripts that made the changes and replaced the code.

Anyway, I understand that it is difficult to find good people that you can trust. I wouldn't recommend allowing random people off this site who express an interest in helping to have root access. However, I would think seriously about talking to those who do express an interest, doing a little research on their background, and talking to them on the phone, if not meeting them in person.

Besides, what's the worst that could happen? Someone who is either malicious or ignorant screws up something on the server, K5 doesn't work. This would at the very least put us in the same boat as we are when K5 goes down due to running out of HD space or DB problems, and at best it could solve problems with the site more smoothly.

Oh, and about the CMF. It may take a while before that organization comes to fruition and is able to figure out who all should run this site. It could be months, it could be over a year. I think you're taking a bigger risk of a mass exodus from K5 by waiting. That is a matter of opinion, of course, but there are plenty of signs that people are moving on to other sites because of some of the issues plaguing K5. When another site that I visit was down for a few hours yesterday, another person in that community stated that it was "pulling a K5." It's difficult to get a good reputation, and easy to get a bad one. All of your accomplishments and the cool things about K5 will be forgotten by people who think the site is too unresponsive to be worthwhile. People have short memories to begin with, and the internet seems to speed that up. If the CMF functions slowly like a traditional nonprofit organization, it will be too late to be effective. I'm not trying to bitch or annoy you, just pointing out that it could happen.

By the way, I would volunteer to help some, but I get the feeling that you would most likely go bald and have no fingernails left from worrying if you gave me root access to the site. However, if you post a list here of stuff that needs to be done for K5, the CMF, etc. and there's something I can do on it, I'll be glad to help out. As weird as it seems to me, I guess I'm not as frustrated as some of the other people I know with everything going on, and would like to see the site be successful again (e.g. the speed problems resolved, and less politics.)

[ Parent ]

re: Trusting people online (4.00 / 2) (#22)
by Mysidia on Sat Oct 12, 2002 at 03:30:49 PM EST

It's really hard to find people to trust on the internet. Often the best you can do is (IMO):

  • Have their personal and contact information, and go to some lengths to verify the continued accuracy of that information.
  • Make it so they have to trust you in some way, too -- mutual trust is stronger than one-way trust.
  • Make them acquaintances, or at least friends at some level, and know who they are. People who know each other well are likely to be more considerate of and trustworthy toward each other.
  • Define the parameters of trust clearly -- ie: let them know what they should and shouldn't be doing: perhaps, have a formal agreement.
  • Make it so they have something substantial to lose that they care about [other than loss of trust] if they become treacherous/turn untrustworthy. (Analogous to an apartment room deposit or penalties spelled out in contracts for pulling out)
  • Have accountability standards, monitor their activities in using any particularly sensitive facility via logging/other monitoring techniques
  • Make sure it isn't possible for them to tamper with the monitoring facilities and blame a hardware or other problem: (For example: distribute monitoring, logging over multiple machines, some of which can only be accessed by you).
  • Let them know their activities are monitored: many people tend to be more trustworthy when they think that they are being watched -- just like knowing you have an alarm system on tends to deter burglars.
  • Don't trust people too much. If you give someone over the internet root access, make sure you have backups of everything that they aren't able to touch (You wouldn't find a stranger off the street, get to 'know' them briefly, and then hand them $1 mil cash, and ask them to buy a house for you, would you?).

Most people actually tend to be trustworthy: in terms of computers, I find that it really tends to be the children (people mentally under 18-20 years of age), and the wannabe hackers, the script kiddies/h4x0rs, who you mostly don't want to trust, and provided you know who they are, there's always the possibility of calling the police if you've found they rooted and deliberately wiped your system, requiring it be recovered from backup, and make sure they know that you would, of course :)

-Mysidia the insane @k5
[ Parent ]
root? (2.66 / 6) (#19)
by dipierro on Sat Oct 12, 2002 at 02:53:01 PM EST

>How many people do you want to have root on the servers?

No one.  Can't Scoop run without root access?

Anyway, the simplest solution to the volunteer problem is to allow others to set up mirror sites.  Decentralized management tends to be the most efficient.

[ Parent ]

regardless there's sudo (4.33 / 3) (#23)
by danimal on Sat Oct 12, 2002 at 03:31:52 PM EST

setting up sudo is fast and easy and could provide a simple solution to the problem of giving someone root access.
<tin> we got hosed, tommy
<toy> clapclapclap
<tin> we got hosed

[ Parent ]
sudo wouldn't work here (5.00 / 1) (#50)
by novas007 on Mon Oct 14, 2002 at 04:41:33 PM EST

system/scoop administration is a full-shell job. Remember, run 'sudo [bash|sh|zsh]' and you have a root shell, so those have to be excluded. You cannot run a shell script (you can edit the shebang line and change it to something to give you a shell). Perl can elevate your privileges, so you cannot run perl scripts either. Sudo is really only designed for single commands, if that. It's not a real solution. I've yet to see sudo put into the hands of someone who the admin didn't already trust, because it's either so easy to get a real root shell or you have to disable so much that it's worthless anyway.

[ Parent ]
yes (5.00 / 1) (#58)
by danimal on Mon Oct 14, 2002 at 07:11:41 PM EST

all great points. i wouldn't expect the shell to be allowed under sudo. at any rate, it also provides logging. as far as trust goes, i wouldn't let anyone i didn't trust onto my servers at any rate, and that's what rusty is dealing with. I could help as I have the requisite knowledge, but it's the fact that i don't have the trust levels with rusty and gang at this point that keeps that from happening (of course i don't really feel like investing that time right now, but hey, maybe someone will).
<tin> we got hosed, tommy
<toy> clapclapclap
<tin> we got hosed

[ Parent ]
Mirror sites (5.00 / 1) (#25)
by Mysidia on Sat Oct 12, 2002 at 03:51:00 PM EST

It's an interesting idea, but it's not trivial. It has many complications of its own.

How exactly would these mirrors work?

You could mirror story content, even comments possibly, but you couldn't really mirror much else (IE: it wouldn't be easy to have the ability to post articles, comments on a mirror)

For things like comment display controls, there would be other issues... (It would seem like you'd need changes to scoop in order to have it act as a mirror)

-Mysidia the insane @k5+SN
[ Parent ]
mirrors (1.00 / 1) (#31)
by dipierro on Sat Oct 12, 2002 at 05:36:06 PM EST

How exactly would these mirrors work?

Give me read access to the database and I'll show you.

You could mirror story content, even comments possibly, but you couldn't really mirror much else (IE: it wouldn't be easy to have the ability to post articles, comments on a mirror)

Why not?  Usenet does it.

For things like comment display controls, there would be other issues...

Display controls could be done on a mirror by mirror basis.

(It would seem like you'd need changes to scoop in order to have it act as a mirror)

Yeah, you'd need to allow read access to (most of) the database.

[ Parent ]
usenet has no authentication (4.33 / 3) (#33)
by fluffy grue on Sat Oct 12, 2002 at 06:20:32 PM EST

It's way too easy to spoof posts, and way too hard to get rid of abusers. Also, say goodbye to comment/story moderation.
"Is a sentence fragment" is a sentence fragment.
"Is not a quine" is not a quine.

[ Hug Your Trikuare ]
[ Parent ]

Many solutions... (2.00 / 2) (#38)
by dipierro on Sat Oct 12, 2002 at 08:02:54 PM EST

PGP, make the usernames user@mirror, or just plunk all the mirror posts under a single username. Moderation would take a little bit more work, but there's no reason it all has to be finished immediately.

[ Parent ]
Good ideas... (5.00 / 1) (#39)
by fluffy grue on Sat Oct 12, 2002 at 09:41:46 PM EST

but I don't know how feasible it'd be to work into Scoop's existing infrastructure. Also, the notion of using comment IDs for the structuring of the data goes out the window, unless you do some sort of centralized "comment handle" allocation or lots of crappy CID juggling or whatever.

Hm. Unless IDs were also on a per-site basis... like, your comment's CID would be 38.k5 or whatever.

I wonder how feasible it would be to integrate a distributed mechanism into the existing K5 infrastructure... yeah, really, just adding the site identifier into all IDs would probably work rather well. Then the only collisions which would have to be handled are user registrations, and there's no reason that can't just remain centralized (and just add a thing saying that it may take $foo time for an account to become active throughout the whole network). Hell, adding a waiting period before an account can be used would probably help squish what little trolling/crapflooding/harassment trouble there is here anyway.
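
To make the "38.k5" idea concrete, here is a minimal sketch in Python (not Scoop's Perl). The CommentId and Mirror names are invented for illustration, and it assumes each mirror hands out its own local counter with no coordination between sites:

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class CommentId:
    """A comment ID qualified by the site that issued it, e.g. 38.k5."""
    site: str
    local: int

    def __str__(self):
        return str(self.local) + "." + self.site

    @classmethod
    def parse(cls, text):
        # Split on the first dot only, so site names may contain dots.
        local, site = text.split(".", 1)
        return cls(site=site, local=int(local))


class Mirror:
    """Each mirror numbers its own comments independently."""

    def __init__(self, site):
        self.site = site
        self._counter = itertools.count(1)

    def new_comment_id(self):
        return CommentId(self.site, next(self._counter))
```

Two mirrors can both issue local ID 1 without colliding, because the pair (site, local) is what identifies a comment. User registration is left out on purpose, since that is the part suggested to stay centralized.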
"Is a sentence fragment" is a sentence fragment.
"Is not a quine" is not a quine.

[ Hug Your Trikuare ]
[ Parent ]

Hmmmm.... (5.00 / 2) (#35)
by dasunt on Sat Oct 12, 2002 at 06:44:50 PM EST

What type of monitoring do you have on Kuro5hin's servers?

There are many, many programs out there that will monitor a machine and test to see if processes are running.

Combine that with another server that will occasionally query kuro5hin to see if its up, and you're set. Add a way for it to dial a pager when everything goes down.

Voilà! A way to be alerted when kuro5hin goes down, day or night.

Now, the question is, do you really want to do it?
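
For what it's worth, the polling half of this fits in a few lines. The sketch below is Python using only the standard library; site_is_up and DownDetector are hypothetical names, the three-failure threshold is an arbitrary assumption, and the pager/SMS call is deliberately left out because it depends entirely on the provider:

```python
import urllib.request


def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and urllib's URLError/HTTPError.
        return False


class DownDetector:
    """Fires once after N consecutive failed probes, to avoid paging on blips."""

    def __init__(self, failures_before_alert=3):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, is_up):
        """Feed one probe result; return True exactly when an alert should fire."""
        if is_up:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failures_before_alert
```

Run from cron on a machine outside the cluster, each run would call site_is_up, feed the result to a detector whose counter is kept somewhere persistent, and page when record returns True.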

[ Parent ]
Can I have root access? (3.66 / 3) (#41)
by Stick on Sun Oct 13, 2002 at 02:34:52 AM EST

I just want to poke around a bit.

Stick, thine posts bring light to mine eyes, tingles to my loins. Yea, each moment I sit, my monitor before me, waiting, yearning, needing your prose to make the moment complete. - Joh3n
[ Parent ]
value lies in automation (5.00 / 2) (#42)
by martingale on Sun Oct 13, 2002 at 04:09:42 AM EST

I think you're right to be worried about turning k5 into a people administration problem. IMHO the ideal to aim for is a fully automated system. While there will always be a need for at least one system administrator, if you keep the mindset that the system should be self supporting, you'll be better off in the long run and Scoop will be more useful to third parties without extensive administration experience. Conversely, if you think in terms of people maintaining the system, you'll come up with people management solutions to the problems, and that's much less portable to other organisations who want to eventually use Scoop.

About the Googlebot, since it reads robots.txt, why not limit the indexed stories to the last few months only? I don't know what the cached urls look like, but it's probably easy to block really old urls. Set up a cron job to update the contents of robots.txt every month, and whenever Googlebot comes knocking it'll only ask for a limited number of stories.
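
A sketch of what that monthly cron job could generate, in Python. It assumes story URLs start with /story/YEAR/MONTH/ with an unpadded month, which is a guess about the URL layout; robots_txt and its parameters are hypothetical names:

```python
import datetime


def robots_txt(today, first_month, keep_months=3, prefix="/story"):
    """Build a robots.txt that hides stories older than the last few months.

    today       -- a datetime.date
    first_month -- (year, month) of the oldest stories on the site
    keep_months -- how many recent months stay crawlable, counting this one
    """
    lines = ["User-agent: Googlebot"]
    # Index of the newest month that should be disallowed.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    year, month = first_month
    while year * 12 + (month - 1) <= cutoff:
        # The trailing slash keeps /2002/1/ from also matching /2002/10/.
        lines.append("Disallow: %s/%d/%d/" % (prefix, year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return "\n".join(lines) + "\n"
```

The list of Disallow lines grows by one each month, so for a site with years of archives it may be tidier to disallow whole years once they are fully out of the window.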

[ Parent ]

Easy to block old pages... (5.00 / 1) (#52)
by vectro on Mon Oct 14, 2002 at 04:58:52 PM EST

But why? Being able to find old articles on Google is a strength that should be maintained. The right solution is a reasonable cache replacement policy.

“The problem with that definition is just that it's bullshit.” -- localroger
[ Parent ]
Writing (3.66 / 3) (#4)
by nevertheless on Sat Oct 12, 2002 at 11:29:33 AM EST

Can (will) you do tech writing? (Manuals, that sort of thing?)

This whole "being at work" thing just isn't doing it for me. -- Phil the Canuck

Hi Rusty (3.50 / 4) (#5)
by perdida on Sat Oct 12, 2002 at 12:04:38 PM EST

I think you're wonderful. I really need you to answer the email from me though if you can. If you don't want to, that's ok, just let me know.

The most adequate archive on the Internet.
I can't shit a hydrogen fuel cell car. -eeee

Just out of curiosity (5.00 / 5) (#17)
by DesiredUsername on Sat Oct 12, 2002 at 01:59:30 PM EST

Does anybody ever respond to any of your emails or do you have to resort to posting "STREETLAWYER/RUSTY READ THIS" on K5 for every single one?

Play 囲碁
[ Parent ]
Ironically, (5.00 / 1) (#37)
by perdida on Sat Oct 12, 2002 at 07:53:30 PM EST

the very busy and email-swamped are more likely to read k5 than their email..

The most adequate archive on the Internet.
I can't shit a hydrogen fuel cell car. -eeee
[ Parent ]

Have you changed the code in Scoop? (4.40 / 5) (#6)
by greggish on Sat Oct 12, 2002 at 12:08:33 PM EST

You said...

"The caches are now cleared out again, and web servers back to a reasonable 60-something percent disk usage. I'm sure something else will go wrong soon enough, but at least it won't be that."

What have you done to make sure that this won't happen again?  Did you change the code in Scoop that caused this caching problem?  If so, will this code change be available in the Scoop CVS soon.  I'm asking because I'm in the process of setting up a Scoop site and would very much like to avoid this same problem.  Thanks.

Nope (4.00 / 2) (#8)
by rusty on Sat Oct 12, 2002 at 12:28:32 PM EST

You can turn off the disk caching if you don't want to use it. I just deleted the cache files, which will be slowly regenerated over time as actual users view stories (this is a good thing, and what is supposed to happen). I will just have to remember next time Google starts sucking down every story to keep an eye on the cache directories.

Not the real rusty
[ Parent ]
Sounds error prone (5.00 / 2) (#12)
by Carnage4Life on Sat Oct 12, 2002 at 12:59:59 PM EST

I will just have to remember next time Google starts sucking down every story to keep an eye on the cache directories.

Not having an automated way to do this seems to guarantee that this will happen again. I'd also rather you were out living your life than constantly keeping an eye on referrer logs, cache files and disk space.

[ Parent ]
Exactly (none / 0) (#26)
by tzanger on Sat Oct 12, 2002 at 04:09:04 PM EST

Why not have a partition for the cache images so that when it fills it doesn't take the system down? Or have a quota for the user that the cache-creating process runs as? Or simply say "if > x pages/s are cached, halt new cache creation for y minutes"? That could be tunable automatically based on a running average of how many pages are served up on a normal basis.

Or tweak that last one a bit: limit the number of old stories that caches are created for in the same kind of time period?

I'm not complaining, Rusty, I can do other things than watch K5 all day, but these are just a few suggestions.

[ Parent ]
what about a cron job? (4.50 / 2) (#13)
by el_guapo on Sat Oct 12, 2002 at 01:14:57 PM EST

have cron launch a little script every hour or twice a day that checks disk utilization and if it's above a threshold go and delete those files. pretty simple, no?
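
Something like the following would do it. This is a Python sketch with made-up names (percent_used, clear_cache_if_full) and an assumed 90% threshold; it assumes the cache is a flat directory of files that are safe to delete at any time, since they get regenerated on demand:

```python
import os
import shutil


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def clear_cache_if_full(cache_dir, threshold=90.0, used=None):
    """Delete every file in cache_dir if disk usage is at or above threshold.

    `used` lets a caller (or a test) supply the usage figure directly;
    by default it is measured with percent_used().
    Returns the number of files removed.
    """
    if used is None:
        used = percent_used(cache_dir)
    if used < threshold:
        return 0
    # Collect the paths first so we are not deleting while scanning.
    files = [entry.path for entry in os.scandir(cache_dir)
             if entry.is_file(follow_symlinks=False)]
    for path in files:
        os.remove(path)
    return len(files)
```

It is blunt: once the threshold is hit the whole cache goes, hot pages included. That is still better than a full disk.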
mas cerveza, por favor mirrors, manifestos, etc.
[ Parent ]
hmmm this? (none / 0) (#29)
by rohrbach on Sat Oct 12, 2002 at 04:51:14 PM EST

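# Delete cached pages not modified in more than 12 days.
# -type f skips the directory itself; -print0 / -0 cope with odd filenames.
/usr/bin/find /var/pagecache -type f -mtime +12 -print0 \
  | /usr/bin/xargs -0 /bin/rm -f
# crontab entry: 33 */12 * * * /path/to/script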

Give a tool to a fool, and it might become a weapon.
[ Parent ]
You don't need xargs (3.50 / 2) (#32)
by fluffy grue on Sat Oct 12, 2002 at 06:18:21 PM EST

/usr/bin/find /var/pagecache -type f -mtime +12 -exec rm \{\} \;
"Is a sentence fragment" is a sentence fragment.
"Is not a quine" is not a quine.

[ Hug Your Trikuare ]
[ Parent ]

Not as efficient as xargs... (5.00 / 1) (#40)
by kcbrown on Sun Oct 13, 2002 at 12:54:44 AM EST

The -exec option to find will cause the command in question to be executed once for each file found, while xargs will execute the command in question only after it's gathered the maximum number of arguments it can pass or has hit the end of the list.

[ Parent ]
eviction? (5.00 / 4) (#16)
by jacob on Sat Oct 12, 2002 at 01:45:57 PM EST

Are you saying your caching mechanism doesn't have any kind of automatic eviction policy? Well, there's yer problem. A simple "evict the least-recently-accessed cache object if the cache is too full" mechanism would likely do the trick, or you could get more fancy with statistics on what the most popular stories have been in the past N number of days. Random eviction policies don't fare too poorly either. A dozen extra lines of scoop:

# ... we've decided to add a page to the cache
evict(currCacheSize() + $newCacheObjectSize - $maxCacheSize);

# evict : Num -> void
# ensures that there are at least the given
# number of bytes available in the page cache.
# Deletes least-recently-used pages to make space.
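sub evict {
  my $bytesToEvict = shift;
  while ($bytesToEvict > 0) {
    # deleteLRUCacheObject() is assumed to return the number of bytes it freed
    my $freed = deleteLRUCacheObject();
    last unless $freed;   # cache is empty, nothing more to evict
    $bytesToEvict -= $freed;
  }
}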

"it's not rocket science" right right insofar as rocket science is boring


[ Parent ]

Is that really appropriate? (none / 0) (#49)
by awgsilyari on Mon Oct 14, 2002 at 04:01:45 PM EST

Since the Google hits are going to be the most recent hits, it seems like this algorithm will just cause the older user-viewed pages to get dumped. In other words, the Googlebot will cause a progressive cache flush while polluting the cache with unimportant pages.

It seems like Google hits should just be handled as normal, except that any dynamic content that gets generated shouldn't be cached at all. If Google hits a cached page, then fine, serve it out of cache. But if a dynamic page needs to be generated by a search engine request, just serve it once without caching it.

What do you think?

Please direct SPAM to john@neuralnw.com
[ Parent ]

There are better cacheing schemas. (none / 0) (#54)
by vectro on Mon Oct 14, 2002 at 05:07:45 PM EST

But LRU would probably work. It just means that when Google comes there will be a performance hit as the cache is churned.

If that turns out to be insufficient, though, then an LFUDA (Least Frequently Used with Dynamic Aging) policy would do the job quite well. You might also want a policy that gives priority to hit count rather than bytes served, since the miss penalty has more to do (I imagine) with regenerating the page than with issues associated directly with the size. One example is GDSF (Greedy-Dual Size Frequency), which favors keeping smaller, popular items in the cache. These ideas can of course be combined.

Overall, however, I think that LRU should do the job -- or even FIFO.
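
For anyone who hasn't met LFUDA, here is a toy version in Python. It is a sketch of the general idea (priority = hit count + a cache "age" that rises on every eviction), not any particular proxy's implementation; LFUDACache is a made-up name, capacity is counted in pages rather than bytes, and it must be at least 1:

```python
class LFUDACache:
    """Toy LFUDA: evict the lowest priority, where priority = hits + age."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum number of cached pages (>= 1)
        self.age = 0               # rises to the priority of each evicted page
        self._entries = {}         # key -> [priority, hits, page]

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        entry[1] += 1
        # Re-base the priority on the current age so recent hits count more.
        entry[0] = entry[1] + self.age
        return entry[2]

    def put(self, key, page):
        if key in self._entries:
            self._entries[key][2] = page
            return
        if len(self._entries) >= self.capacity:
            victim = min(self._entries, key=lambda k: self._entries[k][0])
            self.age = self._entries[victim][0]
            del self._entries[victim]
        # New pages start one hit above the current age.
        self._entries[key] = [1 + self.age, 1, page]
```

The aging term is what stops a page that was popular last month from squatting in the cache forever: new pages start at the current age, so an old favourite that nobody hits any more eventually becomes the lowest-priority entry.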

“The problem with that definition is just that it's bullshit.” -- localroger
[ Parent ]

well (none / 0) (#61)
by jacob on Tue Oct 15, 2002 at 01:19:09 PM EST

In the case of google it'd have bad behavior, but any cache eviction policy is better than no cache eviction policy when the result of a cache overflow is your computer crashing. LRU is easy to implement,  and the difference between it and the optimal cache policy in terms of overall site performance probably isn't big enough to really worry about after that.

"it's not rocket science" right right insofar as rocket science is boring


[ Parent ]
Selective Caching (5.00 / 3) (#21)
by Captain Derivative on Sat Oct 12, 2002 at 03:09:03 PM EST

It seems to me that a somewhat better solution would be to not cache stories when they are requested by the Googlebot.  The Googlebot uses a distinctive User-Agent header, so it shouldn't be hard to identify when it is the one requesting a page.  When the Googlebot requests a page, don't cache the story to disk.  (Of course, if it's already been cached, go ahead and use that.)  That way getting crawled by Google won't fill up your disks, without turning off caching altogether.

One caveat is that this will only detect the Googlebot and not other search engine crawlers, many of which aren't as well-behaved as the Googlebot as far as identifying themselves goes.  Only caching stories if the User-Agent header is from an actual web browser (as opposed to a crawler) could work better in the general case; it shouldn't be hard to pattern-match the User-Agent headers from the major browsers (IE, Netscape, Mozilla, Opera, Konqueror, etc.), and obscure browsers shouldn't be generating enough traffic to cause much concern.
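
In rough Python terms the check might look like this. The token lists are assumptions for illustration, not a complete catalogue of user agents, and should_cache_for is a hypothetical hook that the caching code would call before writing a new cache file:

```python
# Substrings that mark a known crawler; checked first, because some
# crawlers also include "Mozilla" in their User-Agent string.
CRAWLER_TOKENS = ("googlebot",)

# Substrings that mark a mainstream browser. IE and Netscape both
# identify themselves with a leading "Mozilla/".
BROWSER_TOKENS = ("mozilla", "opera", "konqueror")


def should_cache_for(user_agent):
    """Return True if a page generated for this User-Agent should be cached."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return False
    return any(token in ua for token in BROWSER_TOKENS)
```

Pages already in the cache would still be served to everyone; this only decides whether a miss is allowed to create a new entry.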

Hey! Why aren't you all dead yet?! Oh, that's right, it's only Tuesday. -- Zorak

[ Parent ]
Not if they follow robots.txt (none / 0) (#24)
by ShadowNode on Sat Oct 12, 2002 at 03:43:35 PM EST

As kuro5hin seems to block everything but Google and a couple others that don't look familiar.

[ Parent ]
Well-Behavedness (4.00 / 1) (#30)
by Captain Derivative on Sat Oct 12, 2002 at 05:26:48 PM EST

According to the robots.txt file, all but four crawlers are banned from indexing the site.  Of course, that doesn't stop rogue robots from ignoring the robots file and crawling the entire site anyway.  But since that doesn't seem to be a problem, there's no need to take additional steps to prevent it from happening.

Even still, Googlebot is allowed access and managed to cause the cache to fill the hard drive.  If you prevented caching when one of the Allowed Four Robots is requesting the story, you should make it much more unlikely to have the disk fill up with cached stories again.  The logic is that, if it's just a crawler accessing the story, it's not particularly likely someone else is going to access that story in the near future.  If it's a human, it's much more likely someone else will request it too.

Of course, as someone else suggested, it may also be a good idea to keep the cache files on a separate partition of the hard drive, so that a filled cache doesn't hurt other parts of the system.

Hey! Why aren't you all dead yet?! Oh, that's right, it's only Tuesday. -- Zorak

[ Parent ]
Better Yet... (none / 0) (#46)
by ph317 on Sun Oct 13, 2002 at 11:41:59 PM EST

Why not just put a rate limiter on the caching code?  In the part of the code that creates new cache entries when necessary, track the rate and limit it to some arbitrary value (say 20 new cache entries per minute or something), and if the rate is exceeded, no new cache entries get created until the rate goes below some low water mark (say half the trigger rate).  Keep updating the rate even when you're in the no-new-entries mode.  If you know the average size of a cache entry, the available disk space for caching, and the expiry rate of the cache, you should be able to come up with a rate limit that fairly well ensures you won't fill the disk.
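
A rough sketch of the limiter, in Python for brevity. The class name `CacheWriteLimiter` and its parameters are hypothetical, and it rests on one assumption: "the rate" means requests that would need a new cache entry, counted over a sliding window, whether or not the entry actually gets written. That is what keeps the rate meaningful while writes are switched off.

```python
from collections import deque
import time


class CacheWriteLimiter:
    """Gate new cache entries with a high and a low water mark."""

    def __init__(self, high=20, low=10, window=60.0, clock=time.monotonic):
        self.high = high        # trigger: more than this many per window
        self.low = low          # resume: fewer than this many per window
        self.window = window    # window length in seconds
        self.clock = clock      # injectable so it can be tested
        self.events = deque()   # timestamps of recent cache misses
        self.blocked = False

    def allow(self):
        """Record one cache miss and say whether it may be written to disk."""
        now = self.clock()
        self.events.append(now)
        # Drop misses that have fallen out of the window.
        while now - self.events[0] >= self.window:
            self.events.popleft()
        rate = len(self.events)
        if self.blocked:
            if rate < self.low:
                self.blocked = False
        elif rate > self.high:
            self.blocked = True
        return not self.blocked
```

On picking the numbers: in steady state the cache holds roughly (creation rate x entry lifetime) entries, so the trigger rate wants to stay under (space available) / (average entry size x entry lifetime).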

Or alternatively, and perhaps even easier, limit the cache to X% of the disk space and have it overwrite older entries with newer ones, least recently used first.  All it takes is one two-field table holding the cache entry ID and the date it was last accessed by a client, updated on each cache serve.  Index on the date to make it easy to find the oldest one to replace.
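
A sketch of the bookkeeping, again in Python. The ordered dict is an in-memory stand-in for the two-field table (its ordering plays the part of the indexed date column), and the name `LRUCacheIndex` is hypothetical. The caller is assumed to delete the files for whatever IDs `add` returns.

```python
from collections import OrderedDict


class LRUCacheIndex:
    """Track cache entries by size and evict the least recently used."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0
        self.entries = OrderedDict()  # entry id -> size, oldest first

    def touch(self, entry_id):
        """Record a cache serve. Returns False if the entry is not cached."""
        if entry_id not in self.entries:
            return False
        self.entries.move_to_end(entry_id)  # now the most recently used
        return True

    def add(self, entry_id, size):
        """Register a new entry and return the ids evicted to make room."""
        if entry_id in self.entries:
            self.total -= self.entries.pop(entry_id)
        self.entries[entry_id] = size
        self.total += size
        evicted = []
        # Never evict the entry just added, even if it alone is over budget.
        while self.total > self.max_bytes and len(self.entries) > 1:
            old_id, old_size = self.entries.popitem(last=False)
            self.total -= old_size
            evicted.append(old_id)
        return evicted
```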

[ Parent ]

Couldn't you write a hook for this? (none / 0) (#36)
by dram on Sat Oct 12, 2002 at 07:22:30 PM EST

I don't know the full power of hooks, but would it be possible to make one that deleted the cache when disk usage got above 90% or something? Or maybe write a cron for it? I don't know what the best/easiest way would be, but maybe it's something that should be looked into.
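
Something along these lines could be run from cron. It is a sketch in Python, not a Scoop hook: the function names are hypothetical, it assumes the cache is a flat directory of files, and the percentage it computes can differ slightly from what df reports because of reserved blocks.

```python
import os
import shutil


def disk_percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def purge_cache_if_full(cache_dir, threshold=90.0, percent_used=disk_percent_used):
    """Delete the files directly inside cache_dir once usage hits threshold.

    Returns the number of files removed. Subdirectories are left alone.
    """
    if percent_used(cache_dir) < threshold:
        return 0
    with os.scandir(cache_dir) as it:
        entries = list(it)  # snapshot first, then delete
    removed = 0
    for entry in entries:
        if entry.is_file(follow_symlinks=False):
            os.remove(entry.path)
            removed += 1
    return removed
```

Throwing away the whole cache is blunt, since the popular stories get regenerated right away, but it does keep the disk from filling.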


[ Parent ]
ABUSE OF POWER (1.04 / 42) (#7)
by Hired Goons on Sat Oct 12, 2002 at 12:17:37 PM EST

I see this wasn't voted on.

So rusty is above the story queue? Looks like democracy on K5 is dead.
You calling that feature a bug? THWAK

Cool (3.66 / 3) (#10)
by TheophileEscargot on Sat Oct 12, 2002 at 12:37:28 PM EST

Good to have K5 back again.

So I guess if you want to be a good K5er, at busy times you should read it logged-out. Might get you the pages faster too...
Support the nascent Mad Open Science movement... when we talk about "hundreds of eyeballs," we really mean it. Lagged2Death

monitoring (5.00 / 2) (#14)
by johnnyfever on Sat Oct 12, 2002 at 01:24:29 PM EST

What about some monitoring software, so the disks can't fill up without someone knowing? If our production systems get much over 90%, the appropriate people are alerted immediately via SMS, pager, or email. We use Big Brother, but there are plenty of options available.

Why weren't you automatically alerted? (4.00 / 4) (#18)
by strlen on Sat Oct 12, 2002 at 02:25:22 PM EST

$_ = `df /usr`;
if (/100%/) {
        system("echo Disk full | sendmail rustyscell\@rustyPhoneCompany");
}

Was there anything like that present in the crontab? There are also tools like NetSaint which will do the above, and a great deal more, automatically. Are any of these being run on K5's server?

[T]he strongest man in the world is he who stands most alone. - Henrik Ibsen.

gar.. formatting (5.00 / 1) (#34)
by strlen on Sat Oct 12, 2002 at 06:32:14 PM EST

here's the updated, also smaller version, and with proper formatting:

$_ = `df /usr`;
if (/100\%/) {
 system("/usr/bin/printf \"Disk full \\n. \\n\"|sendmail address_for_SMS");
}

[T]he strongest man in the world is he who stands most alone. - Henrik Ibsen.
[ Parent ]

Best make that 90% (5.00 / 1) (#43)
by Scrymarch on Sun Oct 13, 2002 at 07:25:27 AM EST

After all, you want to be warned when things are about to fall over, not just when they do.

[ Parent ]
Ouch (5.00 / 1) (#48)
by bafungu on Mon Oct 14, 2002 at 12:30:20 PM EST

It pains me to see people resorting to Perl where a plain old shell script is more than adequate. The following will email the sysadmin when /usr reaches 90% full:

df /usr | grep -Eq ' (9[0-9]|100)%' && mail -s 'Disk full!' sysadmin < /dev/null
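
Matching the percent sign with a pattern only works for one fixed band of values. If the threshold needs to be adjustable, the number can be compared instead. A sketch in Python, assuming the POSIX `df -P` output format (one line per filesystem); the function names are hypothetical.

```python
import subprocess


def parse_use_percent(df_output):
    """Pull the capacity figure out of `df -P <path>` output as an int."""
    lines = df_output.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("expected a header line and a data line")
    # Look for the field shaped like "93%" rather than trusting a column
    # position, in case the device name contains spaces.
    for field in lines[-1].split():
        if field.endswith("%") and field[:-1].isdigit():
            return int(field[:-1])
    raise ValueError("no percentage column found")


def disk_nearly_full(path="/usr", threshold=90):
    """True when the filesystem holding `path` is at or above threshold."""
    result = subprocess.run(
        ["df", "-P", path], capture_output=True, text=True, check=True
    )
    return parse_use_percent(result.stdout) >= threshold
```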

[ Parent ]

Rusty's a carpenter... (4.50 / 2) (#20)
by imrdkl on Sat Oct 12, 2002 at 03:04:29 PM EST

Hey, I didn't know you could actually do something useful. Not only that, but other famous men were carpenters. Back before I got a real job, I did electrician work, sheetrocking, and even some insulator work, until the itching drove me out of that trade.

Feh, the downtime didn't bother me (much), but wondering about the legalities and paperwork you have in front of you makes me retch, a little.

So, you've got my email, if there's anything I can do to help. I took a hard look at non-profits about 15 years ago, when I was writing a couple of business plans, but I don't have much American reference material here.

I've still got my toolbelt, too.

Useful? Me? (4.50 / 2) (#27)
by rusty on Sat Oct 12, 2002 at 04:26:35 PM EST

Hey, I didn't know you could actually do something useful.

It's not obvious is it? :-) I'm a kind-of carpenter, to be honest. I'm perfectly capable of wielding the tools, and building small stuff, but Bill's the guy who knows what all the building codes are and how to not do something really stupid (most of the time). Between the two of us, we do some good work.

I'm ruined here because I don't have a shop. The back wheel fell off my scooter today (I shit you not), and now I'm contemplating banging some kind of shop together because there's no way I can fix it without some sort of covered space with a bench, especially with winter coming on.

Not the real rusty
[ Parent ]

A home shop (none / 0) (#28)
by imrdkl on Sat Oct 12, 2002 at 04:45:55 PM EST

could generate some income, too. You could make cute wooden toys and whatnot, and perhaps even sell them on K5, albeit with some limited whining about it.

I'd suggest a workmate, but they're too hard on the back. Better to build your bench to your own spec, then build the shed around it.

If you could reduce the fixes you most often apply to sudo scripts, then you might not need to be so worried about handing out some monitoring duties, btw. Not everything needs root.

[ Parent ]

hello!? (5.00 / 1) (#59)
by levsen on Mon Oct 14, 2002 at 07:24:51 PM EST

I thought we all tossed in some money in order to have some guy run the web site as a full time job, and then the thing breaks because he has to go babysitting or whatever.

I also thought the money drive was there to demonstrate that the relation between the site admins and its users can be taken from a hobbyist-take-it-or-leave-it to a professional, commercial level. Profit or not, it's commercial and professional.

If the money is already gone at this point, then the experiment failed; no problem, on to other things. But the way rusty puts it, he was too noble to touch the money. Hello, it's already proven that the site doesn't run if you don't get money. We WANT you to take it and DO something with it to prove that it makes a difference.

Please clarify this before my subscription runs out in 5 days.

This comment is printed on 100% recycled electrons.
[ Parent ]

He's being frugal. (5.00 / 1) (#65)
by Inoshiro on Wed Oct 16, 2002 at 01:56:14 AM EST

Maybe a bit too frugal, but I applaud it. If there's anything we at K5 (the people behind it, myself, Rusty, Hurstdog, etc) have learned, it's that we have to be frugal. Money doesn't just appear out of nowhere, unfortunately. Subscriptions are just making ends meet right now.

[ イノシロ ]
[ Parent ]
There's the problem (5.00 / 1) (#69)
by theboz on Thu Oct 17, 2002 at 04:38:42 PM EST

Subscriptions are just making ends meet right now.

The problem is that as more and more subscriptions expire that were started with the fundraiser, they will not be renewed due to the technical problems of the site. Sometimes you have no choice but to spend money. I think right now may be one of those times, or at least Rusty could be spending time working on the problem. I think that's what the person you replied to was complaining about; he specifically gave money so K5 would be maintained. Instead, Rusty spent a week building his friend's porch. There's a lot more to this, but I think that the money is really a side issue. Instead of using the money raised to maintain K5, Rusty is off doing menial labor so he can scrape up some money. He suffers, the users suffer, and K5's future suffers.

[ Parent ]

Jesus was a carpenter... (none / 0) (#53)
by vectro on Mon Oct 14, 2002 at 05:01:40 PM EST

Though whether or not his work was useful in the long run is up for debate.

“The problem with that definition is just that it's bullshit.” -- localroger
[ Parent ]
So was Harrison Ford (5.00 / 1) (#55)
by rusty on Mon Oct 14, 2002 at 05:14:47 PM EST

And I don't think there's any doubt about the value of his work. :-)

Not the real rusty
[ Parent ]
Yeah, but does Rusty (none / 0) (#57)
by libertine on Mon Oct 14, 2002 at 06:31:45 PM EST

have enough cool piercings to measure up for the job? */me ducks*

"Live for lust. Lust for life."
[ Parent ]
Just one (5.00 / 1) (#60)
by rusty on Mon Oct 14, 2002 at 07:28:48 PM EST

And it's not on an extremity.

We're all going to hell for this thread you know. I can't wait till the first K5 Hell Get-Together. spiralx will get ripped on horse tranquilizers and accidentally burn down the eighth circle.

Not the real rusty
[ Parent ]

Not on an extremity? (5.00 / 2) (#62)
by Captain_Tenille on Tue Oct 15, 2002 at 05:31:33 PM EST

What, do you have a piercing through your rib cage or something? Last I checked, most piercings tended to be through an extremity of some sort.
/* You are not expected to understand this. */

Man Vs. Nature: The Road to Victory!
[ Parent ]

Not all... (5.00 / 1) (#64)
by rodgerd on Wed Oct 16, 2002 at 12:28:10 AM EST

My sister-in-law has a piercing that runs down her sternum (or more accurately, through the skin above it). So no, not all piercings are through extremities.

[ Parent ]
slashdotted... kuroded... (5.00 / 4) (#44)
by phraggle on Sun Oct 13, 2002 at 11:03:02 AM EST


cron (5.00 / 1) (#45)
by evro on Sun Oct 13, 2002 at 12:33:36 PM EST

Sorry if this sounds ignorant, but why not have a cron job that cleans out these cached pages every n minutes?

Also, if ($HTTP_USER_AGENT =~ /^Googlebot/) maybe treat it as a logged-in user? That's probably a less good solution, but I ended up doing something similar for my company's page because googlebot was throwing off our logs.
"Asking me who to follow -- don't ask me, I don't know!"

I'll type slowly. (3.66 / 3) (#47)
by Shren on Mon Oct 14, 2002 at 12:03:23 PM EST

The cached pages keep the server from having to build the dynamic pages all the time. This is done because the cached pages are faster and less memory intensive than the dynamic pages. If the server can't load all of the cached pages for Google, then it's sure as hell not going to be able to load all of the dynamic pages for Google.

Just because caching wasn't enough doesn't mean that caching is bad. Bugged, maybe, but not bad. If running on your legs doesn't help you catch the bus, you don't saw off your legs and try it on your hands. You try to be on time for the bus next time.

[ Parent ]

Actually, that's not quite right. (none / 0) (#51)
by vectro on Mon Oct 14, 2002 at 04:57:25 PM EST

The problem wasn't that the server couldn't meet Google's request speed. The problem is that Google left behind a bunch of cached pages that filled up the disk. So if Google were served dynamic pages, the problem would go away, because the dynamic pages would not generate cache entries.

Really, though, the right solution is to use a cache replacement policy that keeps the cache under a certain size; for example, an LRU (or even FIFO) cache ought to work fine.

“The problem with that definition is just that it's bullshit.” -- localroger
[ Parent ]

Right. (none / 0) (#56)
by Shren on Mon Oct 14, 2002 at 05:40:11 PM EST

Regardless of the exact problem with the cache, however, the solution isn't to shoot it every hour or so. That was my main point.

[ Parent ]

Why not? (none / 0) (#70)
by Pikachu with an Axe in his Head on Thu Dec 12, 2002 at 11:31:45 PM EST

Most caching mechanisms have a way to maintain a disk size limit, and it usually involves killing the oldest n things or anything over a certain age.

[ Parent ]
You should read as slowly as you type. (5.00 / 1) (#68)
by evro on Wed Oct 16, 2002 at 04:07:04 PM EST

The problem had nothing to do with the site not being able to generate pages quickly enough.  It had to do with the site creating too many cached pages.  My suggestion was to have a cron job that runs every n minutes and deletes all cached pages that have not been accessed in >= x minutes.  If Googlebot is the only accessor of a page in the past 2 hours, it's probably safe to delete it and let it be created again the next time it's requested.  Caching is great for a document that's being requested 100 times a minute; for a document that's being requested < once an hour it probably costs more in disk space than it saves in CPU time.
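
Roughly what that cron job could look like, sketched in Python. The function name is hypothetical and it assumes a flat cache directory. It also relies on the filesystem actually updating access times: on a partition mounted with noatime (or relatime) the atime is not reliable, and the cache code would have to record the last access itself.

```python
import os
import time


def expire_stale_cache(cache_dir, max_idle_minutes, now=None):
    """Remove cached files not accessed for at least max_idle_minutes.

    Returns the sorted names of the files that were removed.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_idle_minutes * 60
    with os.scandir(cache_dir) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_atime <= cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Run every n minutes with x = 120, that drops anything nobody has asked for in two hours and leaves the busy stories alone.
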
"Asking me who to follow -- don't ask me, I don't know!"
[ Parent ]
What Does 'df' Do? | 70 comments (70 topical, 0 editorial, 0 hidden)


All trademarks and copyrights on this page are owned by their respective companies. The Rest © 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux. The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.
