Why should a new filesystem be considered a watershed event for programming? Filesystems are a known quantity, aren't they? They:
- Allow the creation, modification, and deletion of directories.
- Create, modify, and delete files in directories.
- Make a file or directory available from multiple locations (symbolic and hard links).
- Record when a file is created.
- Record the last time a file is modified.
- Record the size of entries.
- Keep track of which users are allowed to access and write to a file or directory.
- Store and retrieve files as quickly as possible.
As long as they serve files quickly and avoid losing data, what more could anyone ask for? Sure, there can be variation in the implementation. For example, some filesystems support ACLs in addition to standard UNIX-style permission bits. But no matter how one implements a filesystem, it's a static bit repository. This has been the case for dinosaurs such as the Minix filesystem, the FAT filesystem, and the ext filesystem and its descendants. It has even been the case for the newcomers JFS, XFS, ReiserFS, and their ilk. Speeds change. The amount that can be stored has increased dramatically. Reliability in adverse conditions has improved. There have been no fundamental shifts in usage, however. This is the limited role of a filesystem, isn't it? Isn't it?
WHY FILESYSTEMS ARE THE WAY THEY ARE TODAY
All filesystems contain meta information -- data about data: a file's size, a directory's creation timestamp, etc. This list of meta information is fixed, however. If I wanted to add a "usefulness" attribute to every filesystem entry, for example, I would have to patch the filesystem and quite likely the kernel to allow access to this attribute. In addition, I might break other applications. At the very least, I would have to recompile every application that called the stat program or routine.
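To make the fixed list concrete, here is a minimal Python sketch that asks the operating system what it knows about a file. Python's os.stat wraps the stat routine mentioned above; the temporary file and the describe helper are my own illustration, not anything from a particular filesystem.

```python
import os
import stat
import tempfile

def describe(path):
    """Return the fixed set of metadata that stat() exposes for path."""
    st = os.stat(path)
    return {
        "size": st.st_size,                 # length in bytes
        "mode": stat.filemode(st.st_mode),  # file type and permission bits
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "links": st.st_nlink,               # hard link count
        "modified": st.st_mtime,            # last modification time
    }

# Illustration only: a throwaway file holding 11 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    demo_path = tmp.name

info = describe(demo_path)
os.unlink(demo_path)
print(info)
```

There is no slot in that structure for a "usefulness" value, and no way for an application to add one.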
When confronted with issues of custom meta information, filesystem architects have largely told application developers to shove off. And for good reason. Very few would assert that the album, artist, and song information in an MP3 file is unimportant. But this information is customarily so small that storing it in a file of its own wastes a great deal of space on the hard disc. Filesystems are organized into "blocks" of data, and a block usually holds data from only one file or directory entry. To make access to these blocks efficient, they are made a uniform size. If your meta information is only 400 bytes, allocating a 4 kilobyte block of storage is gratuitous. Even in today's climate where 80GB drives are normal and larger are commonplace, no one wants to run out of space when the disc is really only half full. In some situations, where the block size is 32 kilobytes for example, the situation is even worse. ReiserFS tackled the problem with tail packing, but the common usage of the notail mount option is testament to the space vs. performance tradeoff.
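The arithmetic behind that complaint is easy to check. This is a rough sketch that assumes each file occupies whole blocks and that no tail packing is in effect; the slack function is a name I made up for the illustration.

```python
def slack(file_size, block_size):
    """Return (bytes allocated, bytes wasted) for a non-empty file of
    file_size bytes stored in whole blocks of block_size bytes."""
    blocks = -(-file_size // block_size)  # ceiling division
    allocated = blocks * block_size
    return allocated, allocated - file_size

for block_size in (4096, 32768):
    allocated, wasted = slack(400, block_size)
    print(f"{block_size:>6}-byte blocks: {wasted} of {allocated} bytes wasted "
          f"({wasted / allocated:.1%})")
```

With 400 bytes of meta information, roughly nine tenths of a 4 kilobyte block goes unused, and nearly all of a 32 kilobyte block.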
Thus ID3 tags were born. Rather than save independent files that are less than 100 bytes in length, they are appended to the MP3 file in a custom data structure, a miniature filesystem if you will. This has caused a few problems over time. People wanted more information than could be provided by ID3, so ID3v2 and its ilk were born. Unfortunately, there were already quite a few programs that understood ID3v1 meta information. In order to add more information, the additional information had to be added in front of the ID3v1 information but, again, after the audio data. Every revision of the ID3 spec has had to do this. To make matters worse, once Internet access became fast enough to stream audio data, online jukeboxes and news streams started to proliferate, and ID3 meta information sits at the end of an MP3 audio file. An MP3 file couldn't just be shipped over the wire unchanged, as it wouldn't do to find out what was playing only after the song was over and the next one was just starting. Most of us want to know while the song is playing. Thus the spec was further massaged to allow sending these attributes at the beginning of the file...as long as you were streaming.
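For the curious, here is a small sketch of what reading an ID3v1 tag involves. The 128-byte layout ("TAG" followed by fixed-width title, artist, album, year, comment, and genre fields) is the ID3v1 format; the sample bytes below are made up and are not real audio.

```python
import struct

# "TAG", title, artist, album, year, comment, genre
ID3V1_FORMAT = "<3s30s30s30s4s30sB"
ID3V1_SIZE = struct.calcsize(ID3V1_FORMAT)  # 128 bytes

def read_id3v1(data):
    """Parse an ID3v1 tag from the tail of a file's bytes, or return None."""
    if len(data) < ID3V1_SIZE:
        return None
    fields = struct.unpack(ID3V1_FORMAT, data[-ID3V1_SIZE:])
    if fields[0] != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes (or spaces) to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    title, artist, album, year = (text(raw) for raw in fields[1:5])
    return {"title": title, "artist": artist, "album": album, "year": year}

# Made-up stand-in for audio data followed by a tag.
fake_audio = b"\x00" * 1000
fake_tag = struct.pack(ID3V1_FORMAT, b"TAG", b"Example Song", b"Example Artist",
                       b"Example Album", b"2004", b"", 0)
tag = read_id3v1(fake_audio + fake_tag)
print(tag)
```

Note that the parser has to see the very end of the data before it can say anything, which is exactly the streaming problem described above.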
The same balkanization can be found in image files and the way they store copyright, authorship, comments, and other such meta information. The way comments are stored in a GIF image is markedly different from storage in a JPEG, TIFF, or PNG image and thus requires a completely different codebase to handle each.
WHY DON'T FILESYSTEMS JUST HANDLE SMALL FILES BETTER?
Some filesystems have tried. It has been a major topic of discussion in academia and among commercial operating system developers for almost as long as there have been filesystems. All the testing, the research, and the math has pointed to a marked loss of performance -- many times a drastic loss in performance. It hardly seemed worth it when existing files were at least 4 kilobytes in length more than 90% of the time. (Of course, if all meta information were stored as independent files, it would be far less than 90% of the time.)
Another stumbling block in the quest for the handling of smaller files is one of data management. After all is said and done, even if filesystems could efficiently handle very small files for metadata, a programmer would run into association issues. Specifically, the directory of hundreds or thousands of MP3s would now contain hundreds or thousands of files that describe the MP3 files. If an MP3 file were deleted or renamed, the corresponding meta information would suddenly describe a file that didn't exist. If one were ever to copy a song, all the associated files would have to be remembered as well. What a pain! Better to attach the info somehow. Once again, ID3 tags seem sensible.
WHAT IS A FILE ANYWAY?
This is an easy one. A file is a sequence of bytes. A directory is a collection of files. Simple, right?
But why is this so? Why must there be a strict distinction? Why cannot a resource be both a file and a directory at the same time? Wait a second! That's crazy talk! How would that work?
From a developer perspective, it's quite simple. Let's say you have a file called foo.mp3 in your home directory. If I try to access ~/foo.mp3, I get the file, as expected. If, however, I try to access ~/foo.mp3/, I get a directory entry: a list of file attributes, meta data. Now our MP3 file could have the information that would normally be included in an ID3 tag, but accessible through normal file access utilities like Emacs, vi, gedit, etc.
Sounds great, so why hasn't anyone done this? Surely I'm not the first person to mention this. In fact, some people have done this in the past. It never quite caught on because of the filesystem efficiency concerns cited above.
NO! ME FIRST!
Another major limitation of existing filesystems is one of atomicity. One of the most common security issues in software today is the race condition. A quick glance at Bugtraq would show you the scale of this problem. Many system administrators are familiar with the security issues for setuid scripts. UNIX-like systems open executable files and check to see how they are to be executed. A file may be an ELF binary, an a.out binary, a Perl script, etc. If a file begins with the characters '#!', the file is passed to an interpreter for processing. For example, if a file begins with the following:
#!/bin/csh
a UNIX-like system will open the file, find that it has what is called a "magic line", and run the script through the program /bin/csh, the C shell. A file is read to find the interpreter to be used, the permissions on that file are checked, and the interpreter is invoked with the script as input. Seems innocent enough unless a script kiddie has good timing. Imagine that after the permissions were read but before the script is read in and passed to the interpreter, the contents of the file were changed. Suddenly you have arbitrary code running.
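As a rough illustration of the first step, the sketch below does in user space what the kernel does when it inspects a script: read the first line and look for '#!'. The interpreter_for helper is hypothetical, and a real kernel handles more cases than this. The point is that the file is opened and read here, then opened and read again later by the interpreter, and nothing ties the two reads together.

```python
import os
import tempfile

def interpreter_for(path):
    """Return the interpreter named on a script's '#!' line, or None."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    # The first word after '#!' is the interpreter; anything else is an argument.
    parts = first_line[2:].strip().split(None, 1)
    return parts[0].decode() if parts else None

# Illustration only: a throwaway script with a C shell magic line.
with tempfile.NamedTemporaryFile("wb", suffix=".csh", delete=False) as script:
    script.write(b"#!/bin/csh\necho hello\n")
    script_path = script.name

interpreter = interpreter_for(script_path)
os.unlink(script_path)
print(interpreter)
```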
Another problem is the use of temporary files. A reasonably careful program will check to see if a file exists prior to creating a file. But let's say that program has setuid root rights. It checks the temp directory for the existence of a file. It doesn't exist. A malicious gremlin creates a symbolic link from that filename to /etc/passwd just in the nick of time. Then the setuid program creates and writes to the temporary file. Oops! Say goodbye to your user manifest and password file.
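For comparison, this is the conventional POSIX workaround for the temporary file race, sketched in Python. Passing O_CREAT together with O_EXCL makes the existence check and the creation a single step, and the call fails if the name already exists, even as a symbolic link. The create_exclusively helper and the file name are made up for the illustration.

```python
import os
import tempfile

def create_exclusively(path, data):
    """Create path and write data, failing if anything already exists there."""
    # O_CREAT | O_EXCL: check-and-create happens in one system call, so there
    # is no window between "it doesn't exist" and "now I create it".
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "scratch.tmp")

create_exclusively(target, b"first")
try:
    create_exclusively(target, b"second")
    second_attempt = "overwrote"
except FileExistsError:
    second_attempt = "refused"

print(second_attempt)
```

It works, but only if the developer knows to ask for it. In practice, Python's tempfile.mkstemp does this for you.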
Tricks like the use of fsync and file renaming are used to simulate atomic file operations in current filesystems. They can be effective, but they place an undue burden on the developer to apply them correctly. This is assuming, of course, that the developer even knows how to use these tricks...or that the tricks are even necessary.
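Here is a minimal sketch of the fsync-and-rename trick, assuming a POSIX-style filesystem where renaming within one directory is atomic. The atomic_write helper is my own name for it. A more careful version would also fsync the containing directory, which I have left out for brevity.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path's contents so readers see the old file or the new one,
    never a half-written mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # force the new contents to disk first
        os.replace(tmp_path, path)  # then swap the name over in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

workdir = tempfile.mkdtemp()
config = os.path.join(workdir, "config.txt")
with open(config, "wb") as f:
    f.write(b"old settings")

atomic_write(config, b"new settings")
with open(config, "rb") as f:
    content = f.read()
print(content)
```

That is a fair amount of ceremony for "write this file safely", and every step has to be in the right order.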
These attacks can be difficult -- although some programs have made them entirely too easy -- but if a filesystem had support for transactions similar to the functionality of advanced databases, they could be avoided. Imagine being able to link multiple operations into a single atomic component. The programmer could conceivably make the check for file existence, open, and write and be assured that the outside world was isolated from the process. The fix for race conditions becomes only a couple of lines of filesystem transaction code instead of POSIX hacks that rely on side effects to function properly.
WAIT! I'M NOT DONE YET!
Atomicity also rears its head when performing multiple, related operations. Let's say that you have a queue where you want to delete one file, create another, and update a log. With transactions, it would be possible to enforce that all operations succeed or none succeed. No more checking for dangling files after the program crashes in the middle of an operation. If the replacement file creation or the log addition fails, the source file would not be deleted.
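Without transactions, the application has to fake this itself. The sketch below is one best-effort way to do it, under my own assumptions: do the destructive step last and undo the earlier steps if anything fails. The replace_job helper and the file names are hypothetical. Note that this only handles errors the program gets to see; if the machine loses power halfway through, the rollback code never runs, which is the gap a filesystem transaction would close.

```python
import os
import tempfile

def replace_job(old_path, new_path, new_data, log_path):
    """Create new_path, append to the log, then delete old_path.
    If any step raises OSError, undo the earlier steps and re-raise."""
    log_size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
    created = False
    try:
        fd = os.open(new_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        created = True
        with os.fdopen(fd, "wb") as f:
            f.write(new_data)
        with open(log_path, "a") as log:
            log.write(f"replaced {old_path} with {new_path}\n")
        os.unlink(old_path)  # destructive step goes last
    except OSError:
        if created:
            os.unlink(new_path)
        if os.path.exists(log_path):
            # Cut the log back to its earlier length (an empty log is left
            # behind if it did not exist before).
            os.truncate(log_path, log_size)
        raise

workdir = tempfile.mkdtemp()
job1 = os.path.join(workdir, "job-1")
job2 = os.path.join(workdir, "job-2")
job3 = os.path.join(workdir, "job-3")
log = os.path.join(workdir, "queue.log")
with open(job1, "w") as f:
    f.write("pending")

replace_job(job1, job2, b"pending", log)      # succeeds
try:
    replace_job(job1, job3, b"pending", log)  # fails: job-1 is already gone
    failed = False
except OSError:
    failed = True

print(sorted(os.listdir(workdir)), failed)
```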
So why don't filesystems do this already if these are such good ideas? Once again, speed concerns raise their ugly heads. So far, the speed loss for what are a minority of filesystem accesses has outweighed the development cost passed on to application developers.
THE TIMES THEY ARE A CHANGING
But what would happen if the efficiency issues were no longer an obstacle? What if a filesystem could be made that could store files of arbitrarily small size without wasting 90% of available storage? What would happen if that filesystem was called ReiserFS v4 (hereafter referred to as Reiser4)?
Recently on the Linux kernel mailing list, Grant Miner posted some speed comparisons of ext3, JFS, XFS, and Reiser4. CPU usage for Reiser4 is higher than the others, but so is the performance. Even with its superior handling of very small files, its treatment of entries simultaneously as files and directories, and its transaction support, and even though this is pre-release software that is still very much in development, it was faster in those comparisons than the other filesystems tested!
It doesn't take a huge leap of logic to assume that if these speed advances are corroborated by others independently and the filesystem proves itself stable enough for production use with vital data, it will supplant ReiserFS v3 and ext3 as the dominant filesystem used on Linux. If this comes to pass, many programs will be written to take advantage of Reiser4's advances.
Take /etc/passwd, for example. As keeping passwords out in the open is a bad idea for security -- even if they are encrypted -- many distributions use shadow password files. The original file, which contains the user IDs, home directories, default shells, etc., is left publicly readable. The password hashes, however, are placed in the shadow file, which is not publicly readable. So far, I haven't told anyone who's spent any appreciable amount of time on a UNIX-like system anything new.
Keep in mind everything you've read about the capabilities of Reiser4, in particular the ability to efficiently manage small files. If /etc/passwd were to be split up into multiple files, one for each user, each file could have permissions set that would match the user. A shadow file would no longer be necessary, as the user's password hash does not need to be protected from the user, who presumably already knows the password.
But this, of course, breaks older programs that read and edit the password file. It is for this reason (among others) that Reiser4 supports filesystem plugins. One such plugin is a directory aggregator. Just as a file can also be a directory, a directory can also appear as a file. In this case, the directory would appear as an aggregation of files separated by a delimiting character (such as '\n'). Taking this example to its logical conclusion, each field could be a separate file. So to find a user's preferred shell (in this example, user ID is 107), one would access the file /etc/passwd/107/shell. Changing the shell would be a simple matter of:
echo '/bin/bash' > /etc/passwd/107/shell
Since user ID 107 owns the /etc/passwd/107/ directory and all of its descendants, a setuid program/script isn't necessary. Seeing possibilities yet?
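To picture what an aggregator would present to older programs, here is a user-space simulation in Python. It is only a sketch of the idea, not Reiser4's plugin interface: the per-field file names, the sample user, and the aggregate function are all assumptions of mine, and it runs against a scratch directory rather than the real /etc/passwd.

```python
import os
import tempfile

# Hypothetical per-field file names, in classic passwd column order.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")

def read_field(user_dir, field):
    with open(os.path.join(user_dir, field)) as f:
        return f.read().strip()

def aggregate(passwd_dir):
    """Render a directory of per-user directories as classic passwd text:
    one line per user, fields joined by ':', lines separated by newlines."""
    lines = []
    for entry in sorted(os.listdir(passwd_dir), key=int):
        user_dir = os.path.join(passwd_dir, entry)
        lines.append(":".join(read_field(user_dir, field) for field in FIELDS))
    return "\n".join(lines) + "\n"

# Build a scratch "passwd directory" holding one made-up user, ID 107.
root = tempfile.mkdtemp()
user_dir = os.path.join(root, "107")
os.mkdir(user_dir)
sample = {"name": "alice", "password": "PLACEHOLDER_HASH", "uid": "107",
          "gid": "107", "gecos": "Alice Example", "home": "/home/alice",
          "shell": "/bin/csh"}
for field, value in sample.items():
    with open(os.path.join(user_dir, field), "w") as f:
        f.write(value + "\n")

# Equivalent of the echo command above, against the scratch tree.
with open(os.path.join(user_dir, "shell"), "w") as f:
    f.write("/bin/bash\n")

passwd_text = aggregate(root)
print(passwd_text, end="")
```

An old program reading the aggregated view would see an ordinary passwd line, while a new program could touch just the one field it cares about.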
A convenient side effect of a modular plugin-based filesystem and the blurring of what is a file (versus an attribute or a directory) is that Reiser4 helps to fully realize the UNIX goal of making everything a file: even the directories are files.
I CAN'T BELIEVE NO ONE HAS DONE THIS YET
Some have done similar things in their filesystems or supplementary utilities. The functionality is similar to fsattr on Solaris, XFS's extended attributes, and others that aren't as widely used. The primary difference between Reiser4's implementation and these others is that an attribute is not fundamentally different from a "normal" file; no changes need to be made to existing programs and libraries in order to take advantage. There is no limit on name or content that does not already exist for directories and files. However, there is an interface (not yet finalized) that allows access to ReiserFS attributes with a single file descriptor and call to fopen in a manner similar to Solaris and XFS. Without this interface, a file descriptor and a call to fopen would be necessary to retrieve each attribute -- a certain performance killer for multiple attributes.
BeFS made great strides toward implementing the filesystem as a database but ultimately scaled back their ambitions to attributes and indexed access.
MacOS has had extended file attributes as well, but not truly at the filesystem level. An application layer was added to the system that tracked attributes. Microsoft, in its work on the next version of its operating system, has similarly scaled back its initially ambitious plans for the filesystem so that the functionality sits at a higher level.
However, all of these efforts underscore an industry-wide trend: metadata belongs with the data, not with the application.
WHY WOULDN'T IT BE USED?
Inertia. It can't be ignored. MS Outlook has been one of the most exploited vectors for viruses ever seen (if not the most exploited), and yet a large number of people still use it. Even if another program came along with all of MS Outlook's functionality but none of the security holes, it would still take some time for people to make the transition.
The fact that Reiser4 would still be POSIX-compliant -- would still support last modified timestamps, file permissions, et al. -- makes the transition easier. But systems would have to be upgraded, and some of those systems have been up for years and are still running kernel version 2.2 or earlier.
Assumptions with regard to future application development will have to be revisited. Software developers are no different from most other segments of the population. When they find something that works for them and has worked for years, they will not likely give those practices up easily.
How do you convince a mail server author that each message can be a separate file and the mail headers can be attributes of that file when conventional wisdom for years has stated quite the opposite? But imagine how much simpler the daemon would be if it didn't have to spend so much time managing data or parsing through a mailbox separated by ^G control characters.
All audio files could have the same interface for attributes without regard to whether they have the extension *.mp3, *.ogg, *.wav, or anything else under the sun. ID3 tags and their equivalents in other file formats would be redundant and unnecessary code complexity.
Suppose you could embed the MIME type in the file attributes directly instead of relying on a three-letter extension when serving files to the web; the web server would just know.
Excessive use of fsync and weird file-renaming hacks could be discarded, and security would improve at the same time.
Depending on how much you care for writing new versions of old programs, Reiser4 could be either one of the greater boons to software engineers or one of the greater banes. Either way, programming on Linux will likely never be the same. Reiser4 may be the first, but I doubt it will be the last. Rather, I think the ante just got upped for filesystems and Reiser4 is raising the table.
You can read more at the ReiserFS homepage. There you can find more information about Reiser4, its design goals, filesystem transactions, and blurring the line between files and directories.