What is metadata?
In case I lost you in that opening paragraph, what I’m talking about here is the concept of “data about data,” or, more accurately, information which talks about itself a bit. Simple examples of metadata surround us all the time: in the process of fetching this page, for example, your browser was probably told how large a file it is (some number of bytes), what type of file it is (text marked up with HTML), and what language it’s written in (English). Even though it’s usually invisible to you, this information and more can and should be sent with every single web page, because this sort of metadata is very obviously useful. Telling the browser how big the file is lets your computer set aside memory for it, specifying the type of file helps figure out what type of program should deal with it, and stating the language up-front lets your computer know that the page should be displayed using the Western alphabet rather than, say, Chinese characters.
And it’s common to go much further than this; for example, if you have a page about vineyards in Roanoke, Virginia, you probably wouldn’t be content to just give it a descriptive title:
Knowing that HTML provides the <meta> element for expressing certain types of metadata, you’d probably also add some more information:
<meta name="description" content="Vineyards in and around Roanoke, Virginia, and other local wine-related info">
<meta name="keywords" content="wine, wine-growing, wine-tastings, vineyards, vintages, Roanoke, Virginia">
This will help search engines to categorize your page and return it as a result in relevant searches, which is generally considered a good thing; good enough that at least half, and likely far more, of the pages you see on the web use this kind of simple, effective metadata. But <meta> tags are far from revolutionary and, as metadata goes, they’re still kid stuff.
There’s metadata and then there’s metadata
Real Web Authors are into the sort of hardcore metadata that (they think) HTML simply isn’t built to handle. They want to be able to express more interesting information than “this is a page about vineyards,” and they come up with some pretty complex and interesting ways to do it. To take a simple example, consider a statement like “Alice is Bob Jones’ friend.” This is potentially useful information (it can be used for networking, for “vouching” for a new acquaintance, and so on), but how could it be expressed in a way that, say, a search engine could understand? One solution is FOAF (Friend Of A Friend) markup. FOAF is based on RDF, a metadata format standard from the W3C, and it can get a little unwieldy. For example, if Bob wanted to create a simple FOAF document explaining that Alice is his friend, it would look something like this:
That translates, roughly, to “I’m Bob Jones, and Alice is my friend.” And, up until very recently, even very knowledgeable people in the field would have told you that there was simply no way to express that sort of metadata in HTML; hence we need the Semantic Web and technologies like RDF and OWL. Or at least, we need them if we assume there’s nothing in HTML which can provide this functionality; as it turns out, that assumption is wrong.
Link relationships and metadata
HTML currently provides two elements for linking to particular resources: the workhorse is the <a> element, which is what most of us mean when we’re talking about links. There’s also the <link> element, which the HTML spec says “conveys relationship information that may be rendered by user agents in a variety of ways” and which is responsible for most of the stylesheets and “favicons” on the web. For example, the following indicates that the file style.css is the page’s stylesheet:
<link rel="stylesheet" href="style.css" type="text/css">
The way a browser knows that this is the stylesheet is, of course, by the content of the rel attribute, which is where the magic happens. The spec says that rel should be a list of “link types,” and provides a variety of types to choose from as needed. And rel can also be applied to links created with <a>; for example, to provide a link back to a site’s index:
<a rel="index" href="index.html">Home</a>
Some browsers already provide a menu of navigation options based on link relationships in the page, and Mozilla’s link pre-fetching feature will pre-load the next page in a sequence if it finds a link with rel="next". These are interesting features which require very little work to use, since adding a rel attribute to a link is wonderfully easy. But the list of link types in the HTML specification is pretty sparse, given the huge number of possible relationships between pages. And that’s where metadata profiles come in.
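As a rough sketch of how these relationships fit together, the head of the second page in a three-page series might carry links like the following. The file names are hypothetical; index, prev, and next are link types from the HTML 4.01 list, with index used here the same way as in the example above.

```html
<!-- Hypothetical file names; "index", "prev" and "next" are HTML 4.01 link types -->
<link rel="index" href="index.html">
<link rel="prev" href="page1.html">
<link rel="next" href="page3.html">
```

A browser that builds a navigation menu from link relationships could offer these as Home, Previous, and Next, and a pre-fetching browser could start loading page3.html in the background.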
The forgotten attribute: profile
Here’s a quick quiz to amuse your markup-savvy friends: without looking at the HTML specification, how many different attributes can you think of which legally apply to a <head> tag? The answer is three: lang, dir, and profile. You can be forgiven if you didn’t get any of them, and especially if you didn’t get profile; it’s the attribute that time forgot. But it’s also the attribute which makes rich, “hardcore” sorts of metadata possible in pure HTML.
profile simply “specifies the location of one or more meta data profiles” for the page, but that’s where the magic is. A metadata profile isn’t hard to create, and is even easier to use. The specification doesn’t actually outline the format of a metadata profile; the Dublin Core profile is a highly detailed document, but the very popular XFN profile is much simpler, and the XHTML Meta Data Profiles tutorial page provides a simple definition list as a sample profile. Returning to the example of Alice and Bob, Bob could avoid all that messy FOAF markup he created earlier by using the XFN (that’s “XHTML Friends Network”) profile with his page and linking to Alice’s site like so:
<a rel="friend" href="http://example.com/alice/">Alice</a>
And they said it couldn’t be done in HTML.
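For a browser or an indexing tool to know that “friend” means what XFN says it means, the page also has to name the profile on its <head>. Here is a minimal sketch of what Bob’s page might look like, assuming the XFN 1.1 profile URI http://gmpg.org/xfn/11 and a made-up page title:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head profile="http://gmpg.org/xfn/11"> <!-- assumed XFN 1.1 profile URI -->
  <title>Bob Jones</title> <!-- hypothetical title -->
</head>
<body>
  <p>
    <!-- "friend" takes its meaning from the profile named on <head> -->
    <a rel="friend" href="http://example.com/alice/">Alice</a>
  </p>
</body>
</html>
```

Because rel holds a space-separated list of link types, Bob could combine several values from the same profile on a single link.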
And a metadata profile can specify other types of information besides link types: the Dublin Core profile specifies information to be inserted in <meta> tags, and the scheme attribute in HTML allows for easy interpretation of troublesome formats like dates (use of scheme makes it possible, for example, to determine whether 09-11-2001 refers to September 11 or November 9).
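As a sketch of how scheme might remove that ambiguity: the first element below uses a purely illustrative scheme name, and the second assumes Dublin Core style conventions (a DC. prefix on the property name and a W3CDTF date scheme). Both should be checked against the profile actually in use, since the profile is what gives a scheme its meaning.

```html
<!-- "Month-Day-Year" is an illustrative scheme name: the date reads as September 11, 2001 -->
<meta name="date" scheme="Month-Day-Year" content="09-11-2001">

<!-- Assumed Dublin Core style: "DC." prefix, date written as YYYY-MM-DD -->
<meta name="DC.date" scheme="W3CDTF" content="2001-09-11">
```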
The possibilities of profiles
Common use of profiles would make it possible to express a vast array of information without having to resort to convoluted, heavily-abstracted solutions like RDF; think of it as metadata for the people. For example, XFN was where I first heard of metadata profiles (and, I imagine, where a lot of people first heard of them), and it has become extremely popular in the weblogging community. With XFN’s profile, it’s possible and actually downright easy to turn a blogroll into much more than a list of links; for example, a simple script can index the pages of a group of people who all use XFN and build a map of their relationships to each other. From there it’s a simple step to answering questions based on those relationships, and to other interesting applications.
And the potential doesn’t stop with keeping track of your social circle; there are even more interesting possibilities coming to light. For example, Ian Hickson (formerly of the Mozilla project, now working for Opera) recently wondered aloud about ways to fight spam comments on weblogs and other sites which allow open commenting:
I’m thinking that HTML should have an element that basically says "content within this section may contain links from external sources; just because they are here does not mean we are endorsing them" which Google could then use to block Google rank whoring.
The idea, of course, is that without the benefit of increased PageRank there would be much less incentive to post spam comments. A Web developer named Lachlan Hunt saw that comment and posted some ideas on using metadata profiles to implement this; for example, a profile could define an “unendorsed” relationship, which search engines could look for and use to adjust their ranking calculations, with an unendorsed link providing little or no benefit to the page linked.
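Here is a sketch of what that might look like in a comment thread. Both the profile URI and the unendorsed link type are hypothetical; they stand in for whatever a real profile would define.

```html
<!-- Hypothetical profile URI; the profile would define the "unendorsed" link type -->
<head profile="http://example.com/profiles/links">
  <title>Comments on: Vineyards of Roanoke</title>
</head>
<body>
  <div class="comment">
    <p>Nice post! Visit
      <!-- Link supplied by a commenter, so the site marks it as not endorsed -->
      <a rel="unendorsed" href="http://example.com/commenter-site/">my site</a>.
    </p>
  </div>
</body>
```

A search engine that understood the profile could then give the marked link little or no weight in its ranking.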
Lachlan also proposed a number of other interesting link relationships which would be handy in everyday use; sites like k5 and Slashdot could use rel="member-only" to indicate links to sites like The New York Times which require registration to view articles, and using the relationship comment would make it easy to quickly distinguish links to external pages from links to comments in a discussion forum. Altogether he has quite a list, ranging over accessible versus inaccessible sites; kid-friendly versus adult-themed; and plenty of others which could lead to interesting searching and cataloging utilities if commonly used.
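Since the specification leaves the profile format open, a profile defining relationships like these could be as simple as a definition list, in the spirit of the sample on the XHTML Meta Data Profiles tutorial page. What follows is a loose sketch rather than a faithful copy of that format; the class name, the id values, and the wording of each definition are all assumptions.

```html
<!-- Hypothetical profile document, published at the URI a page names in its profile attribute -->
<dl class="profile">
  <dt id="rel">rel</dt>
  <dd>
    <dl>
      <dt id="unendorsed">unendorsed</dt>
      <dd>The linking page does not vouch for the linked resource.</dd>
      <dt id="member-only">member-only</dt>
      <dd>The linked resource requires registration to view.</dd>
      <dt id="comment">comment</dt>
      <dd>The link points to a comment in a discussion.</dd>
    </dl>
  </dd>
</dl>
```

A page would opt in by naming this document’s URI in the profile attribute of its <head> and then using the defined values in rel.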
So it seems that metadata profiles can solve problems previously thought intractable in pure HTML; this would be a huge step forward for useful metadata on the web if it got into widespread use, but at the moment the technology is far too obscure. Resolving this would require several things to happen:
- First and foremost, people need to be made aware that this technology exists; that’s one of my main motivations for writing this article. The profile attribute has been sitting around in the HTML spec for years without much notice or use, but once people know about it the ease of use compared to other metadata solutions should help its popularity quite a bit.
- Second, support for metadata profiles needs to be built into common web tools; weblogging systems are implementing XFN, but there needs to be support in more mainstream content-management systems and also web composing applications (Dreamweaver, FrontPage, etc.), and it needs to not be specific to one profile.
- Finally, a standard format for writing metadata profiles needs to be defined, the simpler the better. This could be a good project for the newly-formed WHAT-WG to tackle; it could get a specification out fairly quickly, and once that was established it would be much simpler to lobby for support and introduce new users.
The future will probably be full of metadata one way or another, but the technology to do it simply and efficiently in HTML exists today in the form of metadata profiles; all that's needed is for it to get into wider use and “the future” will be here.
- Well, really it’s the character encoding, not the language, which determines the alphabet used, but character encoding is a whole ’nother article. Let’s keep this one simple and on-topic.
- Or at least it will for some search engines. Due to rampant abuse (read: stuffing completely unrelated keywords into your page in hopes of coming up in more searches), there are a lot of search engines which ignore or don’t give much weight to keywords found in <meta> tags. Moral of the story: don’t piss in the metaphorical metadata well.
- I say “roughly” because I don’t think I can really explain the abstractions of RDF adequately for an article on this level. If you’re interested in learning exactly what’s going on in all that markup, I recommend you pick up a good book on RDF.
- For reasons which I don’t fully understand, it’s not possible to apply class; the “common” attribute set isn’t specified as applicable to <head>.
- For example, if Bob was supposed to meet Alice but couldn’t remember where the party was, he could ask a search engine for the latest weblog entries of Alice’s friends and see if one of them mentioned it. This is the sort of application the Semantic Web working groups dream about, and it would be pretty simple to implement.