The IETF RFC series - the famous "Request For Comments" documents - provides the foundation for the Internet as we know it today. This extensive collection goes back as far as 1969, and it keeps growing. As of today, there are more than 3400 "official" RFCs in the IETF index.
This formidable collection of documents poses a problem for the student. Several documents are only of historic value, having been written in the early days of the Internet. Others were of some importance, but were later made obsolete by newer documents. Some RFCs were never implemented in practice. Others deserve a mention, if only because they help the student understand the prevailing social conventions, such as the April Fool's Day RFCs. All these factors make the study of the RFC collection needlessly hard for the neophyte. Some type of guide, or index, pointing to the most relevant RFCs, would be handy in these situations.
In such a large collection of documents, it is indeed very difficult to pick a few documents as 'fundamental'. Any particular choice would seem biased. So we decided to develop a ranking methodology to evaluate the relative relevance of every RFC published so far.
To build our ranking, we carried out an exhaustive search of the Internet, looking for textual references to each RFC by number. The individual queries are very simple: they look for patterns such as 'RFCxxxx OR "RFC xxxx"'. We also searched for variations, such as RFC references without the leading zeroes.
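As an illustration of the query pattern described above, here is a minimal sketch of how such a query string could be assembled. The function name `rfc_query` is hypothetical, and the original script may have built its queries differently.

```python
def rfc_query(number: int) -> str:
    """Build an OR query covering the common ways an RFC is cited by number."""
    padded = f"{number:04d}"  # index form with leading zeroes, e.g. 0822
    plain = str(number)       # form without leading zeroes, e.g. 822
    forms = []
    for digits in (padded, plain):
        # Both the joined form (RFC0822) and the quoted spaced form ("RFC 0822").
        for form in (f"RFC{digits}", f'"RFC {digits}"'):
            if form not in forms:  # four-digit numbers yield duplicates
                forms.append(form)
    return " OR ".join(forms)


print(rfc_query(822))   # RFC0822 OR "RFC 0822" OR RFC822 OR "RFC 822"
print(rfc_query(1918))  # RFC1918 OR "RFC 1918"
```

For RFC numbers with four digits, the padded and plain forms coincide, so the query collapses to two alternatives.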
The process was automated with a Python script.
Thanks to the formidable Google API, the search proved relatively painless to perform - although we had to limit the number of queries to 1000 per day, in keeping with Google's API license agreement. The automation script saved intermediate results, keeping track of all work done, so we could perform a limited number of queries at a time. The script itself is very simple, but it is beyond the scope of this article. It may be made available, or posted later, if there is enough interest. After a few days, we had the results, which we present now.
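The checkpointing approach described above can be sketched roughly as follows. This is not the original script: `run_batch`, the JSON state file, and the `search` callable (a stand-in for whatever function performs one query and returns a hit count) are all assumptions made for illustration.

```python
import json
import os

DAILY_LIMIT = 1000  # daily query cap mentioned in the article


def run_batch(rfc_numbers, search, state_path, limit=DAILY_LIMIT):
    """Query up to `limit` not-yet-done RFCs, saving progress after each one.

    `search` is a caller-supplied function (hypothetical here) that takes an
    RFC number and returns its hit count.
    """
    results = {}
    if os.path.exists(state_path):
        # Resume from the results saved by an earlier run.
        with open(state_path) as fh:
            results = json.load(fh)

    done_this_run = 0
    for number in rfc_numbers:
        key = f"RFC{number:04d}"
        if key in results:
            continue  # already queried in a previous run
        if done_this_run >= limit:
            break  # quota for this run is used up
        results[key] = search(number)
        done_this_run += 1
        # Checkpoint immediately so an interruption loses no completed work.
        with open(state_path, "w") as fh:
            json.dump(results, fh)
    return results
```

Running the function once a day with the same state file walks through the whole index in batches, since RFCs already present in the file are skipped without consuming quota.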
The top 20 RFCs, by relevance:
- RFC0822: Standard for the format of ARPA Internet text messages. (~65500 hits)
- RFC2119: Key words for use in RFCs to Indicate Requirement Levels. (~44400 hits)
- RFC2026: The Internet Standards Process. (~37300 hits)
- RFC2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. (~35400 hits)
- RFC1521: MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. (~31700 hits)
- RFC1213: Management Information Base for Network Management of TCP/IP-based internets: MIB-II. (~29600 hits)
- RFC0791: Internet Protocol. (~29200 hits)
- RFC0821: Simple Mail Transfer Protocol. (~28900 hits)
- RFC1123: Requirements for Internet Hosts - Application and Support. (~27400 hits)
- RFC0793: Transmission Control Protocol. (~26700 hits)
- RFC1157: Simple Network Management Protocol (SNMP). (~26400 hits)
- RFC1522: MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text. (~26100 hits)
- RFC1035: Domain names - implementation and specification. (~24700 hits)
- RFC1034: Domain names - concepts and facilities. (~22700 hits)
- RFC1738: Uniform Resource Locators (URL). (~22700 hits)
- RFC0959: File Transfer Protocol. (~22200 hits)
- RFC1918: Address Allocation for Private Internets. (~22100 hits)
- RFC2616: Hypertext Transfer Protocol -- HTTP/1.1. (~21700 hits)
- RFC2068: Hypertext Transfer Protocol -- HTTP/1.1. (~20900 hits)
- RFC1155: Structure and identification of management information for TCP/IP-based internets. (~20800 hits)
For reasons of space, we decided to limit the list to the first 20 RFCs. These RFCs alone account for about 11.37% of all references found across all RFCs. It is very interesting to note the topics on the list above. In the top 20, we have nearly all the basics - SMTP, DNS, FTP, HTTP, and, more importantly, the TCP/IP protocols themselves. The relative prevalence of email-related RFCs is also worth noting. Last but not least, the RFCs that describe the Internet standards process itself are also highly relevant, as is RFC1918, which describes the address scheme used for private internets. The pair of RFC2616 and RFC2068 could be said to form a single entry, and the same could be said of RFC0822 and RFC2822 (which is well positioned at 42nd in the overall list); the older edition is still the most relevant RFC, according to the results of our research.
Other relevant data we could extract from our research: the top 20% of RFCs account for 60% of all references, and the top 44% of RFCs account for 80% of all references.
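The concentration figures quoted in this section (the share of references held by the top N% of RFCs) can be computed from the per-RFC hit counts along the following lines. The helper names `top_share` and `fraction_needed` are hypothetical. Only the top-20 counts are published here, so the sketch uses them just to show the arithmetic behind the 11.37% figure.

```python
def top_share(hits, fraction):
    """Share of all hits held by the top `fraction` of documents."""
    ranked = sorted(hits, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of documents kept
    return sum(ranked[:k]) / sum(ranked)


def fraction_needed(hits, target):
    """Smallest fraction of documents whose combined hits reach `target`."""
    ranked = sorted(hits, reverse=True)
    total = sum(ranked)
    running = 0
    for i, count in enumerate(ranked, start=1):
        running += count
        if running / total >= target:
            return i / len(ranked)
    return 1.0


# Approximate hit counts of the top 20 RFCs, as listed in the article.
top20 = [65500, 44400, 37300, 35400, 31700, 29600, 29200, 28900, 27400, 26700,
         26400, 26100, 24700, 22700, 22700, 22200, 22100, 21700, 20900, 20800]

print(sum(top20))  # 586400
```

If 586400 hits represent about 11.37% of all references, the full data set would hold roughly 5.2 million references. With the complete list of counts, `top_share(hits, 0.20)` and `fraction_needed(hits, 0.80)` would give the kind of figures reported in this section.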
Pitfalls and shortcomings
As with any such study, we need to ask: is the information obtained really representative of the relevance of the RFC?
First of all, one could argue that older RFCs have an advantage, for the simple reason that they have collected more references over time. The presence of RFC0822 at the top of the list is one possible piece of evidence for this argument. However, while this can lead to some distortions, it does not invalidate the final result as a measurement of the relative relevance of the documents; it is only natural that older documents are more relevant, if only in historical terms. One possible way to minimize this problem is to look for obsolete RFCs in the list, and then replace them with their newer versions.
It is also important to note that there are other ways to search for similar data that may lead to different results. One such methodology is as follows: use the Google engine to count links to the official RFC locations on the IETF web site (for example, all the links pointing back to http://www.ietf.org/rfc/rfc1918.txt). The problem with this approach is that other repositories - for example, the one at faqs.org - also attract a lot of references, making this kind of search much more difficult. Given the options, we felt that the strategy we had chosen was a good compromise, and that it could give us a reasonable result.
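The adjustment suggested above, crediting an obsolete RFC's references to its successor, could look roughly like this. It is a sketch under simplifying assumptions: `fold_obsolete` is a hypothetical helper, each obsolete RFC is mapped to a single successor (in the real index an RFC may be obsoleted by several documents), and the sample data covers only two pairs from the top 20.

```python
def fold_obsolete(hits, obsoleted_by):
    """Credit each obsolete RFC's hits to the newest document in its chain."""

    def latest(rfc):
        seen = set()
        # Follow the chain of successors; `seen` guards against cycles.
        while rfc in obsoleted_by and rfc not in seen:
            seen.add(rfc)
            rfc = obsoleted_by[rfc]
        return rfc

    merged = {}
    for rfc, count in hits.items():
        target = latest(rfc)
        merged[target] = merged.get(target, 0) + count
    return merged


# Illustrative subset: hit counts from the top-20 list.
hits = {"RFC2616": 21700, "RFC2068": 20900, "RFC2045": 35400, "RFC1521": 31700}
# RFC2068 was obsoleted by RFC2616, and RFC1521 by RFC2045.
obsoleted_by = {"RFC2068": "RFC2616", "RFC1521": "RFC2045"}

print(fold_obsolete(hits, obsoleted_by))
# {'RFC2616': 42600, 'RFC2045': 67100}
```

Simply adding the counts overstates the merged total whenever a single page cites both editions, so the folded figures are best read as an upper bound.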
The results of our study confirm some things that we already suspected:
- Relatively few RFCs concentrate most of the links on the Internet. That is expected for any sufficiently large collection of documents, and it should be no different with RFCs.
- The 'basic' RFCs - those of more general interest, and of broadest focus - are closer to the top of the list.
- With few exceptions, RFCs with too narrow a scope tend to appear at the bottom of the list.
The study also brought us a few surprises. For example, entry 21 (not shown in the list above) is RFC1483 - Multiprotocol Encapsulation over ATM, whose link count was probably boosted as a result of its relevance to broadband access equipment such as xDSL and cable modems. Many SNMP-related RFCs appear near the top of the list, showing the importance of SNMP and related MIBs.
We also regard this study as a starting point only. A lot more information can be extracted, both from the data saved from the queries already done, and even more from new queries. For example, the RFC index itself lists relevant cross-information between the RFCs, such as cross references and which RFCs were obsoleted by newer documents. All of this information could be taken into account to better understand the results.
Even in light of all its shortcomings, the list of the 'top 20' RFCs is a good starting point for students, as it shows which documents are currently relevant for the study of the Internet. We do recommend this list to any student who is looking for pointers to start reading the RFC collection.
[Editor Note] P.S.: Some people may point out that this study is worthy of an "Ig Nobel" prize. In fact, were it not for how easy the study was to perform - thanks to the fabulous Google API - I would have thought twice before working on it. Anyway, now that it's done, why not share the results with the community?
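As a rough sketch of how that cross-information might be extracted, the snippet below pulls relation annotations out of a single index entry. It assumes the entry carries parenthesized notes of the form "(Obsoletes ...)" and "(Obsoleted by ...)"; the sample entry is abbreviated for illustration, and `index_relations` is a hypothetical helper.

```python
import re

RELATION_LABELS = ("Obsoletes", "Obsoleted by", "Updates", "Updated by")


def index_relations(entry):
    """Extract cross-reference annotations from one RFC index entry."""
    relations = {}
    for label in RELATION_LABELS:
        # Look for a parenthesized note such as "(Obsoleted by RFC2822)".
        match = re.search(r"\(" + re.escape(label) + r" ([^)]*)\)", entry)
        if match:
            relations[label] = re.findall(r"RFC\d+", match.group(1))
    return relations


# Abbreviated sample entry; the annotation layout is an assumption.
sample = ("0822 Standard for the format of ARPA Internet text messages. "
          "(Obsoletes RFC0733) (Obsoleted by RFC2822)")

print(index_relations(sample))
# {'Obsoletes': ['RFC0733'], 'Obsoleted by': ['RFC2822']}
```

A mapping built this way could feed a follow-up analysis, for instance to see how references are split between an obsolete document and its replacement.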