As I write this, the Perl language is at version 5.6, and yet a large amount of Perl code written in the days of 4.x will still run happily with little to no modification. This is a tribute to the concept of backwards compatibility, something the mongers of Perl honour with reverence. But are there alternatives to backwards compatibility? The most obvious alternative is to throw away old code. Another is to not update your compiler or interpreter: one can happily use Perl 3 today, so long as one does not want the features and support of Perl 5.
It is unfortunate that languages sometimes die. The community around the Perl language is quite strong, so this may not apply to it, but in the case of small or application-specific languages, a lack of support can lead programmers to abandon the language. If one has a lot of money invested in source code written in such a language, it can be difficult or expensive to find programmers able to maintain or extend that source code.
A solution to both these problems presents itself in source-to-source translation. A program written in Perl 3 can be translated into a program written in Perl 5 by an automatic translator. A program written in an arcane or unsupported language (including assembly languages) can be translated into a language that is cheaper to maintain. The translator is likely to be somewhat limited and will probably have to be hand coded, although tools like compiler compilers can help. This leads me to my first dream tool:
readcode is a (non-existing) program which extracts the meaning of a program, given a specification of the language in which it is written. This program requires some explanation, so allow me to digress for a moment and define the acronym YINACC: Yacc Is Not A Compiler Compiler. At best, Yacc is a parser compiler. It takes a representation of Backus-Naur Form annotated with C code and generates a table-based parser. The parser depends on a scanner to break the input stream up into tokens; the program lex is usually used to make the scanner. Other tools exist that solve some of Yacc's problems: non-LALR and ambiguous grammars, EBNF notation, languages other than C. All of these have been an issue for me in the past, but the latter is, I think, the most telling.
A Yacc file is essentially a declaration of the syntax of a programming language as a set of rules and alternatives. The Yacc engine does some magic to transform the grammar into something that can be efficiently parsed and outputs a table-driven parser. To each rule in the file the programmer can attach "side effects" which are executed when the rule is matched. The side effects essentially define the semantics of the language (in the case of a one-pass compiler) or build some internal representation, which the compiler proper then transforms to turn the input program into something executable by a machine.
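To make the rule-plus-side-effect idea concrete, here is a minimal sketch in Python rather than Yacc and C: a hand-rolled recursive descent parser (not the table-driven machine Yacc would generate) for arithmetic expressions, where each rule's action evaluates the input as it is matched, just as a one-pass compiler would. The grammar and code are illustrative only.

```python
import re

def tokenize(src):
    # The scanner (lex's job): break the input stream into tokens.
    return re.findall(r"\d+|[+*()]", src)

def parse_expr(toks):
    # expr : term | expr '+' term      { $$ = $1 + $3; }
    value = parse_term(toks)
    while toks and toks[0] == "+":
        toks.pop(0)
        value += parse_term(toks)      # the rule's "side effect"
    return value

def parse_term(toks):
    # term : factor | term '*' factor  { $$ = $1 * $3; }
    value = parse_factor(toks)
    while toks and toks[0] == "*":
        toks.pop(0)
        value *= parse_factor(toks)
    return value

def parse_factor(toks):
    # factor : NUMBER | '(' expr ')'
    tok = toks.pop(0)
    if tok == "(":
        value = parse_expr(toks)
        toks.pop(0)                    # consume the closing ')'
        return value
    return int(tok)

print(parse_expr(tokenize("2+3*(4+1)")))   # prints 17
```

Replace the arithmetic in the actions with code that builds tree nodes instead, and you have the other mode: an internal representation for a multi-pass compiler to chew on.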
It is absolutely true that some compilers (if not all) have no real "understanding" of what the program means. However, some attempts at specification-based compilation (such as the New Jersey Machine-Code Toolkit and its related compiler projects, and to a lesser degree the GNU Compiler Collection) have blurred the truth of this assertion. They have done so by adding to the grammar file a specification of the semantics of the language as well as the syntax. One could say the specification tells the compiler how to read a given source file. The compiler forms an internal understanding of the program and can then manipulate it to perform sophisticated optimisations and generate outputs.
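A minimal sketch of that "internal understanding": a toy expression AST and a constant-folding pass over it. The node names are invented for illustration; real compilers do the same thing at far greater scale.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    left: object
    right: object

def fold(node):
    # Recursively simplify: once both operands of an Add are known
    # constants, the compiler can do the arithmetic at compile time.
    if isinstance(node, Add):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(left.value + right.value)
        return Add(left, right)
    return node

# (1 + 2) + x  folds to  3 + x
print(fold(Add(Add(Num(1), Num(2)), Var("x"))))
```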
This front end of a compiler on steroids has been called a "fact extractor" by some in the reverse engineering community. Although language-specific, a very promising project in this area is CPPX, a fact extractor for the C++ language based on GCC. There are a number of things wrong with CPPX, not the least of which is that it extracts facts from preprocessed C++ code. Any facts that might be gathered from preprocessing directives are lost, as are formatting and comments. For a compiler this is inconsequential, but other reverse engineering applications require exactly this kind of information, so CPPX isn't much use to them.
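For a flavour of what a fact extractor emits, here is a toy sketch over Python's own ast module, producing (entity, relation, entity) facts. Nothing here reflects CPPX's actual output format, and note that Python's ast, like CPPX's preprocessed input, has already discarded comments and formatting: exactly the complaint above.

```python
import ast

source = '''
def area(r):
    return 3.14159 * r * r

def main():
    print(area(2.0))
'''

tree = ast.parse(source)
for func in ast.walk(tree):
    if isinstance(func, ast.FunctionDef):
        # Emit one "calls" fact per direct call by name.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                print((func.name, "calls", node.func.id))
# prints ('main', 'calls', 'print') and ('main', 'calls', 'area')
```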
Surely there is a better language than C in which to specify the semantics of a programming language, but if we are to replace C in specification files, what should we replace it with? I think it is obvious that a declarative programming language is what we need, as clearly we are declaring the semantics of the language. This is, however, deceptive: the apparent completeness hides some unhappy truths about information loss. True, we can define the semantics of any language by giving the exact untyped lambda calculus term which, when instantiated and evaluated, will yield the correct result, but how much information about the source language are we losing? A language like Java hides many important semantics in its superficially simple syntax. The constraints of the type system, for example, vastly outweigh the imperative, object-oriented execution. Are we to add these constraints to the specification in this low-level format as well?
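To see how little such a low-level semantics preserves, here is a minimal sketch of an untyped lambda calculus evaluator, using a tuple encoding of terms I've invented for the purpose. The Church numeral below evaluates to the "correct result", but every trace of the source program's structure is gone.

```python
def evaluate(term, env):
    kind = term[0]
    if kind == "var":                          # ("var", name)
        return env[term[1]]
    if kind == "lam":                          # ("lam", param, body)
        return lambda arg: evaluate(term[2], {**env, term[1]: arg})
    if kind == "app":                          # ("app", function, argument)
        return evaluate(term[1], env)(evaluate(term[2], env))

# Church numeral two = \f.\x. f (f x), applied to successor and zero.
two = ("lam", "f", ("lam", "x",
      ("app", ("var", "f"), ("app", ("var", "f"), ("var", "x")))))
env = {"succ": lambda n: n + 1, "zero": 0}
print(evaluate(("app", ("app", two, ("var", "succ")), ("var", "zero")), env))
# prints 2 -- the meaning survives, but nothing of the source's shape does
```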
lang2lang is a (non-existing) program which translates any arbitrary computer language to any other computer language, maintaining comments, formatting and symbols wherever possible (a toy sketch of the idea follows the list below). There are some things of importance to note here:
- the tool does not discriminate between languages. If you give procedural code (say, C) to the tool and request a functional program be generated in some language such as Haskell, the tool is expected to be able to do it.
- the tool is good for reverse as well as forward engineering. Again, if I give it procedural code and ask for object-oriented code (say, C to Java), I expect to get some reasonable clustering of procedures into classes.
- the tool generates good code. What is "good code"? Can it be defined objectively, or is good code a purely subjective concept? Below I will have more to say on code quality, but for the moment let's just say that we can specify what good code is in some given language, and the tool is expected to meet these requirements.
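Here is the promised sketch: a deliberately trivial "translator" from an invented one-line assignment language to C-ish output, carrying names and comments across. The dream tool differs from this in every way that matters, but the input/output contract (same program, different language, comments intact) is the same.

```python
import re

def translate(source):
    out = []
    for line in source.splitlines():
        # An assignment, possibly followed by a '#' comment.
        m = re.match(r"\s*(\w+)\s*=\s*(.+?)\s*(#.*)?$", line)
        if m:
            name, expr, comment = m.groups()
            tail = "  // " + comment[1:].strip() if comment else ""
            out.append(f"int {name} = {expr};{tail}")
        elif line.strip().startswith("#"):
            # A comment-only line survives the translation too.
            out.append("// " + line.strip()[1:].strip())
    return "\n".join(out)

print(translate("# circle area, roughly\n"
                "r = 2\n"
                "area = 314 * r * r / 100  # fixed point"))
```

On the sample input it emits C-style declarations with both comments preserved, which is the one property of lang2lang this toy actually honours.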
One could imagine the tool being used in more situations than those outlined above. One can even imagine it being used on a daily basis. Perhaps I have written a lot of code and it occurs to me that a different language would have been more appropriate; today, changing languages mid-project is a significant investment and the pay-off is anything but measurable. Or perhaps I simply dislike, or don't know, a language in which I have inherited a large body of code.
coderate is a (non-existing) program which evaluates the quality of the source code of a program written in an arbitrary computer language. What is code quality? I would define it as the degree to which the source code of a program conveys its meaning/functionality/semantics to a reader of that source code. Some people define code quality as inversely proportional to the time it takes a skilled programmer who is unfamiliar with the source code to extend the program or fix a flaw in it. There are other definitions, some objective, some subjective, but one can envision the possibility that all are somehow machine-describable. That is, each of us with a concept of "good code" could write a specification for at least some of what we are looking for in a well written program.
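Here is one machine-describable fragment of such a specification, sketched in Python: score source by comment density and function length. The scoring rule and its weights are invented purely for illustration; they encode one opinion of good code, which is rather the point.

```python
import ast
import io
import tokenize

def rate(source):
    # Count comments (the tokenizer sees them; the ast does not).
    comments = sum(1 for tok in
                   tokenize.generate_tokens(io.StringIO(source).readline)
                   if tok.type == tokenize.COMMENT)
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    lengths = [n.end_lineno - n.lineno + 1 for n in funcs]
    avg_len = sum(lengths) / len(lengths) if lengths else 0
    lines = max(1, len(source.splitlines()))
    # Invented scoring rule: reward comments, penalise long functions.
    return round(10 * comments / lines - max(0, avg_len - 20) / 5, 2)

print(rate("def f(x):\n    # double it\n    return x * 2\n"))
```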
Some programs like this tool already exist, although they are largely language-specific and not specifiable. As already stated, opinions on code quality differ. The writer of lint, a static analyser of C programs, may have very different opinions from mine as to what makes good C code, and he or she appears to have no opinion at all on the quality of Java programs.
The use of this tool is more than academic. When making source code acquisitions (or IP-centric mergers) a company will often evaluate a program solely on its functionality. This can lead to disproportionate estimations of the acquisition's value when more money must later be spent adapting the source code to the needs of its new owner.
improvecode is a (non-existing) program which performs source-to-source translations to improve code quality. As a logical next step, how hard is it to transform code so as to maintain the same semantic meaning whilst maximising some specified metric? Compiler developers will immediately recognise this as the core of an optimising compiler. However, unlike an optimising compiler (most of which perform only local, conservative optimisations), the tool will understand the code to such a degree that it can perform not only interprocedural analysis but actual algorithmic complexity analysis.
Suppose you write a program which, for some reason or another, performs a naive search in a linked list for an element and returns the result. The complexity of using this data structure is suboptimal: a linear function of the number of elements to be searched. A better data structure would be a tree or a hash table. These are the kinds of concerns that typically dog programmers who care about performance. To perform the optimisation, a programmer must know something about the likely input to the program, i.e. the number of elements likely to be in the list, because at small element counts a linked list may actually be more efficient than a tree or a hash table. If we can provide this kind of information to our tool, it can perform the optimisation for us.
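A sketch of the knowledge improvecode would need, in Python: measure where the linear structure stops winning, here using Python's list (an array rather than a linked list, but still a linear search) against a hash-based set. The exact crossover varies by machine and workload; the point is that "better" depends on the expected input size.

```python
import timeit

for n in (10, 100, 100000):
    seq, table = list(range(n)), set(range(n))
    probe = n - 1                     # worst case for the linear search
    t_list = timeit.timeit(lambda: probe in seq, number=2000)
    t_set = timeit.timeit(lambda: probe in table, number=2000)
    print(f"n={n:6d}  list: {t_list:.5f}s  set: {t_set:.5f}s")
```

The list's time grows with n while the set's stays flat; at n=10 the difference is usually negligible, which is exactly the profile information the tool would weigh before rewriting anything.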
codequery is a (non-existing) program which answers questions about source code. The field of program understanding is a reverse engineering subject interested in developing techniques to help programmers understand source code faster or better. It draws from many fields including program visualisation, compilation and slicing technologies, debuggers and profilers. Although most of these tools are rarely used by programmers, their value is enormous to anyone who has inherited a large code base or is working on something simply too big to keep in one's head.
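As a taste of the kind of question such a tool might answer, here is a sketch in Python: build a crude call graph with the ast module and ask "which functions eventually reach parse?". The sample functions are made up.

```python
import ast

source = '''
def parse(text): ...
def load(path): return parse(open(path).read())
def refresh(path): return load(path)
'''

# Map each top-level function to the names it calls directly.
tree = ast.parse(source)
calls = {}
for func in tree.body:
    if isinstance(func, ast.FunctionDef):
        calls[func.name] = {n.func.id for n in ast.walk(func)
                            if isinstance(n, ast.Call)
                            and isinstance(n.func, ast.Name)}

def reaches(name, target, seen=()):
    # Transitive closure over the call graph, guarding against cycles.
    return target in calls.get(name, ()) or any(
        reaches(c, target, seen + (name,))
        for c in calls.get(name, ()) if c not in seen)

print([f for f in calls if reaches(f, "parse")])   # ['load', 'refresh']
```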
A field that has yet to be seriously incorporated into program understanding is computer reasoning. Indeed, a field similar to program understanding, text understanding, has been attacking the same problems from a different perspective. Text understanding has a much loftier goal than program understanding: rather than supplying tools that help humans understand content, it aims to write a program that, in some sense, itself understands the content.
For example, a text understanding program might be given a novel to read, and a series of questions then asked of it. Who did Alice follow into Wonderland? Why did Alice accept tea from the Mad Hatter? When did Alice's discontent with the inverted value system of Wonderland first become apparent? To answer all these questions the program must truly understand both the book and the question, and possess some vast store of common (and not so common) knowledge.
The major barriers to text understanding today lie in the difficulties of parsing the English language: ambiguity, missing information about inflection, tone, and so on. A vast array of techniques has been developed to overcome these problems (some of which, such as probabilistic parsing, could be applied to programming languages with interesting results), but perhaps the techniques already developed can be applied to languages without these problems. In fact, suggestions like this have been made in the text understanding community and have led to constructed languages such as Lojban, which has an unambiguous grammar, phonetic spelling and regular rules, and claims to be culturally neutral.
However, this raises another problem: there's nothing written in Lojban (yet), at least not enough to warrant the construction of a text understanding program focused on it. Languages that are easily parsed and do have a lot written in them are programming languages. Once we apply the principles of text understanding to programming languages we get something truly interesting. Perhaps we can even teach a program to code.
code2spec is a (non-existing) program which extracts a high level formal specification from source code. Again, the tool is language independent. I should state up front that this dream tool is in some ways already a reality: Software Migrations Ltd will, for a fee, take your assembler, C or COBOL code and transform it into a Wide Spectrum Language, which can then be abstracted into a formal specification. Formal specification people may be a little confused at this point. I seem to be advocating formal methods backwards: writing the specification after writing the code is bad enough, so generating it automatically must be tantamount to heresy. The use I see for such a tool is simple: summaries. Any tool which can take a million lines of code and generate a smaller specification stating what the program does is an aid to program understanding. The reverse engineering step tells me what the program is doing; if I then modify the specification, I expect to be able to "compile" it back down to code. Using such a tool I could fix a bug in a program without ever writing a line of code.
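A crude gesture in this direction, sketched in Python: read a function's assert statements as preconditions and its return expression as a postcondition. This is nothing like the Wide Spectrum Language process, just an illustration of code-to-specification in miniature; the sample function is invented.

```python
import ast

source = '''
def sqrt_floor(n):
    assert n >= 0
    r = 0
    while (r + 1) * (r + 1) <= n:
        r = r + 1
    return r
'''

func = ast.parse(source).body[0]
# Asserts at the top of the body read naturally as preconditions...
pre = [ast.unparse(s.test) for s in func.body if isinstance(s, ast.Assert)]
# ...and return expressions as (very weak) postconditions.
post = [ast.unparse(s.value) for s in ast.walk(func)
        if isinstance(s, ast.Return)]
print(f"{func.name}({', '.join(a.arg for a in func.args.args)})")
print("  requires:", " and ".join(pre))
print("  returns:", ", ".join(post))
```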
code4me is a (non-existing) program which writes code, fixes bugs, and bakes cookies for you. OK, I'm lying about the cookies, but that is about the level of credence such an idea is usually given, and not without just cause. To be a good (or even mediocre) programmer, a program would have to be able to read and understand not only source code but also the English language. Or would it? Using the previous tool we expect to be able to hand the program some useful, automatically generated information. Could we not add more information in machine readable formats? Be it higher or lower level than specifications, we can babysit our tool until it produces acceptable code. Using our coderate tool we can even automate the babysitting.
At this point a lot of people are fearing for their jobs. Is history doomed to repeat itself: as factory workers were (supposedly) ousted by robot workers, are programmers to be ousted by their programs? To answer this question I will turn to the old standby: creativity. Too often I hear that programmers are artists, yet as artists we spend most of our day hacking out code to do the same old things. Code reuse, dynamic programming, new programming languages: all are symptoms of us trying to throw off the shackles of actually having to program. If machines can write the programs for us, then what will we do, other than tell them what to write? We've answered our own question. What we'll do is tell them what to write, in a super-declarative fashion.
I've largely focused on source transformations, compilers, and what I suppose someone might call AI. This is because these are the thoughts that dominate my days, holed up as I am at the Centre for Software Maintenance at the University of Queensland. There I am working on a decompiler: a real life dream tool that gives the user a high level source view, and navigation, of a program's binary, i.e. one that has been compiled and whose source code is no longer available. The problem is not solvable in the general case (it's equivalent to the halting problem) but that's OK: a partial solution is better than no solution at all.