History of Decompilation

Decompilers have been written for a variety of applications since the development of the first compilers. The very first decompiler was written by Joel Donnelly in 1960 at the Naval Electronic Labs to decompile machine code to Neliac on a Remington Rand Univac M-460 Countess computer. The project was supervised by Professor Maurice Halstead, who worked on decompilation during the 1960s and 70s, and published techniques which form the basis for today's decompilers. It is for this reason that we dedicate this page to the memory of Prof Maurice Halstead and award him the title of Father of Decompilation.

Throughout the last decades, different uses have been given to decompilers. In the 1960s, decompilers were used to aid in the program conversion process from second to third generation computers; in this way, manpower would not be spent in the time-consuming task of rewriting programs for the third generation machines. During the 70s and 80s, decompilers were used for the portability of programs, documentation, debugging, re-creation of lost source code, and the modification of existing binaries. In the 90s, decompilers have become a reverse engineering tool capable of helping the user with such tasks as checking software for the existence of malicious code, checking that a compiler generates the right code, translation of binary programs from one machine to another, and understading of the implementation of a particular library function.

The following descriptions illustrate the best-known decompilers and/or research performed into decompiler topics by individual researchers or companies. Most of these descriptions first appeared in Cristina Cifuentes' PhD thesis Reverse Compilation Techniques. They are reproduced herein with permission from the author.

1960, 1962 D-Neliac decompiler
1963-7 The Lockheed Neliac decompiler
1966 W.Sassaman
1967 Autocoder to Cobol translator
1972-6 The Inverse Compiler
1973 C.R.Hollander PhD thesis
1973 B.C.Housel PhD thesis
1974 The Piler System
1974 F.L.Friedman PhD thesis
1974 Ultrasystems
1974 V.Schneider and G.Winiger
1977,1981,1988 Decompilation of Polish code
1978 G.L.Hopwood PhD thesis
1978 D.A.Workman
1981 Zebra
1982 Decompilation of DML programs
1982,1984 Forth Decompiler
1984 Dataflex Decompilers
1985 Software transport system
1988 J.Reuter's Decomp VAX BSD decompiler
1990 Austin Code Works' exe2c DOS decompiler
1991 PLM-90 decompiler
1991 O'Gorman's PhD thesis
1991-94 Decompiler compiler
1991-93 8086 C Decompiling System
1993 Alpha AXP Migration Tools
1993 Source/PROM Comparator
1994 The dcc decompiler
1999 A. Mycroft's Type Reconstruction for Decompilation

D-Neliac decompiler, 1960

As reported by Halstead in [Hals62], the Donnelly-Neliac (D-Neliac) decompiler was produced by J.K.Donnelly and H.Englander at the Navy Electronics Laboratory (NEL) in 1960. Neliac is an Algol-type language developed at the NEL in 1955. The D-Neliac decompiler produced Neliac code from machine code programs; different versions were written for the Remington Rand Univac M-460 Countess computer and the Control Data Corporation 1604 computer.

D-Neliac proved useful for converting non-Neliac compiled programs into Neliac, and for detecting logic errors in the original high-level program. This decompiler proved the feasibility of writing decompilers.

The Lockheed Neliac Decompiler, 1963-7

The Lockheed decompiler, also known as the IBM 7094 to Univac 1108 Neliac Decompiler helped in the migration of scientific applications from the IBM 7094 to the newer Univac 1108 at Lockheed Missiles and Space. Binary card images were translated to Univac 1108 Neliac, as well as assembly code (if available). Binary code was produced from Fortran applications. As reported in [Hals70].

Halstead analyzed the implementation effort required to raise the percentage of correctly decompiled instructions half way to 100%, and found that it was approximately equal to the effort already spent [Hals70]. This was because decompilers from that time handled straightforward cases, but the harder cases were left for the programmer to consider. In order to handle more cases, more time was required to code these special cases into the decompiler, and this time was proportionately greater than the time required to code simple cases.

W.Sassaman, 1966

Sassaman developed a decompiler at TRW Inc., to aid in the conversion process of programs from 2nd to 3rd generation computers. This decompiler took as input symbolic assembler programs for the IBM 7000 series and produced Fortran programs. Binary code was not chosen as input language because the information in the symbolic assembler was more useful. Fortran was a standard language in the 1960s and ran on both 2nd and 3rd generation computers. Engineering applications which involved algebraic algorithms were the type of programs decompiled. The user was required to define rules for the recognition of subroutines. The decompiler was 90% accurate, and some manual intervention was required [Sass66].

This decompiler makes use of assembler input programs rather than pure binary code. Assembler programs contain useful information in the form of names, macros, data and instructions, which are not available in binary or executable programs, and therefore eliminate the problem of separating data from instructions in the parsing phase of a decompiler.

Autocoder to Cobol Conversion Aid Program, 1967

Housel reported on a set of commercial decompilers developed by IBM to translate Autocoder programs, which were business data processing oriented, to Cobol. The translation was a one-to-one mapping and therefore manual optimization was required. The size of the final programs occupied 2.1% times the core storage of the original program [Hous73].

This decompiler is really a translation tool of one language to another. No attempt is made to analyze the program and reduce the number of instructions generated. Inefficient code was produced in general.

The Inverse Compiler, 1972-6

The Inverse compiler, also known as the Sperry*Univac 494 to 1100 inverse compiler, was developed at Univac to aid in the migration to newer machines, including business applications (i.e. decompilation to COBOL). This decompiler was based on the Lockheed Neliac decompiler.

C.R.Hollander, 1973

Hollander's PhD dissertation [Holl73] describes a decompiler designed around a formal syntax-oriented metalanguage, and consisting of 5 cooperating sequential processes; initializer, scanner, parser, constructor, and generator; each implemented as an interpreter of sets of metarules. The decompiler was a metasystem that defined its operations by implementing interpreters.

The initializer loads the program and converts it into an internal representation. The scanner interacts with the initializer when finding the fields of an instruction, and interacts with the parser when matching source code templates against instructions. The parser establishes the correspondence between syntactic phrases in the source language and their semantic equivalents in the target language. Finally, the constructor and generator generate code for the final program.

An experimental decompiler was implemented to translate a subset of IBM's System/360 assembler into an Algol-like target language. This decompiler was written in Algol-W, a compiler developed at Stanford University, and worked correctly on the 10 programs it was tested against.

This work presents a novel approach to decompilation, by means of a formal syntax-oriented metalanguage, but its main drawback is precisely this methodology, which is equivalent to a pattern-matching operation of assembler instructions into high-level instructions. This limits the amount of assembler instructions that can be decompiled, as instructions that belong to a pattern need to be in a particular order to be recognized; intermediate instructions, different control flow patterns, or optimized code is not allowed. In order for syntax-oriented decompilers to work, the set of all possible patterns would need to be enumerated for each high-level instruction of each different compiler. Another approach would be to write a decompiler for a specific compiler, and make use of the specifications of that compiler; this approach is only possible if the compiler writer is willing to reveal the specifications of his compiler. It appears that Hollander's decompiler worked because the compiler specifications for the Algol-W compiler that he was using were known, as this compiler was written at the University where he was doing this research. The set of assembler instructions generated for a particular Algol-W instruction were known in this case.

B.C.Housel, 1973

Housel's PhD dissertation [Hous73] describes a clear approach to decompilation by borrowing concepts from compiler, graph, and optimization theory. His decompiler involves 3 major phases: partial assembly, analyzer, and code generation.

The partial assembly phase separates data from instructions, builds a control flow graph, and generates an intermediate representation of the program. The analyzer analyzes the program in order to detect program loops and eliminate unnecessary intermediate instructions. Finally, the code generator optimizes the translation of arithmetic expressions, and generates code for the target language.

An experimental decompiler was written for Knuth's MIX assembler (MIXAL), producing PL/1 code for the IBM 370 machines. 6 programs were tested, 88% of the instructions were correct, and the remaining 12% of the instructions required manual intervention [Hous73b].

This decompiler proved that by using known compiler and graph methods, a decompiler could be written that produced good high-level code. The use of an intermediate representation made the analysis completely machine independent. The main objection to this methodology is the choice of source language, MIX assembler, not only for the greater amount of information available in these programs, but for being a simplified non-real-life assembler language.

The Piler System, 1974

Barbe's Piler system attempts to be a general decompiler that translates a large class of source--target language pairs to help in the automatic translation of computer programs. The Piler system was composed of three phases: interpretation, analysis, and conversion. In this way, different interpreters could be written for different source machine languages, and different converters could be written for different target high-level languages, making it simple to write decompilers for different source--target language pairs. Other uses for this decompiler included documentation, debugging aid, and evaluation of the code generated by a compiler.

During interpretation, the source machine program was loaded into memory, parsed and converted into a 3-address microform representation. This meant that each machine instruction required one or more microform instructions. The analyzer determined the logical structure of the program by means of data flow analysis, and modified the microform representation to an intermediate representation. A flowchart of the program after this analysis was made available to users, and they could even modify the flowchart, if there were any errors, on behalf of the decompiler. Finally, the converter generated code for the target high-level language [Barb74].

Although the Piler system attempted to be a general decompiler, only an interpreter for machine language of the GE/Honeywell 600 computer was written, and skeletal converters for Univac 1108's Fortran and Cobol were developed. The main effort of this project concentrated on the analyzer.

The Piler system was a first attempt at a general decompiler for a large class of source and target languages. Its main problem was to attempt to be general enough with the use of a microform representation, which was even lower-level than an assembler-type representation.

F.L.Friedman, 1974

Friedman's PhD dissertation describes a decompiler used for the transfer of mini-computer operating systems within the same architectural class [Frie74]. Four main phases are described: pre-processor, decompiler, code generator, and compiler.

The pre-processor converts assembler code into a standard form (descriptive assembler language). The decompiler takes the standard assembler form, analyses it, and decompiles it into an internal representation, from which FRECL code is then generated by the code generator. Finally, a FRECL compiler compiles this program into machine code for another machine. FRECL is a high-level language for program transport and development; it was developed by Friedman, who also wrote a compiler for it. The decompiler used in this project was an adaptation of Housel's decompiler [Hous73].

Two experiments were performed; the first one involved the transport of a small but self-contained portion of the IBM 1130 Disk Monitor System to Microdata 1600/21; up to 33% manual intervention was required on the input assembler programs. Overall, the amount of effort required to prepare the code for input to the transport system was too great to be completed in a reasonable amount of time; therefore, a second experiment was conducted. The second experiment decompiled Microdata 1621 operating system programs into FRECL and compiled them back again into Microdata 1621 machine code. Some of the resultant programs were re-inserted into the operating system and tested. On average, only 2% of the input assembler instructions required manual intervention, but the final machine program had a 194% increase in the number of machine instructions.

This dissertation is a first attempt at decompiling operating system code, and it illustrates the difficulties faced by the decompiler when decompiling machine-dependent code. Input programs to this transport system require a large amount of effort to be presented in the format required by the system, and the final produced programs appear to be inefficient; both in the size of the program and the time to execute many more machine instructions.

Ultrasystems, 1974

Hopwood reported on a decompilation project at Ultrasystems, Inc., in which he was a consultant for the design of the system [Hopw78]. This decompiler was to be used as a documentation tool for the Trident submarine fire control software system. It took as input Trident assembler programs, and produced programs in the Trident High-Level Language (THLL) that was being developed at this company. Four main stages were distinguished: normalization, analysis, expression condensation, and code generation.

The input assembler programs were normalized so that data areas were distinguished with pseudo-instructions. An intermediate representation was generated, and the data analyzed. Arithmetic and logical expressions were built during a process of expression condensation, and finally, the output high-level language program was generated by matching control structures to those available in THLL.

This project attempts to document assembler programs by converting them into high-level language. The fact is, given the time constraints of the project, the expression condensation phase was not coded, and therefore the output programs were hard to read, as several instructions were required for a single expression.

V.Schneider and G.Winiger, 1974

Schneider and Winiger presented a notation for specifying the compilation and decompilation of high-level languages. By defining a context-free grammar for the compilation process (i.e. describe all possible 2-address object code produced from expressions and assignments), the paper shows how this grammar can be inverted to decompile the object code into the original source program [Schn74]. Even more, an ambiguous compilation grammar will produce optimal object code, and will generate an unambiguous decompilation grammar. A case study showed that the object code produced by the Algol 60 constructs could not be decompiled deterministically. This work was part of a future decompiler, but further references in the literature about this work were not found.

This work presents, in a different way, a syntax-oriented decompiler [Holl73]; that is, a decompiler that uses pattern matching of a series of object instructions to reconstruct the original source program. In this case, the compilation grammar needs to be known in order to invert the grammar and generate a decompilation grammar. Note that no optimization is possible if it is not defined as part of the compilation grammar.

Decompilation of Polish code, 1977, 1981, 1988

Two papers in the area of decompilation of Polish code into Basic code are found in the literature. The problem arises in connection with highly interactive systems, where a fast response is required to every input from the user. The user's program is kept in an intermediate form, and then ``decompiled'' each time a command is issued. An algorithm for the translation of reverse Polish notation to expressions is given in [Balb79].

The second paper presents the process of decompilation as a two step problem: the need to convert machine code to Polish representation, and the conversion of Polish code to source form. The paper concentrates on the second step of the decompilation problem, but yet claims to be decompiling Polish code to Basic code by means of a context-free grammar for Polish notation and a left-to-right or right-to-left parsing scheme [Bert81].

This technique was recently used in a decompiler that converted reverse Polish code into spreadsheet expressions [May88]. In this case, the programmers of a product that included a spreadsheet-like component wanted to speed up the product by storing user's expressions in a compiled form, reverse Polish notation in this case, and decompile these expressions whenever the user wanted to see or modify them. Parentheses were left as part of the reverse Polish notation to reconstruct the exact same expression the user had input to the system.

The use of the word decompilation in this sense is a misuse of the term. All that is being presented in these papers is a method for re-constructing or deparsing the original expression (written in Basic or Spreadsheet expressions) given an intermediate Polish representation of a program. In the case of the Polish to Basic translators, no explanation is given as to how to arrive at such an intermediate representation given a machine program.

G.L.Hopwood, 1978

Hopwood's PhD dissertation [Hopw78] describes a 7-step decompiler designed for the purposes of transferability and documentation. It is stated that the decompilation process can be aided by manual intervention or other external information.

The input program to the decompiler is formatted by a preprocessor, then loaded into memory, and a control flow graph of the program is built. The nodes of this graph represent one instruction. After constructing the graph, control patterns are recognized, and instructions that generate a goto statement are eliminated by the use of either node splitting or the introduction of synthetic variables. The source program is then translated into an intermediate machine independent code, and analysis of variable usage is performed on this representation in order to find expressions and eliminate unnecessary variables by a method of forward substitution. Finally, code is generated for each intermediate instruction, functions are implemented to represent operations not supported by the target language, and comments are provided. Manual intervention was required to prepare the input data, provide additional information that the decompiler needed during the translation process, and to make modifications to the target program.

An experimental decompiler was written for the Varian Data machines 620/i. It decompiled assembler into MOL620, a machine-oriented language developed at University of California at Irvine by M.D.Hopwood and the author. The decompiler was tested with a large debugger program, Isadora, which was written in assembler. The generated decompiled program was manually modified to recompile it into machine code, as there were calls to interrupt service routines, self-modifying code, and extra registers used for subroutine calls. The final program was better documented than the original assembler program.

The main drawbacks of this research are the granularity of the control flow graph and the use of registers in the final target program. In the former case, Hopwood chose to build control flow graphs that had one node per instruction; this means that the size of the control flow graph is quite large for large programs, and there is no benefit gained as opposed to using nodes that are basic blocks (i.e. the size of the nodes is dependent on the number of changes of flow of control). In the latter case, the MOL620 language allows for the use of machine registers, and sample code illustrated in Hopwood's dissertation shows that registers were used as part of expressions and arguments to subroutine calls. The concept of registers is not a high-level concept available in high-level languages, and it should not be used if wanting to generate high-level code.

D.A.Workman, 1978

This work describes the use of decompilation in the design of a high-level language suitable for real time training device systems, in particular the F4 trainer aircraft [Work78]. The operating system of the F4 was written in assembler, and it was therefore the input language to this decompiler. The output language was not determined as this project was to design one, thus code generation was not implemented.

Two phases of the decompiler were implemented: the first phase, which mapped the assembler to an intermediate language and gathered statistics about the source program, and the second phase, which generated a control flow graph of basic blocks, classified the instructions according to their probable type, and analyzed the flow of control in order to determine high-level control structures. The results indicated the need of a high-level language that handled bit strings, supported looping and conditional control structures, and did not require dynamic data structures or recursion.

This work presents a novel use of decompilation techniques, although the input language was not machine code but assembler. A simple data analysis was done by classifying instructions, but did not attempt to analyze them completely as there was no need to generate high-level code. The analysis of the control flow is complete and considers 8 different categories of loops and 2-way conditional statements.

Zebra, 1981

The Zebra prototype was developed at the Naval Underwater Systems Centre in an attempt to achieve portability of assembler programs. Zebra took as input a subset of the ULTRA/32 assembler, called AN/UYK-7, and produced assembler for the PDP11/70. The project was described by D.L.Brinkley in [Brin81].

The Zebra decompiler was composed of 3 passes: a lexical and flow analysis pass, which parsed the program and performed control flow analysis in the graph of basic blocks. The second pass was concerned with the translation of the program to an intermediate form, and the third pass simplified the intermediate representation by eliminating extraneous loads and stores, in much the same way described by Housel [Hous73, Hous73b]. It was concluded that it was hard to capture the semantics of the program and that decompilation was economically impractical, but it could aid in the transportation process.

This project made use of known technology to develop a decompiler of assembler programs. No new concepts were introduced by this research, but it raised the point that decompilation is to be used as a tool to aid in the solution of a problem, but not as tool that will give all solutions to the problem, given that a 100% correct decompiler cannot be built.

Decompilation of DML programs, 1982

A decompiler of database code was designed to convert a subset of Codasyl DML programs, written with procedural operations, into a relational system with a nonprocedural query specification. An Access Path Model is introduced to interpret the semantic accesses performed by the program. In order to determine how FIND operations implement semantic accesses, a global data flow reaching analysis is performed on the control flow graph, and operations are matched to templates. The final graph structures are remapped into a relational structure. This method depends on the logical order of the objects and a standard ordering of the DML statements [Katz82].

Another decompiler of database code was proposed to decompile well-coded application programs into a proposed semantic representation is described in [Dors82]. This work was induced by changes in the use requirements of a Database Management System (DBMS), where application programs were written in Cobol-DML. A decompiler of Cobol-DML programs was written to analyse and convert application programs into a model and schema-independent representation. This representation was later modified or restructured to account for database changes. Language templates were used to match against key instructions of a Cobol-DML programs.

In the context of databases, decompilation is viewed as the process of grouping a sequence of statements which represent a query into another (nonprocedural) specification. Data flow analysis is required, but all other stages of a decompiler are not implemented for this type of application.

Forth Decompiler, 1982, 1984

A recursive Forth decompiler is a tool that scans through a compiled dictionary entry and decompiles words into primitives and addresses [Dudl82]. Such a decompiler is considered one of the most useful tools in the Forth toolbox [Hill84]. The decompiler implements a recursive descent parser so that decompiled words can be decompiled in a recursive fashion.

These works present a deparsing tool rather than a decompiler. The tool recursively scans through a dictionary table and returns the primitives or addresses associated with a given word.

Dataflex Decompilers, 1984

DataFlex is a macro-based language. Some macros include over 80 DataFlex commands in one macro command. The Database Managers company Dataflex Decompilers have the capability of recovering the highest-level macro command instead of the low-level commands that compose such a macro. The techniques used in this decompiler include pattern matching and the recovery of control structures such as if's and loops. The generated code is functionally equivalent to the original source and is guaranteed to be recompilable without changes.

Software Transport System, 1985

C.W.Yoo describes an automatic Software Transport System (STS) that moves assembler code from one machine to another. The process involves the decompilation of an assembler program for machine m1 to a high-level language, and the compilation of this program in a machine m2 to assembler. An experimental decompiler was developed on the Intel 8080 architecture; it took as input assembler programs and produced PL/M programs. The recompiled PL/M programs were up to 23% more efficient than their assembler counterpart. An experimental STS was developed to develop a C cross-compiler for the Z-80 processor. The project encountered problems in the lack of data type in the STS [Woo85].

The STS took as input an assembler program for machine m1 and an assembler grammar for machine m2, and produced an assembler program for machine m2. The input grammar was parsed and produced tables used by the abstract syntax tree parser to parse the input assembler program and generate an abstract syntax tree (AST) of the program. This AST was the input to the decompiler, which then performed control and data flow analyses, in much the same way described by Hollander [Holl73], Friedman [Frie74], and Barbe [Barb74], and finally generated high-level code. The high-level language was then compiled for machine m2.

This work does not present any new research into the decompilation area, but it does present a novel approach to the transportation of assembler programs by means of a grammar describing the assembler instructions of the target architecture.

Decomp, 1988

J.Reuter wrote decomp, a decompiler for the Vax BSD 4.2 which took as input object files with symbolic information and produced C-like programs. The nature of this decompiler was to port the Empire game to the VMS environment, given that source code was not available. The decompiler is freely available on the Internet [Reut88].

Decomp made use of the symbol table to find the entry points to functions, determine data used in the program, and the names of that data. Subroutines were decompiled one at a time, in the following way: a control flow graph of basic blocks was built and optimised by the removal of arcs leading to intermediate unconditional branches. Control flow analysis was performed in the graph to find high-level control constructs, converting the control flow graph into a tree of generic constructs. The algorithm used by this analysis was taken from the struct program, a program that structures graphs produced by Fortran programs, which was based on the structuring algorithm described by B.Baker in [Bake77]. Finally, the generic constructs in the tree were converted to C-specific constructs, and code was generated. The final output programs required manual modifications to place the arguments on the procedure's argument list, and determine that a subroutine returned a value (i.e. was a function). This decompiler was written in about 5 man-months [Reut91].

Sample programs were written and compiled in C in a Vax BSD 4.2 machine. The resulting C programs are not compilable, and require some hand editing. The programs have the correct control structures, due to the structuring algorithm implemented, and the right data type of variables, due to the embedded symbol table in the object code. The names of library routines and procedures, and the user's program entry point are also known from the symbol table; therefore, no extraneous procedures (e.g. compiler start up code, library routines) are decompiled. The need for a data flow analysis stage is vital, though, as neither expressions, actual arguments, nor function return value are determined. An interprocedural data flow analysis would eliminate much of the hand-editing required to recompile the output programs.

exe2c, 1990

The Austin Code Works sponsored the development of the exe2c decompiler, targetted at the PC compatible family of computers running the DOS operating system [Aust91]. The project was announced in April 1990 [Guth90], tested by about 20 people, and it was decided that it needed some more work to decompile in C. A year later, the project reached a beta operational level [Guth91a], but was never finished [Guth91b]. I was a beta tester of this release.

exe2c is a multipass decompiler that consists of 3 programs: e2a, a2aparse, and e2c. e2a is the disassembler. It converts executable files to assembler, and produces a commented assembler listing as well. e2aparse is the assembler to C front-end processor, which analyzes the assembler file produced by e2a and generates .cod and .glb files. Finally, the e2c program translates the files prepared by a2aparse and generates pseudo-C. An integrated environment, envmnu, is also provided.

Programs decompiled by exe2c make use of a header file that defines registers, types and macros. The output C programs are hard to understand because they rely on registers and condition codes (represented by Boolean variables). Normally, one machine instruction is decompiled into one or more C instructions that perform the required operation on registers, and set up condition codes if required by the instruction. Expressions and arguments to subroutines are not determined, and a local stack is used for the final C programs. It is obvious from this output code that a data flow analysis was not implemented in exe2c. This decompiler has implemented a control flow analysis stage; looping and conditional constructs are available. The choice of control constructs is generally adequate. Case tables are not detected correctly, though. The number and type of procedures decompiled shows that all library routines, and compiler start-up code and runtime support routines found in the program are decompiled. The nature of these routines is normally low-level, as they are normally written in assembler. These routines are hard to decompile as, in most cases, there is no high-level counterpart (unless it is low-level type C code).

This decompiler is a first effort in many years to decompile executable files. The results show that a data flow analysis and heuristics are required to produce better C code. Also, a mechanism to skip all extraneous code introduced by the compiler and to detect library subroutines would be beneficial.

PLM-80 Decompiler, 1991

The Information Technology Division of the Australian Department of Defence researched into decompilation for defence applications, such as maintenance of obsolete code, production of scientific and technical intelligence, and assessment of systems for hazards to safety or security. This work was described by S.T. Hood in [Hood91].

Techniques for the construction of decompilers using definite-clause grammars, an extension of context-free grammars, in a Prolog environment are described. A Prolog database is used to store the initial assembler code and the recognised syntactic structures of the grammar. A prototype decompiler for Intel 8085 assembler programs compiled by a PLM-80 compiler was written in Prolog. The decompiler produced target programs in Small-C, a subset of the C language. The definite-clause grammar given in this report was capable of recognizing if-then control structures, and while loops, as well as static (global) and automatic (local) variables of simple types (i.e. character, integers, and longs). A graphical user interface was written to display the assembler and pseudo-C programs, and to enable the user to assign variable names, and comments. This interface also asked the user for the entry point to the main program, and allowed him to select the control construct to be recognized.

The analysis performed by this decompiler is limited to the recognition of control structures and simple data types. No analysis on the use of registers is done or mentioned. Automatic variables are represented by an indexed variable that represents the stack. The graphical interface helps the user document the decompiled program by means of comments and meaningful variable names. This analysis does not support optimized code.

J. O'Gorman PhD thesis, 1991

The Systematic Decompilation thesis by John O'Gorman [OGor91], University of Limerick, describes a pattern matching technique used for decompiling VAX binaries into Pascal source code. The technique requires the availability of the compiler used, performs a coverage of constructs available in the language, and creates small test programs that use the constructs, in order to derive the patterns of machine code used for each high-level construct. When decompiling a Pascal executable, the patterns are matched to determine which Pascal construct to recreated. Unoptimized code is used.

The thesis is available for downloading (ftp) in postscript format: Technical Report UL-91-12.ps.

Decompiler compiler, 1991-1994

A decompiler compiler is a tool that takes as input a compiler specification and the corresponding portions of object code, and returns the code for a decompiler; i.e. it is an automatic way of generating decompilers, much in the same way that yacc is used to generate compilers [Bowe91a, Bowe91b, Breu94].

Two approaches are described to generate such a decompiler compiler: a logic and a functional programming approach. The former approach makes use of the bidirectionality of logic programming languages such as Prolog, and runs the specification of the compiler backwards to obtain a decompiler [Bowe91a, Bowe91b, Bowe93b]. In theory this is correct, but in practice this approach is limited to the implementation of the Prolog interpreter, and therefore problems of strictness and reversibility are encountered [Breu92, Breu93]. The latter approach is based on the logic approach but makes use of lazy functional programming languages like Haskell, to generate a more efficient decompiler [Bowe91a, Bowe91b, Bowe93b]. Even if a non-lazy functional language is to be used, laziness can be simulated in the form of objects rather than lists.

The decompiler produced by a decompiler compiler will take as input object code and return a list of source codes that can be compiled to the given object code. In order to achieve this, an enumeration of all possible source codes would be required, given a description of an arbitrary inherited attribute grammar. It is proved that such an enumeration is equivalent to the Halting Problem [Breu92, Breu93], and is therefore non-computable. Even further, there is no computable method which takes an attribute grammar description and decides whether or not the compiled code will give a terminating enumeration for a given value of the attribute [Breu92, Breu93], so it is not straightforward which grammars can be used. Therefore, the class of grammars acceptable to this method needs to be restricted to those that produce a complete enumeration, such as non left-recursive grammars.

An implementation of this method was firstly done for a subset of an Occam-like language using a functional programming language. The decompiler grammar was an inherited attribute grammar which took the intended object code as an argument [Breu92, Breu93]. A Prolog decompiler was also described based on the compiler specification. This decompiler applied the clauses of the compiler in a selective and ordered way, so that the problem of non-termination would not be met, and only a subset of the source code programs would be returned (rather than an infinite list) [Bowe91c, Bowe93]. Recently, this method made use of an imperative programming language, C++, due to the inefficiencies of the functional and logic approach. In this prototype, C++ object's were used as lazy lists, and a set of library functions was written to implement the operators of the intermediate representation used [Breu94]. Problems with optimized code have been detected.

As illustrated by this research, decompiler compilers can be constructed automatically if the set of compiler specifications and object code produced for each clause of the specification is known. In general, this is not the case as compiler writers do not disclose their compiler specifications. Only customized compilers and decompilers can be built by this method. It is also noted that optimizations produced by the optimization stage of a compiler are not handled by this method, and that real executable programs cannot be decompiled by the decompilers generated by the method described. The problem of separating instructions from data is not addressed, nor is the problem of determining the data types of variables used in the executable program. In conclusion, decompiler compilers can be generated automatically if the object code produced by a compiler is known, but the generated decompilers cannot decompile arbitrary executable programs.

8086 C Decompiling System, 1991--1993

This decompiler takes as input executable files from a DOS environment and produces C programs. The input files need to be compiled with Microsoft C version 5.0 in the small memory model [Fuan93]. Five phases were described: recognition of library functions, symbolic execution, recognition of data types, program transformation, and C code generation. The recognition of library functions and intermediate language was further described in [Fuan91, Hung91].

The recognition of library functions for Microsoft C was done to eliminate subroutines that were part of a library, and therefore produce C code for only the user routines. A table of C library functions is built-into the decompiling system. For each library function, its name, characteristic code (sequence of instructions that distinguish this function from any other function), number of instructions in the characteristic code, and method to recognize the function were stored. This was done manually by the decompiler writer. The symbolic execution translated machine instructions to intermediate instructions, and represented each instruction in terms of its symbolic contents. The recognition of data types is done by a set of rules for the collection of information on different data types and analysis rules to determine the data type in use. The program transformation transforms storage calculation into address expressions, e.g. array addressing. Finally, the C code generator transforms the program structure by finding control structures, and generates C code.

This decompiling system makes use of library function recognition to generate more readable C programs. The method of library recognition is hand-crafted, and therefore inefficient if other versions of the compiler, other memory models, or other compilers were used to compile the original programs. The recognition of data types is a first attempt to recognize types of arrays, pointers and structures, but not much detail is given in the paper. No description is given as to how an address expression is reached in the intermediate code, and no examples are given to show the quality of the final C programs.

Alpha AXP Migration Tools, 1993

When Digital Equipment Corporation designed the Alpha AXP architecture, the AXP team got involved in a project to run existing VAX and MIPS code on the new Alpha AXP computers. They opted for a binary translator which would convert a sequence of instructions of the old architecture into a sequence of instructions of the new architecture. The process needed to be fully automatic and to cater for code created or modified during execution. Two parts to the migration process were defined: a binary translation, and a runtime environment [Site93].

The binary translation phase took binary programs and translated them into AXP opcodes. It made use of decompilation techniques to understand the underlying meaning of the machine instructions. Condition code usage analysis was performed as these conditions do not exist on the Alpha architecture. The code was also analyzed to determine function return values and find bugs (e.g. uninitialized variables). MIPS has standard library routines which are embedded in the binary program. In this case, a pattern matching algorithm was used to detect routines that were library routines, such routines were not analysed but replaced by their name. Idioms were also found and replaced by an optimal instruction sequence. Finally, code was generated in the form of AXP opcodes. The new binary file had both, the new code and the old code.

The runtime environment executes the translated code and acts as a bridge between the new and old operating systems (e.g. different calling standards, exception handling). It had a built-in interpreter of old code to run old code not discovered or nonexistent at translation time. This was possible because the old code was also saved as part of the new binary file.

Two binary translators were written: VEST, to translate from the OpenVMS VAX system to the OpenVMS AXP system, and mx, to translate ULTRIX MIPS images to DEC OSF/1 AXP images. The runtime environments for these translators were TIE and mxr respectively.

This project illustrates the use of decompilation techniques in a modern translation system. It proved successful for a large class of binary programs. Some of the programs that could not be translated were programs that were technically infeasible to translate, such as programs that use privileged opcodes, or run with superuser privileges.

Source/PROM Comparator, 1993

A tool to demonstrate the equivalence of source code and PROM contents was developed at the Nuclear Electric plc, UK, to verify the correct translation of PL/M-86 programs into PROM programs executed by safety critical computer controlled systems [Pave93].

Three stages are identified: the reconstitution of object code files from the PROM files, the disassembly of object code to an assembler-like form with help from a name-table built up from the source code, and decompilation of assembler programs and comparison with the original source code. In the decompiling stage, it was noted that it was necessary to eliminate intermediate jumps, registers and stack operations, identify procedure arguments, resolve indexes of structures, arrays and pointers, and convert the expresssions to a normal form. In order to compare the original program and the decompiled program, an intermediate language was used. The source program was translated to this language with the use of a commercial product, and the output of the decompilation stage was written in the same language. The project proved to be a practical way of verifying the correctness of translated code, and to demonstrate that the tools used to create the programs (compiler, linker, optimizer) behave reliably for the particular safety system analyzed.

This project describes a use of decompilation techniques, to help demonstrate the equivalence of high-level and low-level code in a safety-critical system. The decompilation stage performs much of the analysis, with help from a symbol table constructed from the original source program. The task is simplified by the knowledge of the compiler used to compile the high-level programs.

new A. Mycroft's Type Reconstruction for Decompilation, 1999

One of the hardest problems to solve in decompilation is that of recovering high-level data types from machine code in a correct way. Such types include structures, arrays and more. In this work, Alan Mycroft presents a system for infering high-level types from assembler-based (RTL) code [Mycr99]. Alan's type inference system is based on Milner's work for ML. The paper presents a type system, the constraints on types and worked-through examples that include structures and arrays as part of their output. Experimental results for the system are not available.

This is the best type system that I am aware of that currently deals with recoverying high-level data types in a machine-independent way, as it is based on RTLs and makes no unreasonable assumptions on the shape of the RTLs. Implementation results are needed in order to determine how feasible this system is in real practice.


References

Hals62
M.H. Halstead. Machine-independent computer programming, Chapter 11, pages 143-150. Spartan Books, 1962.

Hals67
M.H. Halstead. Machine independence and third generation computers. In Proceedings SJCC (Sprint Joint Computer Conference), pages 587-592, 1967.

Hals70
M.H. Halstead. Using the computer for program conversion. Datamation, pages 125-129, May 1970.

Sass66
W.A. Sassaman. A computer program to translate machine language into Fortran. In Proceedings SJCC, pages 235-239, 1966.

Hous73
B.C. Housel. A Study of Decompiling Machine Languages into High-Level Machine Independent Languages. PhD dissertation, Purdue University, Computer Science, August 1973.

Hous73b
B.C. Housel and M.H. Halstead. A methodology for machine language decompilation. Technical Report RJ 1316 (#20557), Purdue University, Department of Computer Science, December 1973.

Holl73
C.R. Hollander. Decompilation of Object Programs. PhD dissertation, Stanford University, Computer Science, January 1973.

Barb74
P. Barbe. The Piler system of computer program translation. Technical report, Probe Consultants Inc., September 1974.

Frie74
F.L. Friedman. Decompilation and the Transfer of Mini-Computer Operating Systems. PhD dissertation, Purdue University, Computer Science, August 1974.

Schn74
V. Schneider and G. Winiger. Translation grammars for compilation and decompilation. BIT, 14:78-86, 1974.

Bake77
B.S. Baker. An algorithm for structuring flowgraphs. Journal of the ACM, 24(1):98-120, January 1977.

Hopw78
G.L. Hopwood. Decompilation. PhD dissertation, University of California, Irvine, Computer Science, 1978.

Work78
D.A. Workman. Language design using decompilation. Technical report, University of Central Florida, December 1978.

Balb79
D. Balbinot and L. Petrone. Decompilation of Polish code in Basic. Rivista di Informatica, 9(4):329-335, October 1979.

Brin81
D.L. Brinkley. Intercomputer transportation of assembly language software through decompilation. Technical report, Naval Underwater Systems Center, October 1981.

Bert81
M.N. Bert and L. Petrone. Decompiling context-free languages from their Polish-like representations. pages 35-57, 1981.

Katz82
R.H. Katz and E. Wong. Decompiling CODASYL DML into relational queries. ACM Transactions on Database Systems, 7(1):1-23, March 1982.

Dors82
L.M. Dorsey and S.Y. Su. The decompilation of COBOL-DML programs for the purpose of program conversion. In Proceedings of COMPSAC 82. IEEE Computer Society's Sixth International Computer Software and Applications Conference, pages 495-504, Chicago, USA, November 1982. IEEE.

Dudl82
R. Dudley. A recursive decompiler. FORTH Dimensions, 4(2):28, Jul-Aug 1982.

Hill84
N.L. Hills and D. Moines. Revisited: Recursive decompiler. FORTH Dimensions, 5(6):16-18, Mar-Apr 1984.

Woo85
C.W. Yoo. An approach to the transportation of computer software. Information Processing Letters, 21:153-157, September 1985.

May88
W. May. A simple decompiler. Dr.Dobb's Journal, pages 50-52, June 1988.

Reut88
J. Reuter. URL: ftp//cs.washington.edu/ pub/decomp.tar.z. Public domain software, 1988.

Guth90
S. Guthery. exe2c. News item in comp.compilers USENET newsgroup, 30 Apr 1990.

Guth91a
S. Guthery. exe2c. News item in comp.compilers USENET newsgroup, 23 Apr 1991.

Reut91
J. Reuter. Private communication. Email, 1991.

Aust91
Austin Code Works. exe2c. Beta release, 1991. Email: info@acw.com.

Bowe91a
J.P. Bowen, P.T. Breuer, and K.C. Lano. The REDO project: Final report. Technical Report PRG-TR-23-91, Oxford University Computing Laboratory, 11 Keble Road, Oxford OX1 3QD, December 1991.

Bowe91b
J. Bowen and P. Breuer. Decompilation techniques. Internal to ESPRIT REDO project no. 2487 2487-TN-PRG-1065 Version 1.2, Oxford University Computing Laboratory, 11 Keble Road, Oxford OX1 3QD, March 1991.

Bowe91c
J. Bowen. From programs to object code using logic and logic programming. In R. Giegerich and S.L. Graham, editors, Code Generation - Concepts, Tools, Techniques, Workshops in Computing, pages 173-192, Dagstuhl, Germany, May 1991. Springer Verlag.

Hood91
S.T. Hood. Decompiling with definite clause grammars. Technical Report ERL-0571-RR, Electronics Research Laboratory, DSTO Australia, PO Box 1600, Salisbury, South Australia 5108, September 1991.

Hung91
L. Hungmong, L. Zongtian, and Z. Yifen. Design and implementation of the intermediate language in a PC decompiler system. Mini-Micro Systems, 12(2):23-28,46, 1991.

Fuan91
C. Fuan and L. Zongtian. C function recognition technique and its implementation in 8086 C decompiling system. Mini-Micro Systems, 12(11):33-40,47, 1991.

Guth91b
S. Guthery. Private communication. Austin Code Works, 11100 Leafwood Lane, Austin, TX 78750-3587, 14 Dec 1991.

OGor91
J. O'Gorman. Systematic Decompilation. PhD Thesis. Technical Report UL-CSIS-91-12, University of Limerick, 1991.

Breu92
P.T. Breuer and J.P. Bowen. Decompilation: The enumeration of types and grammars. Technical Report PRG-TR-11-92, Oxford University Computing Laboratory, 11 Keble Road, Oxford OX1 3QD, 1992.

Bowe93
J. Bowen. From programs to object code and back again using logic programming: Compilation and decompilation. Journal of Software Maintenance: Research and Practice. 5(4):205-234, 1993.

Breu93
P.T. Breuer and J.P. Bowen. Decompilation: the enumeration of types and grammars. Transaction of Programming Languages and Systems, 1993.

Bowe93b
J. Bowen, P. Breuer, and K. Lano. A compendium of formal techniques for software maintenance. Software Engineering Journal, pages 253-262, September 1993.

Pave93
D.J. Pavey and L.A. Winsborrow. Demonstrating equivalence of source code and PROM contents. The Computer Language, 36(7):654-667, 1993.

Fuan93
C. Fuan, L. Zongtian, and L. Li. Design and implementation techniques of the 8086 C decompiling system. Mini-Micro Systems, 14(4):10-18,31, 1993.

Breu94
P.T. Breuer and J.P. Bowen. Generating decompilers. Information and Software Technology Journal, 1994.

Site93
R.L. Sites, A. Chernoff, M.B. Kirk, M.P. Marks, and S.G. Robinson. Binary translation. Communications of the ACM, 36(2):69-81, February 1993.

Cifu94
C. Cifuentes. Reverse Compilation Techniques. PhD Dissertation. Queensland University of Technology, Department of Computing Science, 1994.

Mycr99
A. Mycroft. Type-Based Decompilation. Proceedings of ESOP'99, LNCS 1576, Springer-Verlag, 1999.

Back to decompilation page


Copyright © 1998 Cristina Cifuentes, All Rights Reserved.