Skip to main content

Archaeology: Digging up JDistill

Posted by emcmanus on April 17, 2011 at 9:19 AM PDT

I was looking over some old stuff, and found JDistill, a byte-code reduction program I wrote in 1998. Although it won't work unchanged on today's class files,
and its copyright status is murky, I thought that the article I wrote might still have some interest. Here it is, with only minor edits.

JDistill, a program to shrink Java packages

Éamonn McManus

3 April 1998

JDistill is a Java program that reduces the size of the class files
that make up a Java package. It does this in several ways:

  • It deletes information that is used for debugging but not actually
    needed for correct execution of the Java bytecode. This includes
    line-number information, for example.

  • It traces through all the executable code in the package that is
    reachable from outside the package in order to determine what is
    referenced. Anything that is never referenced is deleted. This
    can include fields and methods of Java classes, or even entire
    classes themselves.

  • It replaces the names of any methods and fields that have not been
    deleted and that are not visible outside the package with very
    short names. This has the side-effect of making the class file much
    harder to discompile.

  • It detects the frequently very bulky code generated by the current
    JDK javac compiler (versions up to 1.2beta2 at least) to initialize
    arrays, and replaces it with code that takes up much less space in
    the class file.

  • A command-line option tells JDistill that it is being given the
    whole Java program, i.e., no new classes will be added. In this
    case many more methods and fields can be deleted or renamed.

This document explains the rationale for JDistill, shows how to use it
to distill (shrink) packages, and gives some results for the standard
JDK packages in 1.1.5. The final section goes into detail about the


Java class files serve two distinct purposes. They contain
bytecode to be executed when the program using them is run;
and they contain information used when other Java classes are
compiled. For a Java program that is a product, this additional
information for the compiler is not needed and can be deleted.

Deleting the extra information makes the class files smaller, and also
makes them much harder to decipher. Tools exist that can reconstitute
almost exactly the Java source code that was used to produce a given
class file; but if the names have been dramatically shortened (a form
of obfuscation) this source code will be of little use.

When the Java compiler compiles a Java package, it does not know if
additional classes will subsequently be added to the package. JDistill
operates on the assumption that they will not. That means that
anything within the package that is not referenced now will
never be referenced and can be deleted.

You might imagine that anything JDistill can delete out of a package
could also be deleted from the Java source code by a human user. This
is not true for a number of reasons.

  • When you declare a constant in Java, like this:
    	static final int MAX_VALUE = 32767;

    the compiler must leave this field in the generated class file,
    so that other classes compiled later will be able to obtain its
    value. But no compiled class will ever refer to the field by name,
    because instead of generating a reference to
    SomeClass.MAX_VALUE the compiler will just copy the
    constant value 32767. JDistill will see that there are no references
    to MAX_VALUE in existing class files (because of this
    replacement) and will know that nothing outside the package can
    subsequently refer to MAX_VALUE (because it is not
    public), so it can delete it.

  • A product might exist in several configurations, selected at
    compile time. It is possible to eliminate code within methods
    when it belongs to a configuration that is not enabled, by doing
    something like:
    	if (Config.DEBUG) {
        // debug code...

    But this technique cannot eliminate methods themselves or fields
    or even classes that are only used when Config.DEBUG
    is true. JDistill can.

  • Renaming methods and fields can save a small amount of space and
    can render class files very difficult to decipher. But it would
    be counterproductive to do the renaming in the Java source files!

  • As mentioned above, the Java compiler generates bulky code to
    initialize constant arrays - the same code as it generates for
    non-constant arrays, in fact. If you write:
    	static final int[] powers = {1, 5, 25, 125, 625, 3125};

    the compiler will generate code almost as if you had written:
    	static final int[] powers;
    static {
        powers = new int[6];
        powers[0] = 1;
        powers[1] = 5;
        powers[5] = 3125;

    (The code is slightly more efficient than that because it can keep
    a reference to the powers array on the operand
    stack.) You can write code that takes up less space
    to initialize the array, by storing the initial values in a
    constant String and pulling them out in a loop, but it is
    unnecessary work to do so. JDistill does it for you automatically.

Using JDistill

2011 Éamonn says: you might want to skip to the section on technical details.

The command line for using JDistill looks like this:

java ORG.opengroup.jdistill.Main [-options] directory [...]

At its simplest, all of the .class files in the named
directory are read, transformed to make them smaller, and written back
over the originals. It is assumed that the class files form a package
and that that package is closed, meaning that no new classes
will be added to it after distilling, or that it will be redistilled after
new classes are added.

Several directories can be named instead of just one; currently the
effect is the same as if JDistill had been run on each directory
individually. Files can also be named instead of directories,
whereupon they are treated separately without the "closed"

Invoking JDistill without any arguments or with incorrect arguments
produces a summary of the available options. The options are as
follows. Each option can be abbreviated to any unambiguous prefix.


Followed by a pattern that says where output files should be
written. The pattern should have two parts separated by
=, and each part should contain exactly one
% character. Each input file should match the pattern on
the left of the =, with % being a wildcard
that matches any sequence of characters. The name of the
corresponding output file will be determined by the pattern on the
right of the =, with the % being replaced by
the same sequence of characters. Thus for example the option
	-out /tmp/classes/%=/tmp/shrunkclasses/%

will cause a class file read from
/tmp/classes/java/lang/String.class to be written to

The % syntax is inspired by GNU make. The more obvious
* character was avoided to prevent problems with Unix
shell expansion.

If this option is not provided, distilled class files are
written over the original undistilled ones. In that case classes
that are found not to be referenced are not deleted; it would be
more accurate to say that they are not copied when
-output-replace is specified.


Followed by the name of a file from which further options are
to be read. Typically there will be -keep-* options
in this file, kept up to date as needed. The syntax of options in
the file is the same as on the command line except that they can
be provided on several lines, and lines beginning with a
# character are ignored.


Print a line for each class file that is read to be distilled, each
class that is loaded without being distilled (to provide inheritance
information, for instance), and each class file that is written.


Do not print messages about constructs found in the class files
that might cause problems with the distillation. Currently the only
warnings are associated with classes loaded dynamically using
Class.forName(String). If the argument is not a
constant string, or if it is a string naming a class that cannot
be found at distill time, a warning is issued.


Use the class path immediately following this option instead of
the standard class path when looking for classes that are not
included in the set to be distilled but are referenced from it. Thus
for instance JDistill will almost always want to load
java.lang.Object to find out if any methods in that
class are overridden by the classes being distilled; if
those classes belong to a different system than the one where
JDistill is being run then it would not be right to load the same
java.lang.Object that is being used for the JDistill
code itself.


Do not delete the class following this option even if it is
apparently unreferenced. The Java syntax for a class is used, not
the name of the class file. So

might cause the file /unshrunk/java/io/Thing.class to
be copied to /shrunk/java/io/Thing.class even if is not a public class and there are no
references to it within

This option does not affect deletion or renaming of methods and
fields within the named class.


Do not delete or rename the field following this option
even if it would apparently be all right to do so. This is
especially useful for fields that are accessed only from native
code. The field is given in its fully-qualified form,
i.e., the full name of its class followed by the name of
the field, such as java.awt.Font.pData.

If a field is accessed from native code but not using JNI, all
preceding fields should also be kept, or javah should
be run on the distilled classes rather than the originals.


Do not delete or rename the method following this option even if
it would apparently be all right to do so. There is no way to
differentiate among overloaded methods with the same name using
this option: all methods with the given name will be preserved.
Again this is useful for methods that are accessed only from
native code. The method is given in its fully-qualified form but
without parameters, for instance


Do not delete any constructors for the class following this option
even if they are apparently unreferenced. The class name is spelt
out as for -keep-class.


Do not consider methods with the signature
	public static void main(String[])

as entry points, unless they are in public
classes, in which case they would be entry points anyway. This
option allows main methods introduced for debugging
to be automatically deleted from production code.


Assume that no classes other than those being distilled will ever
be loaded. This means that anything that is not referenced within
the distilled classes is not referenced at all and can be deleted.


Treat each directory argument as the root of a
directory hierarchy and shrink every file in that hierarchy whose
name matches *.class.


Make intermediate directories as needed when creating output class


Print a disassembly of the structure of each class file as it is
read, including a symbolic representation of the bytecode of each
method. The information printed is similar to that printed by the
standard javap tool, but more detailed.


Show fields, methods, and classes that are deleted because they
are unreferenced. The information displayed can be very
instructive. It may reveal parts of a program that should have
been deleted during some earlier modification; it may reveal
faulty program logic that means some code is never reached; or it
may indicate things that are only referenced from native code and
so need to be protected using the -keep options. If
the program being distilled was compiled with -O, short
methods will be inlined where they are called, so if they are not
visible outside their package JDistill will legitimately delete


Also show constants from the constant pool that are
deleted. These will typically include the names of deleted fields
and methods, and also constant strings or integers that were only
referred to from within deleted methods. If fields and methods
are renamed, the old names will normally be deleted too.


Show fields and methods that are renamed to shorter names. You
may want to keep this information in case you ever have to debug a
stack trace through methods that are now called A or


Show the analysis that is done to determine what methods can be
renamed. The existence of method overriding and hiding ( href="">Java
Language Specification §8.4.6) means that this
analysis is complex, as described later. With this option, if a
method is not renamed then JDistill explains why, and it also shows
sets of methods that were found to be required to have the same name.


Do not rename any methods or fields.


Show all array initializations that were considered for
shortening, with the savings obtained for those that were indeed
shortened and the reason why not for those that were not.


Show additional information about array shortening. This
information is principally of interest for debugging JDistill


Do not shorten any array initializations.


Show the results of the JDistill reference analysis before anything
is deleted. Essentially, the list of things that are unreferenced
is the same as the list of things that will be deleted, so the
principal interest of this option is to see how many
references there were to certain things. Currently, only
constants in the constant pool are counted in this way; for other
entities the information is just whether they are referenced or


Print a detailed trace of the flow of execution as references from
bytecode are determined. This option produces a great deal of
output and is mainly of use for debugging JDistill.


These measurements were made on an HP/UX machine running version 1.1.5
of the JDK. All of the standard classes of the JDK were distilled using
all available shrinking. A few -keep-* options had to be
provided to prevent incorrect deletion of fields and methods used only
by native code. With these options in place, the distilled classes get
the same results in the JCK as the original classes.

The overall reduction is a spectacular 32%, but a great deal of that
is attributable to two factors: shortened array initializations,
and classes compiled without optimization. Some classes, for instance
java.lang.Character and almost all of the
package, contain big arrays which the javac compiler turns into huge
bytecode. A later section will mention some href="#arrayshrinkprobs">problems that the shortened
array initializations may cause. The standard packages distributed as
part of the JDK ought to be compiled with optimization (-O flag),
which as a side-effect leaves out line-number information, but
some of them are not, which makes even larger
than it would be.

The standard packages measured here are not typical of Java
applications, in that they export many individual methods but there is
less cooperation between classes in individual packages. The fact
that most of the package information is exported greatly limits what
JDistill can do, since publicly-visible fields and methods cannot be
deleted or renamed.

As another example, the JDistill program itself, compiled with
optimization (-O flag), was distilled. Here the saving was
19%, a figure probably more typical of applications. There were no
large arrays to be shrunk so the figures with and without that
optimization were essentially the same.

Package Old size New size Saved
java.lang 160030 97791 62239 (38.9%)
java.lang.reflect 11504 10990 514 (4.5%)
java.util 64858 60109 4749 (7.3%) 37163 31614 5549 (14.9%) 111783 103385 8398 (7.5%)
java.math 21682 17622 4060 (18.7%)
java.text 229856 102778 127078 (55.3%)
java.text.resources 376544 323447 53097 (14.1%) 60327 52467 7860 (13%) 35759 29163 6596 (18.4%) 3233 2358 875 (27.1%) 1561 1015 546 (35%)
java.awt 252342 208434 43908 (17.4%)
java.awt.peer 9760 6739 3021 (31%)
java.awt.image 30732 24132 6600 (21.5%)
java.awt.datatransfer 6160 4979 1181 (19.2%)
java.awt.event 31704 26429 5275 (16.6%)
java.applet 3931 3124 807 (20.5%)
java.sql 36624 32136 4488 (12.3%)
java.rmi 16475 12601 3874 (23.5%)
java.rmi.registry 3023 2489 534 (17.7%)
java.rmi.dgc 2593 2010 583 (22.5%)
java.rmi.server 24504 20112 4392 (17.9%)
java.beans 54140 43790 10350 (19.1%) 150833 145779 5054 (3.4%) 67557 66230 1327 (2%) 115499 90953 24546 (21.3%) 43774 41376 2398 (5.5%) 310942 293189 17753 (5.7%) 31606 26536 5070 (16%) 9821 8273 1548 (15.8%) 21708 17665 4043 (18.6%) 49613 15245 34368 (69.3%) 30014 24241 5773 (19.2%) 5077 4228 849 (16.7%) 6928 5969 959 (13.8%)
sun.misc 54184 40048 14136 (26.1%) 5350264 3140814 2209450 (41.3%) 13079 10515 2564 (19.6%) 8599 7201 1398 (16.3%) 6880 5677 1203 (17.5%) 4531 3529 1002 (22.1%) 27227 21537 5690 (20.9%) 1441 1062 379 (26.3%) 1926 1485 441 (22.9%) 6001 5160 841 (14%) 16302 13041 3261 (20%) 3871 3400 471 (12.2%) 1484 1261 223 (15%) 3468 2897 571 (16.5%) 9285 8084 1201 (12.9%) 8569 7085 1484 (17.3%) 4504 3852 652 (14.5%) 2916 2506 410 (14.1%) 7499 6291 1208 (16.1%) 15538 13132 2406 (15.5%) 6589 5685 904 (13.7%) 18830 9600 9230 (49%)
sun.awt 36519 26444 10075 (27.6%)
sun.awt.motif 173468 138929 34539 (19.9%)
sun.awt.image 59173 45877 13296 (22.5%)
sun.awt.tiny 147662 117500 30162 (20.4%)
sun.applet 92317 75045 17272 (18.7%)
sun.applet.resources 14299 12983 1316 (9.2%) 13307 10566 2741 (20.6%) 22505 18072 4433 (19.7%) 44805 37444 7361 (16.4%) 25451 22040 3411 (13.4%) 72578 57993 14585 (20.1%)
sun.rmi.registry 12711 10733 1978 (15.6%)
sun.rmi.server 19483 16666 2817 (14.5%)
sun.rmi.transport 56585 45911 10674 (18.9%)
sun.rmi.transport.proxy 40943 32851 8092 (19.8%)
sun.rmi.transport.tcp 45573 37181 8392 (18.4%)
sun.rmi.rmic 45814 38269 7545 (16.5%)
sun.beans.editors 17552 15000 2552 (14.5%)
sun.beans.infos 1477 1235 242 (16.4%)
sun.jdbc.odbc 152157 125688 26469 (17.4%)
sunw.util 537 266 271 (50.5%) 231 100 131 (56.7%)
Total (80 packages) 9097794 6160053 2937741 (32.3%)
Average 113722 77001 36722 (32.3%)

Technical details

JDistill does three essentially independent tasks when working on a

  1. It traces the bytecode to find out what is referenced and deletes
    whatever is not;

  2. It shortens the names of fields and methods whose existing names
    are not visible outside the package;

  3. It shortens array initializations.

Finding references

The algorithm for tracing all of the bytecode in a package that can
ever be reached is conceptually simple: start from the public
entry points
and trace through instructions; every time you see
an instruction that references a field, mark that field as referenced;
every time you see an instruction that invokes a method, mark that
method as referenced and trace it too.

The public entry points of a package are
determined as follows:

  • Every public method in a public class
    is an entry point because it can be called directly by Java code
    outside the package.

  • Every protected method in a public class
    is an entry point because it can be called by a subclass of
    the class outside the package (this is not true if the class is
    final but we don't currently check that).

  • Every public or protected method in a
    class that is not public but that is subclassed by a
    public class in the same package is an entry point
    because it is visible through the subclass.

  • A method that looks like
    	public static void main(String[] args)

    is an entry point because a Java program can start there, even if
    it is in a class that is not public. Since such
    a method often contains test code for the containing class that is
    not supposed to be included in the final product, JDistill has a
    command-line option -no-main to tell it not to consider
    this kind of entry point unless it is covered by one of the other

  • Every method, even a private one, in a class that
    contains native method declarations is an entry
    point. Native code can typically access any field or method of
    any class regardless of Java-language visibility. We can't know
    what fields or methods are accessed in this way, and we don't want
    to keep all fields and methods just in case they might be accessed
    by native code, so we assume that classes with native
    methods are the only ones whose methods or fields are accessed in
    this way. When this assumption is wrong, it can be corrected as
    explained in the next paragraph.

  • Every method that was explicitly named by a
    -keep-method or
    command-line option is an entry point. You can use this to
    preserve methods called only from native code and not included in
    the heuristic described just above.

A constructor is considered to be a method like any other
for the purposes of these rules.

If the -closed-program option is given, only the last two
rules are applied, but then if a class containing native
methods is determined to be referenced, all of its methods are
considered to be entry points.

While we are looking for public entry points, we also look for
visible fields, that is fields
that can be accessed directly or indirectly from outside the package.
The rules here are the following:

  • Every public or protected field in a
    public class or a class that has a
    public subclass in the package is visible. This is
    essentially the same rule as for methods.

  • Every field in a class that has native methods is
    visible, for the same reasons as explained for methods above.

  • Every field in a class that implements
    (directly or by inheritance) is visible unless it is
    static or transient, because changing
    the field in any way would break compatibility with
    previously-serialized objects of this class.

  • If a class that implements has a
    field called serialVersionUID then that field also
    participates in serialization even if it is static
    (in fact it always is static).

  • Every field that was explicitly named by a
    -keep-field command-line option
    is visible. Again this can be used to prevent the deletion of fields
    that are only referenced from native code and which are in classes
    that do not themselves contain native methods.

Here again, the public rule does not come into play when
-closed-program is given, and the other rules except the
last are only applied to classes that are found to be referenced.

Tracing bytecode

We examine the instructions of every method that is an entry point as
explained above, and every method that is called by such a method, and
so on recursively. The following sorts of instructions have to be

  • A checkcast, instanceof,
    anewarray, getstatic,
    putstatic, or invokestatic instruction
    causes the class it references to be
    visited statically,
    as explained below.

  • A new, getfield, putfield,
    invokevirtual, invokespecial, or
    invokeinterface instruction causes the class it
    references to be visited
    , as explained below.

  • The various invoke* instructions also cause the
    methods they invoke to be traced if they have not already been.

  • If an invokestatic instruction turns out to invoke
    the java.lang.Class.forName(String) method then we have
    detected a dynamically-loaded class. If the class being loaded is
    known (the argument to forName is a constant
    String) then it is visited
    Furthermore, its no-arg constructor is traced, because it
    might be invoked through the newInstance() method in
    java.lang.Class. (An earlier version only traced the
    no-arg constructor if a call to newInstance() was
    seen within the package, but this was wrong because the
    Class reference could easily be passed outside the
    package and newInstance() invoked there.)

    If the argument to Class.forName is not a
    constant String, a warning is issued, because JDistill's
    assumptions about reachability might be violated.

    You will need to use the various -keep-* command-line
    options if the package being distilled contains classes, or fields and
    methods within classes, that are only referenced through dynamic
    loading and reflection.

  • The field-access instructions (getstatic etc) also
    cause the fields they name to be marked as referenced.

  • The ldc, ldc_w, and ldc2_w
    instructions cause the constants they reference to be marked as

A class is visited statically when
we know that it will be initialized (in the sense of the href="">
Java Language Specification, §12.4.1). In this case the
class is marked as referenced if it was not already, and if it has a
static initializer method (<clinit>) that method's
bytecode will be traced.

A class is visited dynamically when
we know that there must be at least one instance of it for the program
to be correct. In this case we have further information about the class:

  • If it implements the
    interface and contains one or both of the methods
    	void readObject( in)
    void writeObject( out)

    then those methods are potentially referenced and must be traced.

  • If it overrides methods in an ancestor class that are referenced
    then the overriding methods must also be considered to be
    referenced. This is further explained in the next section.

We do not explicitly visit a parent class dynamically when one if its
children is visited dynamically, but if the child is really
instantiated then we should end up visiting at least one of its
constructors, and this constructor will start with an invocation of
one of the parent's constructors, which will provoke a dynamic visit.

Handling overriding methods

Once all of the public entry points and all of the methods they
invoke have been traced recursively, a further important way to reach
methods must be considered. A method that is never itself directly
invoked, but that overrides a method in an ancestor class that is,
must be traced, provided that the class it is contained in is actually

Finding indirectly-referenced overriding methods is not completely
straightforward. Here is how we tackle it. We loop over all the
classes in the package looking for non-static methods
that are currently not referenced. For each such method, we recursively
scan the ancestors (superclasses and superinterfaces) looking for
methods with the same name and parameters, which are therefore
overridden; we thus construct a set containing the first overridden method
found on each path through the ancestor graph. If we find a referenced
overridden method, then we mark the overriding method as referenced too
and trace it. After we have scanned all the classes in the package and
if we found any overriding methods, we repeat the whole process, until
no further overriding methods are found.

In this way we are sure to propagate the referencedness all the way
down a possibly long chain of overrides, and we are sure to pick up new
chains of overrides that start at a method that is itself only referenced
in an overriding method.

A more straightforward approach would be possible if we linked classes
to all their subclasses and then methods to all their overriders, but
building that information would not be much simpler than what we do

In the diagram,
Tracing indirect method references align=right>

the method "0" is the only one explicitly referenced in the program.
The methods labeled "1" are in subclasses that override it, and the
methods labeled "2" override those. The first time we loop through
all the classes we will find that the methods "1" are unreferenced but
override a referenced method, so we will mark them referenced and
trace them. Then either later in that scan through the classes or on
the next scan, depending on the order in which we visit the classes,
we will do the same thing for the methods labeled "2".

Renaming fields and methods

JDistill renames fields and methods in the package being shrunk so that
they are shorter. Shortening the name of a field or method reduces
space not just in the class where it is defined, but in every class
that references it. If a member (a field or method) is referenced in
a class there will be a single constant naming the class containing the
member and the name of the member itself. This constant will be shared
between all the references from the same class, so renaming will have
most effect when there are many small classes rather than a few big ones.

Renaming fields

Renaming fields is straightforward. Any field that is not href="#visiblefield">visible can be renamed. JDistill renames fields in a
given class starting at A and counting up to Z,
then a to z, then _, then
longer names starting with these characters but also including the
digits 0 to 9. This means that all new names
are valid Java identifiers. The specification of class files does not
actually require this, but it seems best to avoid possible problems with
incorrect implementations that do.

The only other constraint on chosen field names is that they must not
conflict with existing field names in the same class, specifically those
that are visible. This is easily done using a hash table that is needed
anyway in order to follow field references to the corresponding field.

Renaming methods

Renaming methods is much harder because of overriding. The field
names in a class are independent of those in its ancestors and
descendants; class files specify exactly which class's field is being
referenced. The names of methods in a class on the other hand are
intimately tied to the names of methods in its ancestors and descendants.
If a method has the same name as a method in an ancestor, then both
names must still be the same after renaming. Contrariwise, if
the name of a method is different from the name of another method in
the same class or in any ancestor or descendant, then it must remain
different after renaming.

Overloaded methods

Overloading also complicates method renaming slightly. Where we
talked about names being different in the previous paragraph, we
should really have talked about signatures. That is, the
combination of name and parameter types must be the same or
different according as there is an overriding relationship between two

Methods that must be renamed together

Before we can rename any methods, we must group together all methods
that have an overriding relationship so that they will all continue to
have the same name after renaming. This is straightforward. With
every method in every class, we associate a set of other methods.
This is usually a set of methods that must have the same name; but
if a method must not be renamed, either because it is an href="#entry">entry point or because it has an overriding relationship
with an entry point, then it will belong to a special set called the
fixed-name set.

To construct the sets of methods we proceed as follows. We visit the
classes in ancestor-first order. For each method in each class we
visit, we first check if it can be renamed at all. If not, it is
added to the fixed-name set. If so, it is added to a newly-created
set. Then, for every method that the method we are visiting
overrides, we merge together the sets for the overriding and
overridden methods. In this way, if any method in the group of
methods that must have the same name as the current method turns out
not to be renameable, all of the methods in the group will end up in
the fixed-name set. If all methods in the group can be renamed then
they will all end up in the same set.

Finding new names

We can consider finding new method names as a
graph-colouring problem. In the graph, the nodes are methods and
an edge joins two nodes if the corresponding methods must not
have the same name, because they have the same parameters and
either they are in the same class or one of them is in an
ancestor class of the other. (Methods for which this is true and
that already have the same name already have an overriding
relationship and can be coalesced into the same node.) The graph
is actually a collection of disjoint graphs, one for every
possible parameter signature. These graphs can be considered to
be coloured when we start, where the colours are method names and
no two nodes connected by an edge can have the same colour. What
we want to do is to recolour those nodes where this is allowed so
as to use more efficient colours (shorter names).

Building this graph would be expensive and unnecessarily
complicated. Instead we use some heuristics to achieve a result
that is theoretically non-optimal but will in fact almost always
be as good as optimal (not use longer names than necessary). The
heuristics are:

  1. We never recolour with a colour that was already in the graph
    before we started. This means that we only have to check the
    colours of adjacent nodes we have already coloured.
    (Translation: we build a big hashtable of every signature of
    every method and avoiding renaming a method to something that
    gives it the same signature as one of these.)

  2. We always visit the methods of a class before those of any of
    its descendants. This means that we can find the adjacent
    nodes already coloured by searching upwards in the
    inheritance dag. There are currently no downward links in it.

The reason this is nearly always as good as real colouring is
that the short names we will be using as new colours hardly ever
exist already as method names, so our first heuristic rarely makes
any difference. The exception is when some but not all of the
classes being distilled were generated by programs that make short
names (such as JDistill itself). In that case we will
typically find that we can't make those names any shorter than
they were already, and we can shorten names in the non-generated
classes less than we would be able to with an optimal algorithm.

Shortening array initializations

As we mentioned earlier, the standard JDK Java compiler from Sun takes
source code like:

    static final int[] powers = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};

and generates bytecode almost as if you had written:
    static final int[] powers;
    static {
powers = new int[10];
powers[0] = 1;
powers[1] = 2;
powers[10] = 1024;

More exactly, the code it generates in the static initializer
<clinit> looks like this (the comments show
what is on the stack after each instruction):
    bipush 11		// 11  (size of array)
    newarray int // int[]
    dup // int[], int[]
    iconst_0 // int[], int[], 0
    iconst_1 // int[], int[], 0, 1
    iastore // int[]  (now array[0] == 1)
    dup // int[], int[]
    bipush 10 // int[], int[], 10
    sipush 1024 // int[], int[], 10, 1024
    iastore // int[]  (now array[10] == 1024)
    putstatic powers // (empty; now powers == the initialized array)

In cases like this where the range of values is relatively small, we can
instead initialize the array as if the code had been this:
    static final int[] powers;
    static {
powers = new int[11];
final String powersString =
for (int i = 0; i < 11; i++)
    powers[i] = powersString.charAt(i);

That is, we construct a String where each character is a value to be
put in the array, and extract those values when the array is created.
This operation is slower than the original unrolled bytecode, which is
why we only do this operation in static initializers; since they are
only executed once presumably the small extra time will make no
difference. (It would be useful to have an option to perform this
operation everywhere big arrays are seen, notably in constructors,
despite the small speed penalty; but currently there is no such option.)

The bytecode we construct to unpack the values from the String into
the array looks like this:

    bipush 11		// 11  (size of array)
    newarray int // int[]
    bipush 10 // int[], 10 (which is the array index i)
    dup2 // int[], i, int[], i
    dup // int[], i, int[], i, i
    ldc "\u0001..." // int[], i, int[], i, i, "\u0001..."
    swap // int[], i, int[], i, "\u0001...", i
    invokevirtual String.charAt(int)
// int[], i, int[], i, <character at i>
    iastore // int[], i  (now array[i] has been set)
    iconst_1 // int[], i, 1
    isub // int[], i-1  (new value of i for next iteration)
    dup // int[], i, i
    ifge loop // int[], i
    pop // int[]
    putstatic powers // (empty; now powers == the initialized array)

We do all the operations on the stack rather than using local
variables, which would yield more straightforward bytecode, because we
don't want to have to allocate a local variable or try to determine if
one is free at the point where we are inserting this code. Similarly,
we add a constant to the currently-declared maximum stack depth rather
than computing what its real value is after the introduction of the
new code, because we have no need to know anything about the stack
usage of instructions anywhere else in the distiller.

If some of the array values are negative or greater than 65535 then
they will not fit in the char values of a
String. In this case, provided that the span of
values is not more than 65535, we can encode in the same way but add
a (possibly negative) constant to each value.

String constants are encoded in class files using
, which has the property that a character c will
be encoded using:

1 byte, if 0 < c < 128;
2 bytes, if c = 0 or 127 < c <
3 bytes, otherwise (2047 < c < 65536)

So in fact we try to determine if adding a constant to each value in
the encoded string will make for a string that takes up less space in
the class file, even if the span of values is such that this is not
strictly necessary. There are only two possible values that we may
want to add: either the smallest value in the array (so that values
near that smallest value are encoded in the efficient 1 - 128 range), or
one less than the smallest value in the array (for the same reason,
but additionally the smallest value itself is encoded as 1 rather than
the less efficient 0). These are the only two possibilities because we
can't add a number bigger than the smallest value in the array
(since characters in the String cannot be negative), and
the fact that the cost function does not decrease except at 1 means that
there is no point in adding a number smaller than the smallest
value in the array minus 1. We try both possibilities mentioned to
see if either of them makes the encoded UTF8 sufficiently smaller to
compensate for the extra instructions needed to add the constant.

After computing the optimal String encoding for the
array, we estimate how much space it will take up in the class file,
taking into account the instructions in the loop and
the fact that it may be necessary to add a reference to
java.lang.String.charAt(int). If it looks as if the
rewritten code will be more expensive than the original, then no
change is made. The calculations here are not completely accurate so
sometimes we may fail to save space when we could have, but the loss is
small, typically a dozen bytes or so, and at worst roughly 50.

Why you might not want to shrink arrays

The Java Language Specification guarantees ( href="">

jdistill.1.gif521 bytes
Related Topics >>