Skip to main content

Archaeology: Digging up JDistill

Posted by emcmanus on April 17, 2011 at 9:19 AM PDT

I was looking over some old stuff, and found JDistill, a byte-code reduction program I wrote in 1998. Although it won't work unchanged on today's class files,
and its copyright status is murky, I thought that the article I wrote might still have some interest. Here it is, with only minor edits.

JDistill, a program to shrink Java packages

Éamonn McManus


3 April 1998

JDistill is a Java program that reduces the size of the class files
that make up a Java package. It does this in several ways:

  • It deletes information that is used for debugging but not actually
    needed for correct execution of the Java bytecode. This includes
    line-number information, for example.

  • It traces through all the executable code in the package that is
    reachable from outside the package in order to determine what is
    referenced. Anything that is never referenced is deleted. This
    can include fields and methods of Java classes, or even entire
    classes themselves.

  • It replaces the names of any methods and fields that have not been
    deleted and that are not visible outside the package with very
    short names. This has the side-effect of making the class file much
    harder to discompile.

  • It detects the frequently very bulky code generated by the current
    JDK javac compiler (versions up to 1.2beta2 at least) to initialize
    arrays, and replaces it with code that takes up much less space in
    the class file.

  • A command-line option tells JDistill that it is being given the
    whole Java program, i.e., no new classes will be added. In this
    case many more methods and fields can be deleted or renamed.

This document explains the rationale for JDistill, shows how to use it
to distill (shrink) packages, and gives some results for the standard
JDK packages in 1.1.5. The final section goes into detail about the
implementation.

Rationale

Java class files serve two distinct purposes. They contain
bytecode to be executed when the program using them is run;
and they contain information used when other Java classes are
compiled. For a Java program that is a product, this additional
information for the compiler is not needed and can be deleted.

Deleting the extra information makes the class files smaller, and also
makes them much harder to decipher. Tools exist that can reconstitute
almost exactly the Java source code that was used to produce a given
class file; but if the names have been dramatically shortened (a form
of obfuscation) this source code will be of little use.

When the Java compiler compiles a Java package, it does not know if
additional classes will subsequently be added to the package. JDistill
operates on the assumption that they will not. That means that
anything within the package that is not referenced now will
never be referenced and can be deleted.

You might imagine that anything JDistill can delete out of a package
could also be deleted from the Java source code by a human user. This
is not true for a number of reasons.

  • When you declare a constant in Java, like this:
    	static final int MAX_VALUE = 32767;

    the compiler must leave this field in the generated class file,
    so that other classes compiled later will be able to obtain its
    value. But no compiled class will ever refer to the field by name,
    because instead of generating a reference to
    SomeClass.MAX_VALUE the compiler will just copy the
    constant value 32767. JDistill will see that there are no references
    to MAX_VALUE in existing class files (because of this
    replacement) and will know that nothing outside the package can
    subsequently refer to MAX_VALUE (because it is not
    public), so it can delete it.

  • A product might exist in several configurations, selected at
    compile time. It is possible to eliminate code within methods
    when it belongs to a configuration that is not enabled, by doing
    something like:
    	if (Config.DEBUG) {
        // debug code...
    }

    But this technique cannot eliminate methods themselves or fields
    or even classes that are only used when Config.DEBUG
    is true. JDistill can.

  • Renaming methods and fields can save a small amount of space and
    can render class files very difficult to decipher. But it would
    be counterproductive to do the renaming in the Java source files!

  • As mentioned above, the Java compiler generates bulky code to
    initialize constant arrays - the same code as it generates for
    non-constant arrays, in fact. If you write:
    	static final int[] powers = {1, 5, 25, 125, 625, 3125};

    the compiler will generate code almost as if you had written:
    	static final int[] powers;
    static {
        powers = new int[6];
        powers[0] = 1;
        powers[1] = 5;
        ...
        powers[5] = 3125;
    }

    (The code is slightly more efficient than that because it can keep
    a reference to the powers array on the operand
    stack.) You can write code that takes up less space
    to initialize the array, by storing the initial values in a
    constant String and pulling them out in a loop, but it is
    unnecessary work to do so. JDistill does it for you automatically.

Using JDistill

2011 Éamonn says: you might want to skip to the section on technical details.

The command line for using JDistill looks like this:

java ORG.opengroup.jdistill.Main [-options] directory [...]

At its simplest, all of the .class files in the named
directory are read, transformed to make them smaller, and written back
over the originals. It is assumed that the class files form a package
and that that package is closed, meaning that no new classes
will be added to it after distilling, or that it will be redistilled after
new classes are added.

Several directories can be named instead of just one; currently the
effect is the same as if JDistill had been run on each directory
individually. Files can also be named instead of directories,
whereupon they are treated separately without the "closed"
assumption.

Invoking JDistill without any arguments or with incorrect arguments
produces a summary of the available options. The options are as
follows. Each option can be abbreviated to any unambiguous prefix.

-output-replace

Followed by a pattern that says where output files should be
written. The pattern should have two parts separated by
=, and each part should contain exactly one
% character. Each input file should match the pattern on
the left of the =, with % being a wildcard
that matches any sequence of characters. The name of the
corresponding output file will be determined by the pattern on the
right of the =, with the % being replaced by
the same sequence of characters. Thus for example the option
	-out /tmp/classes/%=/tmp/shrunkclasses/%

will cause a class file read from
/tmp/classes/java/lang/String.class to be written to
/tmp/shrunkclasses/java/lang/String.class.

The % syntax is inspired by GNU make. The more obvious
* character was avoided to prevent problems with Unix
shell expansion.

If this option is not provided, distilled class files are
written over the original undistilled ones. In that case classes
that are found not to be referenced are not deleted; it would be
more accurate to say that they are not copied when
-output-replace is specified.

-option-file

Followed by the name of a file from which further options are
to be read. Typically there will be -keep-* options
in this file, kept up to date as needed. The syntax of options in
the file is the same as on the command line except that they can
be provided on several lines, and lines beginning with a
# character are ignored.

-verbose

Print a line for each class file that is read to be distilled, each
class that is loaded without being distilled (to provide inheritance
information, for instance), and each class file that is written.

-no-warnings

Do not print messages about constructs found in the class files
that might cause problems with the distillation. Currently the only
warnings are associated with classes loaded dynamically using
Class.forName(String). If the argument is not a
constant string, or if it is a string naming a class that cannot
be found at distill time, a warning is issued.

-class-path

Use the class path immediately following this option instead of
the standard class path when looking for classes that are not
included in the set to be distilled but are referenced from it. Thus
for instance JDistill will almost always want to load
java.lang.Object to find out if any methods in that
class are overridden by the classes being distilled; if
those classes belong to a different system than the one where
JDistill is being run then it would not be right to load the same
java.lang.Object that is being used for the JDistill
code itself.

-keep-class

Do not delete the class following this option even if it is
apparently unreferenced. The Java syntax for a class is used, not
the name of the class file. So
	-keep-class java.io.Thing

might cause the file /unshrunk/java/io/Thing.class to
be copied to /shrunk/java/io/Thing.class even if
java.io.Thing is not a public class and there are no
references to it within java.io.

This option does not affect deletion or renaming of methods and
fields within the named class.

-keep-field

Do not delete or rename the field following this option
even if it would apparently be all right to do so. This is
especially useful for fields that are accessed only from native
code. The field is given in its fully-qualified form,
i.e., the full name of its class followed by the name of
the field, such as java.awt.Font.pData.

If a field is accessed from native code but not using JNI, all
preceding fields should also be kept, or javah should
be run on the distilled classes rather than the originals.

-keep-method

Do not delete or rename the method following this option even if
it would apparently be all right to do so. There is no way to
differentiate among overloaded methods with the same name using
this option: all methods with the given name will be preserved.
Again this is useful for methods that are accessed only from
native code. The method is given in its fully-qualified form but
without parameters, for instance java.io.File.exists.

-keep-constructor

Do not delete any constructors for the class following this option
even if they are apparently unreferenced. The class name is spelt
out as for -keep-class.

-no-main

Do not consider methods with the signature
	public static void main(String[])

as entry points, unless they are in public
classes, in which case they would be entry points anyway. This
option allows main methods introduced for debugging
to be automatically deleted from production code.

-closed-program

Assume that no classes other than those being distilled will ever
be loaded. This means that anything that is not referenced within
the distilled classes is not referenced at all and can be deleted.

-tree

Treat each directory argument as the root of a
directory hierarchy and shrink every file in that hierarchy whose
name matches *.class.

-make-directories

Make intermediate directories as needed when creating output class
files.

-disassemble

Print a disassembly of the structure of each class file as it is
read, including a symbolic representation of the bytecode of each
method. The information printed is similar to that printed by the
standard javap tool, but more detailed.

-show-deletions

Show fields, methods, and classes that are deleted because they
are unreferenced. The information displayed can be very
instructive. It may reveal parts of a program that should have
been deleted during some earlier modification; it may reveal
faulty program logic that means some code is never reached; or it
may indicate things that are only referenced from native code and
so need to be protected using the -keep options. If
the program being distilled was compiled with -O, short
methods will be inlined where they are called, so if they are not
visible outside their package JDistill will legitimately delete
them.

-debug-deletions

Also show constants from the constant pool that are
deleted. These will typically include the names of deleted fields
and methods, and also constant strings or integers that were only
referred to from within deleted methods. If fields and methods
are renamed, the old names will normally be deleted too.

-show-renames

Show fields and methods that are renamed to shorter names. You
may want to keep this information in case you ever have to debug a
stack trace through methods that are now called A or
B.

-debug-renames

Show the analysis that is done to determine what methods can be
renamed. The existence of method overriding and hiding ( href="http://www.javasoft.com/docs/books/jls/html/8.doc.html#228745">Java
Language Specification §8.4.6) means that this
analysis is complex, as described later. With this option, if a
method is not renamed then JDistill explains why, and it also shows
sets of methods that were found to be required to have the same name.

-no-renames

Do not rename any methods or fields.

-show-array-shrink

Show all array initializations that were considered for
shortening, with the savings obtained for those that were indeed
shortened and the reason why not for those that were not.

-debug-array-shrink

Show additional information about array shortening. This
information is principally of interest for debugging JDistill
itself.

-no-array-shrink

Do not shorten any array initializations.

-show-references

Show the results of the JDistill reference analysis before anything
is deleted. Essentially, the list of things that are unreferenced
is the same as the list of things that will be deleted, so the
principal interest of this option is to see how many
references there were to certain things. Currently, only
constants in the constant pool are counted in this way; for other
entities the information is just whether they are referenced or
not.

-trace

Print a detailed trace of the flow of execution as references from
bytecode are determined. This option produces a great deal of
output and is mainly of use for debugging JDistill.

Measurements

These measurements were made on an HP/UX machine running version 1.1.5
of the JDK. All of the standard classes of the JDK were distilled using
all available shrinking. A few -keep-* options had to be
provided to prevent incorrect deletion of fields and methods used only
by native code. With these options in place, the distilled classes get
the same results in the JCK as the original classes.

The overall reduction is a spectacular 32%, but a great deal of that
is attributable to two factors: shortened array initializations,
and classes compiled without optimization. Some classes, for instance
java.lang.Character and almost all of the sun.io
package, contain big arrays which the javac compiler turns into huge
bytecode. A later section will mention some href="#arrayshrinkprobs">problems that the shortened
array initializations may cause. The standard packages distributed as
part of the JDK ought to be compiled with optimization (-O flag),
which as a side-effect leaves out line-number information, but
some of them are not, which makes sun.io even larger
than it would be.

The standard packages measured here are not typical of Java
applications, in that they export many individual methods but there is
less cooperation between classes in individual packages. The fact
that most of the package information is exported greatly limits what
JDistill can do, since publicly-visible fields and methods cannot be
deleted or renamed.

As another example, the JDistill program itself, compiled with
optimization (-O flag), was distilled. Here the saving was
19%, a figure probably more typical of applications. There were no
large arrays to be shrunk so the figures with and without that
optimization were essentially the same.

Package Old size New size Saved
java.lang 160030 97791 62239 (38.9%)
java.lang.reflect 11504 10990 514 (4.5%)
java.util 64858 60109 4749 (7.3%)
java.util.zip 37163 31614 5549 (14.9%)
java.io 111783 103385 8398 (7.5%)
java.math 21682 17622 4060 (18.7%)
java.text 229856 102778 127078 (55.3%)
java.text.resources 376544 323447 53097 (14.1%)
java.net 60327 52467 7860 (13%)
java.security 35759 29163 6596 (18.4%)
java.security.acl 3233 2358 875 (27.1%)
java.security.interfaces 1561 1015 546 (35%)
java.awt 252342 208434 43908 (17.4%)
java.awt.peer 9760 6739 3021 (31%)
java.awt.image 30732 24132 6600 (21.5%)
java.awt.datatransfer 6160 4979 1181 (19.2%)
java.awt.event 31704 26429 5275 (16.6%)
java.applet 3931 3124 807 (20.5%)
java.sql 36624 32136 4488 (12.3%)
java.rmi 16475 12601 3874 (23.5%)
java.rmi.registry 3023 2489 534 (17.7%)
java.rmi.dgc 2593 2010 583 (22.5%)
java.rmi.server 24504 20112 4392 (17.9%)
java.beans 54140 43790 10350 (19.1%)
sun.tools.java 150833 145779 5054 (3.4%)
sun.tools.javac 67557 66230 1327 (2%)
sun.tools.debug 115499 90953 24546 (21.3%)
sun.tools.asm 43774 41376 2398 (5.5%)
sun.tools.tree 310942 293189 17753 (5.7%)
sun.tools.jar 31606 26536 5070 (16%)
sun.tools.util 9821 8273 1548 (15.8%)
sun.tools.javap 21708 17665 4043 (18.6%)
sun.tools.javadoc 49613 15245 34368 (69.3%)
sun.tools.ttydebug 30014 24241 5773 (19.2%)
sun.tools.native2ascii 5077 4228 849 (16.7%)
sun.tools.serialver 6928 5969 959 (13.8%)
sun.misc 54184 40048 14136 (26.1%)
sun.io 5350264 3140814 2209450 (41.3%)
sun.net 13079 10515 2564 (19.6%)
sun.net.ftp 8599 7201 1398 (16.3%)
sun.net.nntp 6880 5677 1203 (17.5%)
sun.net.smtp 4531 3529 1002 (22.1%)
sun.net.www 27227 21537 5690 (20.9%)
sun.net.www.content.text 1441 1062 379 (26.3%)
sun.net.www.content.image 1926 1485 441 (22.9%)
sun.net.www.protocol.file 6001 5160 841 (14%)
sun.net.www.protocol.http 16302 13041 3261 (20%)
sun.net.www.protocol.doc 3871 3400 471 (12.2%)
sun.net.www.protocol.netdoc 1484 1261 223 (15%)
sun.net.www.protocol.verbatim 3468 2897 571 (16.5%)
sun.net.www.protocol.ftp 9285 8084 1201 (12.9%)
sun.net.www.protocol.gopher 8569 7085 1484 (17.3%)
sun.net.www.protocol.mailto 4504 3852 652 (14.5%)
sun.net.www.protocol.appletresource 2916 2506 410 (14.1%)
sun.net.www.protocol.systemresource 7499 6291 1208 (16.1%)
sun.net.www.http 15538 13132 2406 (15.5%)
sun.net.www.httpd 6589 5685 904 (13.7%)
sun.audio 18830 9600 9230 (49%)
sun.awt 36519 26444 10075 (27.6%)
sun.awt.motif 173468 138929 34539 (19.9%)
sun.awt.image 59173 45877 13296 (22.5%)
sun.awt.tiny 147662 117500 30162 (20.4%)
sun.applet 92317 75045 17272 (18.7%)
sun.applet.resources 14299 12983 1316 (9.2%)
sun.security.acl 13307 10566 2741 (20.6%)
sun.security.util 22505 18072 4433 (19.7%)
sun.security.x509 44805 37444 7361 (16.4%)
sun.security.pkcs 25451 22040 3411 (13.4%)
sun.security.provider 72578 57993 14585 (20.1%)
sun.rmi.registry 12711 10733 1978 (15.6%)
sun.rmi.server 19483 16666 2817 (14.5%)
sun.rmi.transport 56585 45911 10674 (18.9%)
sun.rmi.transport.proxy 40943 32851 8092 (19.8%)
sun.rmi.transport.tcp 45573 37181 8392 (18.4%)
sun.rmi.rmic 45814 38269 7545 (16.5%)
sun.beans.editors 17552 15000 2552 (14.5%)
sun.beans.infos 1477 1235 242 (16.4%)
sun.jdbc.odbc 152157 125688 26469 (17.4%)
sunw.util 537 266 271 (50.5%)
sunw.io 231 100 131 (56.7%)
Total (80 packages) 9097794 6160053 2937741 (32.3%)
Average 113722 77001 36722 (32.3%)

Technical details

JDistill does three essentially independent tasks when working on a
package:

  1. It traces the bytecode to find out what is referenced and deletes
    whatever is not;

  2. It shortens the names of fields and methods whose existing names
    are not visible outside the package;

  3. It shortens array initializations.

Finding references

The algorithm for tracing all of the bytecode in a package that can
ever be reached is conceptually simple: start from the public
entry points
and trace through instructions; every time you see
an instruction that references a field, mark that field as referenced;
every time you see an instruction that invokes a method, mark that
method as referenced and trace it too.

The public entry points of a package are
determined as follows:

  • Every public method in a public class
    is an entry point because it can be called directly by Java code
    outside the package.

  • Every protected method in a public class
    is an entry point because it can be called by a subclass of
    the class outside the package (this is not true if the class is
    final but we don't currently check that).

  • Every public or protected method in a
    class that is not public but that is subclassed by a
    public class in the same package is an entry point
    because it is visible through the subclass.

  • A method that looks like
    	public static void main(String[] args)

    is an entry point because a Java program can start there, even if
    it is in a class that is not public. Since such
    a method often contains test code for the containing class that is
    not supposed to be included in the final product, JDistill has a
    command-line option -no-main to tell it not to consider
    this kind of entry point unless it is covered by one of the other
    rules.

  • Every method, even a private one, in a class that
    contains native method declarations is an entry
    point. Native code can typically access any field or method of
    any class regardless of Java-language visibility. We can't know
    what fields or methods are accessed in this way, and we don't want
    to keep all fields and methods just in case they might be accessed
    by native code, so we assume that classes with native
    methods are the only ones whose methods or fields are accessed in
    this way. When this assumption is wrong, it can be corrected as
    explained in the next paragraph.

  • Every method that was explicitly named by a
    -keep-method or
    -keep-constructor
    command-line option is an entry point. You can use this to
    preserve methods called only from native code and not included in
    the heuristic described just above.

A constructor is considered to be a method like any other
for the purposes of these rules.

If the -closed-program option is given, only the last two
rules are applied, but then if a class containing native
methods is determined to be referenced, all of its methods are
considered to be entry points.

While we are looking for public entry points, we also look for
visible fields, that is fields
that can be accessed directly or indirectly from outside the package.
The rules here are the following:

  • Every public or protected field in a
    public class or a class that has a
    public subclass in the package is visible. This is
    essentially the same rule as for methods.

  • Every field in a class that has native methods is
    visible, for the same reasons as explained for methods above.

  • Every field in a class that implements java.io.Serializable
    (directly or by inheritance) is visible unless it is
    static or transient, because changing
    the field in any way would break compatibility with
    previously-serialized objects of this class.

  • If a class that implements java.io.Serializable has a
    field called serialVersionUID then that field also
    participates in serialization even if it is static
    (in fact it always is static).

  • Every field that was explicitly named by a
    -keep-field command-line option
    is visible. Again this can be used to prevent the deletion of fields
    that are only referenced from native code and which are in classes
    that do not themselves contain native methods.

Here again, the public rule does not come into play when
-closed-program is given, and the other rules except the
last are only applied to classes that are found to be referenced.

Tracing bytecode

We examine the instructions of every method that is an entry point as
explained above, and every method that is called by such a method, and
so on recursively. The following sorts of instructions have to be
noted:

  • A checkcast, instanceof,
    anewarray, getstatic,
    putstatic, or invokestatic instruction
    causes the class it references to be
    visited statically,
    as explained below.

  • A new, getfield, putfield,
    invokevirtual, invokespecial, or
    invokeinterface instruction causes the class it
    references to be visited
    dynamically
    , as explained below.

  • The various invoke* instructions also cause the
    methods they invoke to be traced if they have not already been.

  • If an invokestatic instruction turns out to invoke
    the java.lang.Class.forName(String) method then we have
    detected a dynamically-loaded class. If the class being loaded is
    known (the argument to forName is a constant
    String) then it is visited
    statically
    .
    Furthermore, its no-arg constructor is traced, because it
    might be invoked through the newInstance() method in
    java.lang.Class. (An earlier version only traced the
    no-arg constructor if a call to newInstance() was
    seen within the package, but this was wrong because the
    Class reference could easily be passed outside the
    package and newInstance() invoked there.)

    If the argument to Class.forName is not a
    constant String, a warning is issued, because JDistill's
    assumptions about reachability might be violated.

    You will need to use the various -keep-* command-line
    options if the package being distilled contains classes, or fields and
    methods within classes, that are only referenced through dynamic
    loading and reflection.

  • The field-access instructions (getstatic etc) also
    cause the fields they name to be marked as referenced.

  • The ldc, ldc_w, and ldc2_w
    instructions cause the constants they reference to be marked as
    referenced.

A class is visited statically when
we know that it will be initialized (in the sense of the href="http://www.javasoft.com/docs/books/jls/html/12.doc.html#44560">
Java Language Specification, §12.4.1). In this case the
class is marked as referenced if it was not already, and if it has a
static initializer method (<clinit>) that method's
bytecode will be traced.

A class is visited dynamically when
we know that there must be at least one instance of it for the program
to be correct. In this case we have further information about the class:

  • If it implements the java.io.Serializable
    interface and contains one or both of the methods
    	void readObject(java.io.ObjectInputStream in)
    void writeObject(java.io.ObjectOutputStream out)

    then those methods are potentially referenced and must be traced.

  • If it overrides methods in an ancestor class that are referenced
    then the overriding methods must also be considered to be
    referenced. This is further explained in the next section.

We do not explicitly visit a parent class dynamically when one if its
children is visited dynamically, but if the child is really
instantiated then we should end up visiting at least one of its
constructors, and this constructor will start with an invocation of
one of the parent's constructors, which will provoke a dynamic visit.

Handling overriding methods

Once all of the public entry points and all of the methods they
invoke have been traced recursively, a further important way to reach
methods must be considered. A method that is never itself directly
invoked, but that overrides a method in an ancestor class that is,
must be traced, provided that the class it is contained in is actually
instantiated.

Finding indirectly-referenced overriding methods is not completely
straightforward. Here is how we tackle it. We loop over all the
classes in the package looking for non-static methods
that are currently not referenced. For each such method, we recursively
scan the ancestors (superclasses and superinterfaces) looking for
methods with the same name and parameters, which are therefore
overridden; we thus construct a set containing the first overridden method
found on each path through the ancestor graph. If we find a referenced
overridden method, then we mark the overriding method as referenced too
and trace it. After we have scanned all the classes in the package and
if we found any overriding methods, we repeat the whole process, until
no further overriding methods are found.

In this way we are sure to propagate the referencedness all the way
down a possibly long chain of overrides, and we are sure to pick up new
chains of overrides that start at a method that is itself only referenced
in an overriding method.

A more straightforward approach would be possible if we linked classes
to all their subclasses and then methods to all their overriders, but
building that information would not be much simpler than what we do
here.

In the diagram,
Tracing indirect method references align=right>

the method "0" is the only one explicitly referenced in the program.
The methods labeled "1" are in subclasses that override it, and the
methods labeled "2" override those. The first time we loop through
all the classes we will find that the methods "1" are unreferenced but
override a referenced method, so we will mark them referenced and
trace them. Then either later in that scan through the classes or on
the next scan, depending on the order in which we visit the classes,
we will do the same thing for the methods labeled "2".

Renaming fields and methods

JDistill renames fields and methods in the package being shrunk so that
they are shorter. Shortening the name of a field or method reduces
space not just in the class where it is defined, but in every class
that references it. If a member (a field or method) is referenced in
a class there will be a single constant naming the class containing the
member and the name of the member itself. This constant will be shared
between all the references from the same class, so renaming will have
most effect when there are many small classes rather than a few big ones.

Renaming fields

Renaming fields is straightforward. Any field that is not href="#visiblefield">visible can be renamed. JDistill renames fields in a
given class starting at A and counting up to Z,
then a to z, then _, then
longer names starting with these characters but also including the
digits 0 to 9. This means that all new names
are valid Java identifiers. The specification of class files does not
actually require this, but it seems best to avoid possible problems with
incorrect implementations that do.

The only other constraint on chosen field names is that they must not
conflict with existing field names in the same class, specifically those
that are visible. This is easily done using a hash table that is needed
anyway in order to follow field references to the corresponding field.

Renaming methods

Renaming methods is much harder because of overriding. The field
names in a class are independent of those in its ancestors and
descendants; class files specify exactly which class's field is being
referenced. The names of methods in a class on the other hand are
intimately tied to the names of methods in its ancestors and descendants.
If a method has the same name as a method in an ancestor, then both
names must still be the same after renaming. Contrariwise, if
the name of a method is different from the name of another method in
the same class or in any ancestor or descendant, then it must remain
different after renaming.

Overloaded methods

Overloading also complicates method renaming slightly. Where we
talked about names being different in the previous paragraph, we
should really have talked about signatures. That is, the
combination of name and parameter types must be the same or
different according as there is an overriding relationship between two
methods.

Methods that must be renamed together

Before we can rename any methods, we must group together all methods
that have an overriding relationship so that they will all continue to
have the same name after renaming. This is straightforward. With
every method in every class, we associate a set of other methods.
This is usually a set of methods that must have the same name; but
if a method must not be renamed, either because it is an href="#entry">entry point or because it has an overriding relationship
with an entry point, then it will belong to a special set called the
fixed-name set.

To construct the sets of methods we proceed as follows. We visit the
classes in ancestor-first order. For each method in each class we
visit, we first check if it can be renamed at all. If not, it is
added to the fixed-name set. If so, it is added to a newly-created
set. Then, for every method that the method we are visiting
overrides, we merge together the sets for the overriding and
overridden methods. In this way, if any method in the group of
methods that must have the same name as the current method turns out
not to be renameable, all of the methods in the group will end up in
the fixed-name set. If all methods in the group can be renamed then
they will all end up in the same set.

Finding new names

We can consider finding new method names as a
graph-colouring problem. In the graph, the nodes are methods and
an edge joins two nodes if the corresponding methods must not
have the same name, because they have the same parameters and
either they are in the same class or one of them is in an
ancestor class of the other. (Methods for which this is true and
that already have the same name already have an overriding
relationship and can be coalesced into the same node.) The graph
is actually a collection of disjoint graphs, one for every
possible parameter signature. These graphs can be considered to
be coloured when we start, where the colours are method names and
no two nodes connected by an edge can have the same colour. What
we want to do is to recolour those nodes where this is allowed so
as to use more efficient colours (shorter names).

Building this graph would be expensive and unnecessarily
complicated. Instead we use some heuristics to achieve a result
that is theoretically non-optimal but will in fact almost always
be as good as optimal (not use longer names than necessary). The
heuristics are:

  1. We never recolour with a colour that was already in the graph
    before we started. This means that we only have to check the
    colours of adjacent nodes we have already coloured.
    (Translation: we build a big hashtable of every signature of
    every method and avoiding renaming a method to something that
    gives it the same signature as one of these.)

  2. We always visit the methods of a class before those of any of
    its descendants. This means that we can find the adjacent
    nodes already coloured by searching upwards in the
    inheritance dag. There are currently no downward links in it.

The reason this is nearly always as good as real colouring is
that the short names we will be using as new colours hardly ever
exist already as method names, so our first heuristic rarely makes
any difference. The exception is when some but not all of the
classes being distilled were generated by programs that make short
names (such as JDistill itself). In that case we will
typically find that we can't make those names any shorter than
they were already, and we can shorten names in the non-generated
classes less than we would be able to with an optimal algorithm.

Shortening array initializations

As we mentioned earlier, the standard JDK Java compiler from Sun takes
source code like:

    static final int[] powers = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};

and generates bytecode almost as if you had written:
    static final int[] powers;
    static {
powers = new int[10];
powers[0] = 1;
powers[1] = 2;
...
powers[10] = 1024;
    }

More exactly, the code it generates in the static initializer
<clinit> looks like this (the comments show
what is on the stack after each instruction):
    bipush 11		// 11  (size of array)
    newarray int // int[]
    dup // int[], int[]
    iconst_0 // int[], int[], 0
    iconst_1 // int[], int[], 0, 1
    iastore // int[]  (now array[0] == 1)
    dup
    iconst_1
    iconst_2
    iastore
    ...
    dup // int[], int[]
    bipush 10 // int[], int[], 10
    sipush 1024 // int[], int[], 10, 1024
    iastore // int[]  (now array[10] == 1024)
    putstatic powers // (empty; now powers == the initialized array)
    return

In cases like this where the range of values is relatively small, we can
instead initialize the array as if the code had been this:
    static final int[] powers;
    static {
powers = new int[11];
final String powersString =
    "\u0001\u0002\u0004\u0008\u0010\u0020\u0040\u0080\u0100\u0200\u0400";
for (int i = 0; i < 11; i++)
    powers[i] = powersString.charAt(i);
    }

That is, we construct a String where each character is a value to be
put in the array, and extract those values when the array is created.
This operation is slower than the original unrolled bytecode, which is
why we only do this operation in static initializers; since they are
only executed once presumably the small extra time will make no
difference. (It would be useful to have an option to perform this
operation everywhere big arrays are seen, notably in constructors,
despite the small speed penalty; but currently there is no such option.)

The bytecode we construct to unpack the values from the String into
the array looks like this:

    bipush 11		// 11  (size of array)
    newarray int // int[]
    bipush 10 // int[], 10 (which is the array index i)
  loop:
    dup2 // int[], i, int[], i
    dup // int[], i, int[], i, i
    ldc "\u0001..." // int[], i, int[], i, i, "\u0001..."
    swap // int[], i, int[], i, "\u0001...", i
    invokevirtual String.charAt(int)
// int[], i, int[], i, <character at i>
    iastore // int[], i  (now array[i] has been set)
    iconst_1 // int[], i, 1
    isub // int[], i-1  (new value of i for next iteration)
    dup // int[], i, i
    ifge loop // int[], i
    pop // int[]
    putstatic powers // (empty; now powers == the initialized array)
    return

We do all the operations on the stack rather than using local
variables, which would yield more straightforward bytecode, because we
don't want to have to allocate a local variable or try to determine if
one is free at the point where we are inserting this code. Similarly,
we add a constant to the currently-declared maximum stack depth rather
than computing what its real value is after the introduction of the
new code, because we have no need to know anything about the stack
usage of instructions anywhere else in the distiller.

If some of the array values are negative or greater than 65535 then
they will not fit in the char values of a
String. In this case, provided that the span of
values is not more than 65535, we can encode in the same way but add
a (possibly negative) constant to each value.

String constants are encoded in class files using
UTF8
encoding
, which has the property that a character c will
be encoded using:

1 byte, if 0 < c < 128;
2 bytes, if c = 0 or 127 < c <
2048;
3 bytes, otherwise (2047 < c < 65536)

So in fact we try to determine if adding a constant to each value in
the encoded string will make for a string that takes up less space in
the class file, even if the span of values is such that this is not
strictly necessary. There are only two possible values that we may
want to add: either the smallest value in the array (so that values
near that smallest value are encoded in the efficient 1 - 128 range), or
one less than the smallest value in the array (for the same reason,
but additionally the smallest value itself is encoded as 1 rather than
the less efficient 0). These are the only two possibilities because we
can't add a number bigger than the smallest value in the array
(since characters in the String cannot be negative), and
the fact that the cost function does not decrease except at 1 means that
there is no point in adding a number smaller than the smallest
value in the array minus 1. We try both possibilities mentioned to
see if either of them makes the encoded UTF8 sufficiently smaller to
compensate for the extra instructions needed to add the constant.

After computing the optimal String encoding for the
array, we estimate how much space it will take up in the class file,
taking into account the instructions in the loop and
the fact that it may be necessary to add a reference to
java.lang.String.charAt(int). If it looks as if the
rewritten code will be more expensive than the original, then no
change is made. The calculations here are not completely accurate so
sometimes we may fail to save space when we could have, but the loss is
small, typically a dozen bytes or so, and at worst roughly 50.

Why you might not want to shrink arrays

The Java Language Specification guarantees ( href="http://www.javasoft.com/docs/books/jls/html/3.doc.html#101083">§3.10.5)
that two String constants that contain
the same sequence of characters refer to the same instance of
java.lang.String. It is not clear that this guarantee is
particularly useful, but in any case it has an important consequence
for the array optimization described in the previous section. The
guarantee is enforced by interning ( href="http://www.javasoft.com/docs/books/jls/html/javalang.doc11.html#14026">JLS
§20.12.47) every String constant in a class file
when it is loaded.

A correct JVM would garbage-collect interned strings that are no
longer referenced anywhere. A Java program cannot be affected by
this: although the next time the same string is interned a different
String instance will be returned, a program cannot
know that this has happened because by hypothesis it no longer
has a reference to the original interned instance against which it
could compare the new one. (The addition in Java 1.1 of the method
java.lang.System.identityHashCode(Object) means that this
is not strictly true, because the program could hold on to the interned
string's hash code but not the string itself; some clarification from
Sun would be helpful on this matter.)

The JDK 1.1 from Sun does not garbage-collect
interned strings. We consider this a bug because a long-running program that
loads or synthesizes many classes may eventually run out of memory,
since all of the string constants in those classes are retained even
though the classes themselves are garbage-collected. The problem can
be very easily illustrated with this simple program:

    class Intern {
public static void main(String[] args) {
    for (int i = 0; ; i++) {
if (i % 1000 == 0)
    System.out.println("i = " + i);
Integer.toString(i).intern();
    }
}
    }

Try running this program and watch how it gets slower and slower as
the values of i increase. Eventually it will crash with an
OutOfMemoryError, even though none of the interned
strings is referenced.

The same program under JDK 1.2 does not exhibit the same behaviour:
href="http://developer.javasoft.com/developer/bugParade/bugs/4035345.html">the
bug has been fixed there.

The reason this might be cause for concern if you are using a
1.1-based system is that the string constants introduced to intialize
arrays may take up less space in the class file, but chances are that
they will take up more space in memory as interned strings
when the distilled program is run. And they will stay around even if
the class they were in is garbage-collected. Typically this is not so
much of a concern if your main motivation for distilling the class
files was to save disk space or transmission time; but if it was
instead to save space in a low-memory environment such as an embedded
system then you will want to make measurements to see whether this
optimization really does save memory. If not, distill with
-no-array-shrink.

Futher work

A number of improvements to JDistill are possible.

  • Rename classes as well as fields and methods. Classes are a
    little more difficult to rename than fields and methods because
    references to them can be hidden inside method signatures, but this is
    still an easy improvement to make.

  • Shorten a broader class of array initializations. Currently
    JShrink only considers shortening arrays of integer type where the
    difference between the largest and smallest elements is not more
    than 65535. It would be easy to extend this to all integer arrays
    by using more than one char per array element; to
    floating-point arrays by treating the floating-point numbers as
    integers using java.lang.Float.floatToIntBits(float)
    etc; and to String arrays by finding a character not
    used in any of the strings and using it as a separator so that the
    array can be broken apart with
    java.util.StringTokenizer. These optimizations are
    presented in decreasing order of likely usefulness.

  • Shorten array initializations everywhere they are seen, not just
    in class initializers. Since the rewritten initializations are
    slightly slower than the originals, this would be under control of
    a command-line option, perhaps saying for each method whether
    initializations in that method should be shortened, in the manner
    of the -keep-method option.

  • Inline methods that are determined not to be overridden. Inlining
    methods starting from bytecode is relatively difficult in general
    because the bytecode preceding the method invocation must be
    analysed to determine what the arguments are, and bytecode in the
    invoked method must be rewritten to incorporate those arguments.
    But object-oriented programs often have empty methods and those at
    least could be inlined without any great difficulty.

  • Delete pointless overrides. If a method consists of nothing but a
    call to the method of the same name in the superclass, and if the
    access modifiers of the parent and child methods are the same, the
    child method can be deleted. This could be subject to a
    command-line option, since it would be visible to reflection. The
    advantage is that programmers can write, e.g,
        public String toString() {
    if (debug)
        return complicatedOperationToProduceStringRepresentation();
    else
        return super.toString();
        }

    knowing that this code will be removed if debug is a
    constant equal to false.

Related information

This work was inspired by seeing the reductions that the href="http://www.sbktech.org/hashjava.html">Hashjava obfuscator
could achieve, even though its primary focus is obfuscation rather
than size reduction. It does not rename overloaded methods, nor does
it trace bytecode to find unreferenced items. Hashjava is written
entirely in Java.

DashO is a
bytecode-to-bytecode performance optimizer by preEmptive Solutions.
It does much more than JDistill, actually rewriting bytecode to
perform optimizations that the compiler could have done but did not.
Its "DashO-Pro" version does these optimizations and also removes
unnecessary information just as JDistill does. PreEmptive Solutions
claims to have a trademark and a patent pending on its `Overload
Induction Renaming System', which appears to be a solution to the href="#renamemethods">method renaming graph problem described
above. They do not say exactly what their patent would cover.

JDistill was originally called JShrink, until we discovered that that
name was already taken by href="http://www.e-t.com/jshrink.html">Eastridge Technology.
From their description, Jshrink works in a similar way to JDistill.
They do not mention if it handles overridden methods. Their product
is not written in Java and only runs on Microsoft operating systems.

Contact information

Send technical questions to href="mailto:emcmanus@opengroup.org">emcmanus@opengroup.org.

2011 Éamonn says: well, you can try, but I don't think it will work.

AttachmentSize
jdistill.1.gif521 bytes
Related Topics >>