
Native XML support in Dolphin

Posted by kirillcool on July 16, 2005 at 1:10 AM PDT

In the "Evolving the Java Language" technical session at the last JavaOne, Mark Reinhold shed some light on the future of Java XML support.
The slides for the technical sessions have finally been uploaded to the conference webpage, so download the PDF for session TS-7955. The executive summary of the relevant slides is:

  • DOM is excellent feature-wise, but requires too much code to be written, and reads poorly.
  • JDOM is more concise, and if you tilt your head and squint, it looks almost like XML.
  • JAXB 2.0 is schema-driven and cannot handle "free" XML (without a schema definition) or "intermediate" XML (in the middle of processing).
  • It would be great to write XML directly, but that would drive the compiler guys insane.
  • One possibility is using the hash mark # to refer to XML attributes or subelements.
  • This can be used either to get or to set a value.
  • The enhanced for loop should allow looping over all instances of some attribute.
  • A lot of generified suggestions for new classes (XML and such).



And now, hoping the above list did not offend too many lawyers, let's proceed. As outlined in the first three items, the current state of affairs in working with XML is far from satisfactory. Take a web designer: the job usually means exceptional HTML and CSS skills, plus decent JavaScript. JavaScript is very far from Java, but it's fairly easy to use. How about throwing in an XML parser? We can all be excited about StAX, but the "ease of use" column is somewhat misleading. It's easy to use when you talk to your dog in JVM bytecodes, but it's certainly not easy for non-technical folks.
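To put the first bullet in perspective, here is what reading a single element looks like with the DOM API that ships in the JDK today (a minimal sketch; the sample XML string and class name are mine, chosen for illustration):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomVerbosity {
    public static void main(String[] args) throws Exception {
        String xml = "<customer><name>Dan</name></customer>";
        // three levels of factories and builders just to get a parse tree
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // and a chain of calls to pull out one text value
        String name = doc.getDocumentElement()
                .getElementsByTagName("name").item(0).getTextContent();
        System.out.println(name);
    }
}
```

Four lines of plumbing for one value is exactly the "too much code" complaint above.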

IBM continues to develop XJ - XML enhancements for Java. XJ treats XSD schemas as first-class citizens, allowing you to import them as regular Java classes, read and write attributes and elements, unmarshal strings, streams and files to "virtual" objects, and marshal these objects back (a "virtual" object is an object of a class that directly corresponds to some schema artifact, without the need to explicitly generate this class). As Mark pointed out during his talk, this approach is too restrictive: you work only on XML that is valid according to a predefined set of schemas.

The same can be said about JAXB 2.0. It doesn't matter whether you start with a schema and generate classes, or start with your classes. If the input cannot be completely mapped to your classes, the unmarshaller will fail. In addition, you cannot add arbitrary elements during marshalling.

The approach that Mark outlined in his talk is the complete opposite - no schema, no class for the data, only working with XML tags (which in the proposed syntax cannot even be externalized). Complete freedom that comes at the cost of validation, type checking, and a syntax that is far from readable (except for your JVM-fluent dog).

So, what am I looking for? Suppose I have two simple classes, Customer and Order, that look like this:

class Order {
  // has get-set pair
  private int id;
}

class Customer {
  // has get-set pair
  private int id;
  // has get-set pair
  private String name;
  // has get-set pair
  private List<Order> orders;
}

Annotating for JAXB 2.0 can be as simple as putting the following on each class:

@XmlAccessorType(AccessType.FIELD)

And maybe the following on Customer:

@XmlRootElement(name = "customer")

Taking a simple XML

<customer>
  <id>1</id>
  <name>Dan</name>
  <order>
    <id>1</id>
  </order>
  <order>
    <id>1</id>
  </order>
</customer>

I'd like to be able to simply write

String xml = ...;  // contains the above XML
Customer cust = xml;

with auto-unboxing calling the JAXB 2.0 unmarshaller (which is already part of Mustang). The same auto-unboxing should be provided for File, Reader and InputStream as well. If I want to change my customer, I simply change the field:

cust.setName("Arnold");

The marshalling should be as simple as the unmarshalling:

String newXml = cust;

Here, auto-boxing should be called. In this case, the default toString() behaviour can be marshalling with JAXB 2.0 (in case the Customer class does not override the default implementation of toString()). Auto-boxing should also be provided for Writer, OutputStream and File.
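For contrast, here is roughly what that round trip costs today with the plain JAXB 2.0 API (a sketch: it uses the final XmlAccessType and @XmlElement names rather than the early-access AccessType shown above, and the class layout is mine):

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.*;

@XmlAccessorType(XmlAccessType.FIELD)
class Order {
    private int id;
    public int getId() { return id; }
}

@XmlRootElement(name = "customer")
@XmlAccessorType(XmlAccessType.FIELD)
class Customer {
    private int id;
    private String name;
    // map the repeated <order> elements onto the list field
    @XmlElement(name = "order")
    private List<Order> orders = new ArrayList<Order>();
    public String getName() { return name; }
    public void setName(String n) { name = n; }
    public List<Order> getOrders() { return orders; }
}

public class JaxbRoundTrip {
    public static void main(String[] args) throws Exception {
        String xml = "<customer><id>1</id><name>Dan</name>"
                + "<order><id>1</id></order><order><id>1</id></order></customer>";
        JAXBContext ctx = JAXBContext.newInstance(Customer.class);
        // the two lines the proposed auto-unboxing would collapse into one
        Customer cust = (Customer) ctx.createUnmarshaller()
                .unmarshal(new StringReader(xml));
        System.out.println(cust.getName());
        System.out.println(cust.getOrders().size());
        cust.setName("Arnold");
        // and the explicit marshalling that auto-boxing would hide
        StringWriter out = new StringWriter();
        ctx.createMarshaller().marshal(cust, out);
        System.out.println(out.toString().contains("Arnold"));
    }
}
```

The context, the casts and the explicit marshaller calls are exactly the plumbing the proposal would push into the compiler.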

Looping over elements in Customer should be kept as simple as possible:

for (Order order : cust.getOrders[id>3]) {
  System.out.println(order.getId());
}

Here, the compiler knows the exact type of the id field and can invoke its getter. The number of extra syntax elements (hash marks, slashes, apostrophes) should be zero. The code should be easy to read.
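In today's Java the same filter has to be spelled out by hand; the proposed getOrders[id>3] is essentially shorthand for the loop below (a self-contained sketch with hypothetical sample data):

```java
import java.util.ArrayList;
import java.util.List;

public class OrderFilter {
    static class Order {
        private final int id;
        Order(int id) { this.id = id; }
        int getId() { return id; }
    }

    public static void main(String[] args) {
        List<Order> orders = new ArrayList<Order>();
        orders.add(new Order(2));
        orders.add(new Order(5));
        orders.add(new Order(7));
        // hand-written equivalent of the proposed cust.getOrders[id > 3]
        for (Order order : orders) {
            if (order.getId() > 3) {
                System.out.println(order.getId());
            }
        }
    }
}
```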

Now, the more interesting problem. What if we don't have a schema? What if we are working on XML that has extra elements or attributes that our functions should simply ignore? What if we need to add extra elements or attributes for subsequent modules? The answer is simple, was introduced long ago in Java, and was reinforced in 5.0 with generics: the extends keyword. Combined with the "implicit" properties and functions introduced on enums in 5.0, it can provide clean solutions to the problems stated above.

Suppose that I get the following XML

<customer>
  <id>1</id>
  <name>Dan</name>
  <age>32</age>
  <order>
    <id>1</id>
    <extId>1000</extId>
  </order>
  <order>
    <id>1</id>
    <extId>2000</extId>
  </order>
</customer>

The age and extId elements cannot be mapped to the Customer and Order classes. However, our code doesn't use these elements at all. How can we make our code work and the compiler happy? Make the compiler perform an implicit narrowing conversion:

String xml = ...;  // contains the above XML
Customer cust = xml;

The code is exactly as it was. The unmarshaller should simply discard all the "irrelevant" information, just as is done with a regular upcast. Of course, a regular upcast doesn't really change the class, so you can always downcast back (at your own risk). This will be clarified in the following examples.

Suppose now that you wish to handle the new fields, but you can't change the Customer and Order classes (for example, they are part of an external jar). Now the code can look like this:

String xml = ...;  // contains the above XML
extends Customer cust = xml;

The extends keyword instructs the compiler (and the unmarshaller) to keep the extra information (exactly as done with enums and the name() function). The keyword applies to all internal elements (Order in our case). How do we access the new (undeclared) fields? The same way as regular fields:

System.out.println(cust.age);

Here, the unmarshaller stored the value of age in some internal map, and the compiler retrieves that value for us. There are three possible cases:

  • Single instance with simple value
  • Single instance with complex value (inner elements)
  • Multiple instances

In the first case the value class will be String, in the second extends Object, and in the third List. The compiler should allow looping over elements even if they are single entries:


// must extend Object as we don't know its type
for (extends Object age : cust.age) {
  // implicit function provided by the compiler
  if (age.isSimple()) {
    // the cast will succeed
    System.out.println((String)age);
  }
}
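Under the hood, the three cases map onto plain Java types. Here is a rough sketch of the internal map such an unmarshaller might keep; the map layout and the instanceof-for-isSimple() analogy are my assumptions, not part of the proposal:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ExtrasSketch {
    public static void main(String[] args) {
        // hypothetical map kept by an "extends"-aware unmarshaller
        Map<String, Object> extras = new HashMap<String, Object>();
        extras.put("age", "32");                            // single, simple value
        extras.put("extId", Arrays.asList("1000", "2000")); // multiple instances

        Object age = extras.get("age");
        // instanceof String plays the role of the implicit isSimple()
        if (age instanceof String) {
            System.out.println((String) age);
        }
    }
}
```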

What about adding new elements? Simply call

cust.weight = 180;

This can only compile on an extends Customer. If the type of cust is Customer, the compiler should issue an error message.

Continuing this line of thought, the regular rules for casting, narrowing and passing objects as parameters apply:

private Customer cust1;
private extends Customer cust2;

void test() {
  // narrowing implicit cast - all undeclared
  // attributes and elements are discarded
  cust1 = cust2;
  // widening implicit cast
  cust2 = cust1;
}

Here we have a special case - although cust1 and cust2 point to the same "base" object, changes to cust1 are seen in cust2, while changes to cust2 are seen in cust1 only on declared fields. If we have another extends Customer cust3 that points to cust2, it's the same object. In this case, all three point to the same object in memory, but calling the marshaller on cust1 will emit only the declared fields, while calling it on cust2 or cust3 will emit all fields. This way the referencing model is preserved, and the compiler does not allow adding or retrieving undeclared fields through cust1.

Another example -

Set<Customer> customers = new HashSet<Customer>();
Customer cust1 = ...;
extends Customer cust2 = ...;
// implicit widening cast
customers.add(cust1);
// regular insert
customers.add(cust2);
for (Customer cust : customers) {
  System.out.println(cust.age);
}

In this case, the function will print null for cust1.age - cust1 was implicitly widened based on the type of the Set, but doesn't contain information on age.
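The closest analogy in today's Java is an ordinary subclass carrying the extra data: widening the reference into a Set<Customer> leaves the object intact, exactly as the entry argues. A sketch (all class and field names here are hypothetical stand-ins, and a LinkedHashSet is used only to make the output order predictable):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class WideningSketch {
    static class Customer {
        String name;
        Customer(String name) { this.name = name; }
    }
    // stand-in for "extends Customer": same object, extra data attached
    static class ExtendedCustomer extends Customer {
        String age;
        ExtendedCustomer(String name, String age) { super(name); this.age = age; }
    }

    public static void main(String[] args) {
        Set<Customer> customers = new LinkedHashSet<Customer>();
        customers.add(new Customer("Dan"));                  // no age information
        customers.add(new ExtendedCustomer("Arnold", "32")); // carries age
        for (Customer cust : customers) {
            // through the widened reference, age is only reachable
            // on the object that actually carries it
            String age = (cust instanceof ExtendedCustomer)
                    ? ((ExtendedCustomer) cust).age : null;
            System.out.println(cust.name + " " + age);
        }
    }
}
```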



Another example:

void foo(Customer cust) {
  // widening cast - undeclared fields
  // can appear after the call.
  bar(cust);
}
void bar(extends Customer cust) {
  // implicit narrowing cast - only declared fields
  // can be changed.
  foo(cust);
}

The extended type may provide access to its elements, as enum does (with implicit functions):

Map getAllElements();

where each entry value can be either a String or, for collections, a List.


The last example is a recursive function that traverses the input XML and dumps its contents to the console. Arguably, this function is not much simpler than its DOM counterpart. However, most of its logic is both straightforward and simple. The isSimple(), isSingle() and getElements() functions are generated implicitly by the compiler:

void dumpXml(extends Object xmlObj) {
  // see if it is a simple (and implicitly single) element
  if (xmlObj.isSimple()) {
    System.out.println((String)xmlObj);
    return;
  }
  // see if it is a single (and complex because of the previous
  // check) element
  if (xmlObj.isSingle()) {
    // iterate over inner elements. The getElements()
    // function is generated implicitly in the same way as
    // name() is for enums
    for (Map.Entry element : xmlObj.getElements().entrySet()) {
      System.out.println(element.getKey()); // element tag name
      dumpXml(element.getValue());          // recurse on value
    }
    return;
  }
 
  // here - we have multiple instances of element with the same
  // tag (collection). Can call .isMultiple() function on
  // xmlObj
  for (extends Object child : xmlObj.getElements().values()) {
    dumpXml(child);             // recurse on the current child
  }
}
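For comparison, here is a runnable DOM counterpart of dumpXml using only JDK classes (the class name and sample XML are mine; DOM merges the isSimple/isSingle/multiple distinction into node types instead):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DumpDom {
    static void dumpXml(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // simple value: print the trimmed text content
                String text = child.getTextContent().trim();
                if (text.length() > 0) System.out.println(text);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                System.out.println(child.getNodeName()); // element tag name
                dumpXml(child);                          // recurse on value
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<customer><id>1</id><name>Dan</name></customer>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        dumpXml(doc.getDocumentElement());
    }
}
```

Note the node-type constants and index-based NodeList loop - the boilerplate the implicit getElements() iteration is meant to remove.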



Answers to selected comments


The marshalling and unmarshalling exceptions (including I/O and XML format) should be declared as unchecked. A few new exception classes (maybe even just one) should "envelop" the existing exceptions (there are too many of them already). If your code wishes to provide corresponding support, you will have to catch the new exceptions and deal with them accordingly.

The examples are based on fields rather than accessors, but with JAXB 2.0 this remains purely a choice. You can go either way (and don't forget that XML attributes and elements do imply a straightforward field implementation). For undeclared elements (such as age in the examples above), there can only be public field-style access - after all, they are not declared on the corresponding data class.

XJ's support of undeclared attributes requires the same approach as working with DOM and XPath. The only way to parse XML that is not valid against some schema is to use the XMLElement class, which brings back the "ease of use" of DOM. The only example I could find of getting information out of an XMLElement uses an XPath query (in SequenceInstanceOf.xj). All other examples (which are very scarce) use this class to wrap something inside a tag and either output it as XML or put it inside another tag. There are two problems with this approach:

  • Partial processing of XML cannot use schema-derived types. If I know that the input XML contains fields that I wish to ignore, I just don't define them in my data model and work on an extends object. In XJ, I'd have to work on XMLElement even if I wish to ignore some fields (unless I have a schema definition for every intermediate XML state in my business-process chain).
  • XMLElement is not a run-time class as far as I can see from the bundled distribution (the documentation is very scarce). The only way to get elements out of it is via XPath, even for simple attributes.

Regarding the Xen type system (thanks for the link). It appears that the authors are trying to mold the existing language into schema-based classes. Their approach modifies class declarations completely, making the language syntax follow XSD artifacts. Clearly, this is ill-suited not only to DTD and Relax NG, but also to the existing code base. Suppose you have a back-end library that provides a lot of functions. All you need to do (in the approach proposed in this entry) to make this library work with XML is annotate the classes accordingly. With Xen, you'd have to rewrite your library from scratch or create a data-mapping layer to map the new classes to the old data classes. Looking into the future: the code shown above does not even know that it's working with XML. Tomorrow (a few years from now) this code could be converted to work with another data format (binary XML, OODBMS) without touching the business logic, effectively separating the data layer from the business layer.
