Kirill Grouchnikov

Kirill Grouchnikov's Blog

Native XML support in Dolphin

Posted by kirillcool on July 16, 2005 at 01:10 AM | Comments (18)

In the "Evolving the Java language" technical session at the last JavaOne, Mark Reinhold shed some light on the future of Java XML support. The slides for the technical sessions have finally been uploaded to the conference webpage, so download the PDF for session TS-7955. The executive summary of the relevant slides is:
  • DOM is excellent feature-wise, but requires too much code to be written, and reads poorly.
  • JDOM is more concise, and if you tilt your head and squint, it looks almost like XML.
  • JAXB 2.0 is schema-driven and cannot handle "free" XML (without a schema definition) or "intermediate" XML (in the middle of processing).
  • It would be great to write XML directly, but that would drive the compiler guys insane.
  • One possibility is to use the hash mark # to refer to XML attributes or subelements.
  • This can be used either to get a value or to set a new one.
  • The enhanced for loop should allow looping over all instances of some attribute.
  • A lot of generified suggestions for new classes (XML and such).

And now, hoping the above list did not offend too many lawyers, let's proceed. As outlined in the first three items, the current state of affairs in working with XML is far from satisfactory. A typical web designer has exceptional HTML and CSS skills, and good JavaScript. JavaScript is very far from Java, but it's fairly easy to use. How about throwing in an XML parser? We can all be excited about StAX, but the "ease of use" column is somewhat misleading. It's easy to use when you talk to your dog in JVM bytecodes, but it's certainly not easy for non-technical guys.

IBM continues to develop XJ - XML enhancements for Java. An XJ program treats XSD schemas as "first-class" citizens, allowing you to import them as regular Java classes, read and write attributes and elements, unmarshal strings, streams and files to "virtual" objects and marshal these objects back (a "virtual" object is an object of a class that directly corresponds to some schema artifact, without the need to generate this class explicitly). As Mark pointed out during his talk, this approach is too restrictive - you can work only on XML that is valid according to a predefined set of schemas.

The same can be said about JAXB 2.0. It doesn't matter if you start with a schema and create classes, or if you start with your classes. Whenever the input cannot be completely mapped to your classes, the unmarshaller will fail. In addition, you cannot add arbitrary elements during marshalling.

The approach that Mark outlined in his talk is the complete opposite - no schema, no class for the data, only working with XML tags (which in the proposed syntax cannot even be externalized). Complete freedom, which comes at the cost of validation and typechecking becoming optional, and of syntax that is far from readable (except for your JVM-compliant dog).

So, what am I looking for? Suppose I have two simple classes, Customer and Order, that look like this:
class Order {
  // has get-set pair
  private int id;
}

class Customer {
  // has get-set pair
  private int id;
  // has get-set pair
  private String name;
  // has get-set pair
  private List<Order> orders;
}
A simple JAXB 2.0 annotation setup puts the following on each class:
@XmlAccessorType(AccessType.FIELD)
And maybe the following on Customer:
@XmlRootElement(name = "customer")
Taking a simple XML
<customer>
  <id>1</id>
  <name>Dan</name>
  <order>
    <id>1</id>
  </order>
  <order>
    <id>1</id>
  </order>
</customer>
I'd like to be able to simply write
String xml = ...;  // contains the above XML
Customer cust = xml;
with the auto-unboxing calling the JAXB 2.0 unmarshaller (which is already part of Mustang). The same auto-unboxing should be provided for File, Reader and InputStream as well. If I want to change my customer, I simply change the field:
cust.setName("Arnold");
The marshalling should be as simple as the unmarshalling:
String newXml = cust;
Here, the auto-boxing should be called. In this case, the default behaviour of toString() can be marshalling with JAXB 2.0 (provided the Customer class does not override the default implementation of toString()). Auto-boxing should also be provided for Writer, OutputStream and File.

Looping over elements in Customer should be kept as simple as possible:
for (Order order : cust.getOrders[id>3]) {
  System.out.println(order.getId());
}
Here, the compiler knows the exact type of the id field and can invoke the getter function. The number of extra syntax elements (hash marks, slashes, apostrophes) should be zero. The code should be easy to read.
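For comparison, here is a minimal sketch of the same loop in Java as it stands (version 8 or newer), with a stream filter standing in for the proposed cust.getOrders[id>3]. The bodies of Order and Customer and the ordersWithIdAbove helper are assumptions made for illustration; none of this is part of the proposal itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the Customer and Order classes of the post.
final class OrderFilterSketch {
    static final class Order {
        private final int id;
        Order(int id) { this.id = id; }
        int getId() { return id; }
    }

    static final class Customer {
        private final List<Order> orders = new ArrayList<>();
        List<Order> getOrders() { return orders; }
    }

    // Plain-Java approximation of the proposed cust.getOrders[id>threshold].
    static List<Order> ordersWithIdAbove(Customer cust, int threshold) {
        return cust.getOrders().stream()
                .filter(order -> order.getId() > threshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Customer cust = new Customer();
        for (int id = 1; id <= 5; id++) {
            cust.getOrders().add(new Order(id));
        }
        for (Order order : ordersWithIdAbove(cust, 3)) {
            System.out.println(order.getId()); // prints 4, then 5
        }
    }
}
```

The predicate is still checked by the compiler against the type of getId(), which is the property the proposed syntax is after; the difference is only in how much ceremony surrounds it.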

Now, the more interesting problem. What if we don't have a schema? What if we are working on XML that has extra elements or attributes that our functions should simply ignore? What if our functions need to add extra elements or attributes for subsequent modules? The answer is simple, and was introduced long ago in Java, and reinforced in 5.0 with generics - the extends keyword. Combined with the "implicit" properties and functions introduced on enums in 5.0, it can provide clean solutions to the problems stated above.

Suppose that I get the following XML
<customer>
  <id>1</id>
  <name>Dan</name>
  <age>32</age>
  <order>
    <id>1</id>
    <extId>1000</extId>
  </order>
  <order>
    <id>1</id>
    <extId>2000</extId>
  </order>
</customer>
The age and extId elements cannot be mapped to the Customer and Order classes. However, our code doesn't use these elements at all. How can we make our code work and the compiler happy? Make the compiler perform an implicit narrowing conversion:
String xml = ...;  // contains the above XML
Customer cust = xml;
The code is exactly as it was. The unmarshaller should simply discard all the "irrelevant" information, just as is done with a regular upcast. Of course, a regular upcast doesn't really change the class, so you can always downcast back (at your own risk). This will be clarified in the following examples.

Suppose now that you wish to handle the new fields, but you can't change the Customer and Order classes (for example, they are part of an external jar). Now the code can look like this:
String xml = ...;  // contains the above XML
extends Customer cust = xml;
The extends keyword instructs the compiler (and the unmarshaller) to keep the extra information (exactly as is done with enums and the name() function). This keyword applies to all internal elements (Order in our case). How can we access the new (undeclared) fields? The same way as regular fields:
System.out.println(cust.age);
Here, the unmarshaller stored the value of age in some internal map, and the compiler retrieves that value for us. Here, there are three possible cases:
  • Single instance with simple value
  • Single instance with complex value (inner elements)
  • Multiple instances
In the first case, the value class will be String, in the second - extends Object and in the third - List<extends Object>. The compiler should allow looping over elements even if they are single entries:

// must extend Object as we don't know its type
for (extends Object age : cust.age) {
  // implicit function provided by the compiler
  if (age.isSimple()) {
    // the cast will succeed
    System.out.println((String)age);
  }
}
What about adding new elements? Simply call
cust.weight = 180;
This can only compile on extends Customer. If the type of "cust" is Customer, the compiler should issue an error message.
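One way to picture the "internal map" is the sketch below. It is not the proposed compiler support, just a hand-written approximation on top of the standard DOM parser: id and name are treated as declared, and everything else lands in a map. The class name ExtendedCustomerSketch, the extras map and the helper methods are all hypothetical, and orders are left out to keep it short. A single simple element is stored as a String, a repeated tag as a List, and a complex element as the raw DOM node.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical approximation of "extends Customer": declared fields plus a map.
final class ExtendedCustomerSketch {
    int id;          // declared
    String name;     // declared
    // Undeclared elements by tag name: String, List<Object> or a DOM Node.
    final Map<String, Object> extras = new LinkedHashMap<>();

    static ExtendedCustomerSketch parse(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not readable as XML", e);
        }
        ExtendedCustomerSketch cust = new ExtendedCustomerSketch();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments
            }
            String tag = n.getNodeName();
            if (tag.equals("id")) {
                cust.id = Integer.parseInt(n.getTextContent().trim());
            } else if (tag.equals("name")) {
                cust.name = n.getTextContent().trim();
            } else {
                cust.addExtra(tag, isSimple(n) ? n.getTextContent().trim() : n);
            }
        }
        return cust;
    }

    // An element is "simple" when it has no child elements.
    static boolean isSimple(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return false;
            }
        }
        return true;
    }

    // A second occurrence of a tag turns the stored value into a List.
    @SuppressWarnings("unchecked")
    void addExtra(String tag, Object value) {
        Object existing = extras.get(tag);
        if (existing == null) {
            extras.put(tag, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            List<Object> values = new ArrayList<>();
            values.add(existing);
            values.add(value);
            extras.put(tag, values);
        }
    }

    public static void main(String[] args) {
        ExtendedCustomerSketch cust =
                parse("<customer><id>1</id><name>Dan</name><age>32</age></customer>");
        System.out.println(cust.extras.get("age")); // stands in for cust.age
        cust.extras.put("weight", "180");           // stands in for cust.weight = 180
        System.out.println(cust.extras.keySet());
    }
}
```

What the sketch cannot reproduce is the compile-time part: here nothing stops a plain caller from touching the map, whereas in the proposal only an extends Customer reference would be allowed to.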

Continuing this line of thought, the regular rules for casting, narrowing and passing objects as parameters apply:
private Customer cust1;
private extends Customer cust2;

void test() {
  // narrowing implicit cast - all undeclared
  // attributes and elements are discarded
  cust1 = cust2;
  // widening implicit cast
  cust2 = cust1;
}
Here we have a special case - although both cust1 and cust2 point to the same "base" object, changes to cust1 are seen in cust2, but changes to cust2 are seen in cust1 only on declared fields. If we have another extends Customer cust3 that points to cust2, it's the same object. In this case, all three are pointing to the same object in memory, but calling the marshaller on cust1 will emit only declared fields, while calling the marshaller on cust2 or cust3 will emit all fields. This way, the referencing model is preserved, and the compiler does not allow adding or retrieving undeclared fields through cust1.
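The referencing model can be approximated today with two view classes over one backing object. This is only a sketch under assumptions: CustomerData, CustomerView and ExtendedCustomerView are hypothetical names, the hand-rolled marshal() methods stand in for the real marshaller, and no XML escaping is done.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SharedViewSketch {
    // The single object in memory: declared fields plus undeclared ones.
    static final class CustomerData {
        int id;
        String name;
        final Map<String, String> undeclared = new LinkedHashMap<>();
    }

    // Stands in for a plain "Customer" reference: declared fields only.
    static final class CustomerView {
        private final CustomerData data;
        CustomerView(CustomerData data) { this.data = data; }
        void setName(String name) { data.name = name; }
        ExtendedCustomerView asExtended() { return new ExtendedCustomerView(data); }
        String marshal() {
            return "<customer><id>" + data.id + "</id><name>" + data.name
                    + "</name></customer>";
        }
    }

    // Stands in for an "extends Customer" reference over the same object.
    static final class ExtendedCustomerView {
        private final CustomerData data;
        ExtendedCustomerView(CustomerData data) { this.data = data; }
        void set(String tag, String value) { data.undeclared.put(tag, value); }
        CustomerView asCustomer() { return new CustomerView(data); }
        String marshal() {
            StringBuilder sb = new StringBuilder("<customer><id>").append(data.id)
                    .append("</id><name>").append(data.name).append("</name>");
            for (Map.Entry<String, String> e : data.undeclared.entrySet()) {
                sb.append('<').append(e.getKey()).append('>').append(e.getValue())
                  .append("</").append(e.getKey()).append('>');
            }
            return sb.append("</customer>").toString();
        }
    }

    public static void main(String[] args) {
        CustomerData data = new CustomerData();
        data.id = 1;
        data.name = "Dan";
        ExtendedCustomerView cust2 = new ExtendedCustomerView(data);
        cust2.set("age", "32");
        CustomerView cust1 = cust2.asCustomer(); // same object, narrower view
        cust1.setName("Arnold");                 // visible through cust2 as well
        System.out.println(cust1.marshal());     // declared fields only
        System.out.println(cust2.marshal());     // declared and undeclared fields
    }
}
```

The two views deliberately do not inherit from each other: with ordinary subclassing, marshal() would be dispatched on the runtime class, while the proposal wants the static type of the reference to decide what is emitted.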

Another example -
Set<extends Customer> customers = new HashSet<extends Customer>();
Customer cust1 = ...;
extends Customer cust2 = ...;
// implicit widening cast
customers.add(cust1);
// regular insert
customers.add(cust2);
for (extends Customer cust : customers) {
  System.out.println(cust.age);
}
In this case, the code will print null for cust1.age - it was implicitly widened based on the type of the Set, but doesn't contain information on age.

Another example:
void foo(Customer cust) {
  // widening cast - undeclared fields
  // can appear after the call.
  bar(cust);
}
void bar(extends Customer cust) {
  // implicit narrowing cast - only declared fields
  // can be changed.
  foo(cust);
}
The extended type may provide access to its elements, as enum does (with implicit functions):
Map getAllElements();
where each entry can be either String or List for collections.

The last example is a recursive function that traverses the input XML and dumps its contents to the console. Arguably, this function is not much simpler than its counterpart for DOM. However, most of its logic is both straightforward and simple. The functions isSimple(), isSingle() and getElements() are generated implicitly by the compiler:
void dumpXml(extends Object xmlObj) {
  // see if it is a simple (and implicitly single) element
  if (xmlObj.isSimple()) {
    System.out.println((String)xmlObj);
    return;
  }
  // see if it is a single (and complex because of the previous
  // check) element
  if (xmlObj.isSingle()) {
    // iterate over inner elements. The getElements()
    // function is generated implicitly in the same way as 
    // name() is for enums
    for (Map.Entry element : xmlObj.getElements().entrySet()) {
      System.out.println(element.getKey()); // element tag name
      dumpXml(element.getValue());          // recurse on value
    }
    return;
  }
  
  // here - we have multiple instances of element with the same
  // tag (collection). Can call .isMultiple() function on 
  // xmlObj
  for (extends Object child : xmlObj.getElements().values()) {
    dumpXml(child);             // recurse on the current child
  }
}
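For reference, a rough DOM counterpart of dumpXml might look like the sketch below. It relies only on the standard javax.xml.parsers and org.w3c.dom APIs; the DomDumpSketch class and its method names are assumptions, and the output is collected into a String rather than printed directly so that it is easy to inspect.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomDumpSketch {
    // Appends tag names and non-blank text values, one per line, depth first.
    static void dump(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(node.getNodeName()).append('\n');   // element tag name
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(text).append('\n');             // simple value
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            dump(child, out);                              // recurse on children
        }
    }

    static String dumpXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            dump(doc.getDocumentElement(), out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not readable as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dumpXml(
                "<customer><id>1</id><name>Dan</name></customer>"));
    }
}
```

The traversal itself is about as long as the proposed version; most of the extra weight is parser setup and checked exceptions, not the recursion.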


Answers to selected comments

The marshalling and unmarshalling exceptions (including I/O and XML format) should be declared as unchecked. A few new exception classes (maybe even one) should "envelop" the existing exceptions (there are too many of them already). If your code wishes to provide corresponding support - you will have to catch the new exceptions and deal with them accordingly.
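A minimal sketch of such an "enveloping" exception, assuming a single hypothetical class named XmlBindingException and using the standard DOM parser as a stand-in for the unmarshaller:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class UncheckedXmlSketch {
    // Hypothetical unchecked exception that envelops I/O and XML format failures.
    static final class XmlBindingException extends RuntimeException {
        XmlBindingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Callers are not forced to catch anything; the original cause is preserved.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new XmlBindingException("Could not read XML", e);
        }
    }

    public static void main(String[] args) {
        try {
            parse("<customer><id>1</id>"); // not well-formed: </customer> is missing
        } catch (XmlBindingException e) {
            System.out.println("Rejected: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Code that does not care simply calls parse(); code that does care catches the one new exception type and inspects getCause().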

The examples are based on attributes rather than accessors, but using JAXB 2.0 this remains purely a choice. You can go either way (and don't forget that XML attributes and elements do imply straightforward field implementation). For undeclared elements (such as age in the examples above), there can be only a public attribute-style access - after all, they are not declared on the corresponding data class.

XJ's support of undeclared attributes requires the same approach as working with DOM and XPath. The only option to parse XML that is not valid according to some schema is to use the XMLElement class, which brings back the "ease-of-use" of DOM. The only example I could find of getting information out of XMLElement was using an XPath query (in SequenceInstanceOf.xj). All other examples (which are very scarce) use this class to wrap something inside some tag and either output it as XML or put it inside another tag. There are two problems with this approach:
  • Partial processing of XML cannot use schema-derived types. If I know that the input XML contains fields that I wish to ignore, I just don't define them in my data model and work on an extends object. In XJ, I'll have to work on XMLElement even if I wish to ignore some fields (unless I have a schema definition for every intermediate XML state in my business process chain).
  • XMLElement is not a run-time class as far as I can see from the bundle distribution (the documentation is very scarce). The only way to get elements from it is using XPath, even for simple attributes.
Regarding the Xen type system (thanks for the link). It appears that the authors are trying to mold an existing language into schema-based classes. The proposed approach modifies the class declarations completely, making the language syntax follow XSD artifacts. Clearly, this is ill-suited not only to DTD and Relax NG, but also to the existing code base. Suppose you have a back-end library that provides a lot of functions. All you need to do (in the approach proposed in this entry) to make this library work with XML is annotate the classes accordingly. With Xen, you'll have to rewrite your library from scratch or create a data-mapping layer to map from the new classes to the old data classes. Looking into the future - the code shown above does not even know that it's working with XML. Tomorrow (a few years from now) this code can be converted to work with another data format (binary XML, OODBS) without touching the business logic, effectively separating the data layer from the business layer.

Comments
Comments are listed in date ascending order (oldest first)

  • You write about "auto-unboxing calling JAXB 2.0 unmarshaller ..."

    I wonder how IO- and parser exceptions could be handled by the unmarshaller if it is hidden behind autoboxing? Autoboxing works for primitive types because the resulting object adds no 'meaning' but only changes the 'representation'. But treating a String as an XML document implies that it adheres to the XML specification and the unmarshaller should throw an exception if it does not.

    If unmarshalling is not done during autoboxing, the JAXB unmarshaller will be created and invoked explicitly, like any other object. It will also have to throw some kind of parser exceptions as well as IOExceptions (e.g. when the file containing the XML data is not found or contains invalid XML).

    While the resulting source code will be slightly longer, everyone will understand it without having to know that auto-boxing for XML exists (because it then does not exist ;-) ).

    Sebastian

    Posted by: slohmeier on July 16, 2005 at 05:09 AM

  • Very interesting. I've got three basic questions.

    1. This is obviously targeted at XML. Properties files, databases, preferences, and serialized objects (XML and java.io.Serialized) would seem to be just as valid of targets. Why just XML?

    2. Why base this on fields rather than accessor methods? Accessor methods are more bean and interface friendly. Also, fields imply an implementation to the reader that methods don't. Perhaps java should support the more concise field syntax with all bean properties. Until it does, the field syntax should imply a field implementation everywhere.

    3. Does this really merit a language change? For me, it doesn't right now, but that's not important. How do you determine the needs of millions of developers scattered across the globe?

    Posted by: coxcu on July 16, 2005 at 09:24 AM

  • Kirill,
    A correction on XJ --- it does support working with schema-less
    or untyped XML.
    So,


    new Foo(<foo> ..</foo>) // Only works if Foo is in an imported schema


    new XMLElement(<foo>..</foo>) // does no validation checking, only well-formedness.


    Mukund

    Posted by: raghavac on July 18, 2005 at 08:16 AM

  • I think your approach is very interesting. However, if XML processing is really needed as part of the language, I think it should be really simple.

    IMO if you can define a schema, the compiler should treat the schema as a first-class citizen, without much extra config, and do all the checking etc. - like XJ does. If it is not possible to define a schema, the XML should be treated as 'loosely typed', like in a scripting language. Everything in between is too special to add to the language.

    Especially I do not see much use in adding extra complexity to the language by defining a sort of Java schema through annotated classes. I also do not like to see 'extends' again, which I just managed to get through with generics ;-).

    I am sure there are cases where your proposals could be useful, but in most cases of 'dynamic' XML I think the annotated-class-schema will be too inflexible and you will have to do a lot of unit tests anyway, so that the compiler checking is not so important anymore.

    Christian

    Do you know some cases where this could be really helpful?

    Posted by: chrisichris on July 20, 2005 at 10:33 AM

  • -100

    I am strongly against modifying the Java language for "native XML support". XML support should remain at the API level.

    Gili

    Posted by: cowwoc on July 20, 2005 at 12:31 PM

  • I totally agree with cowwoc.
    It would be wrong to make such a significant change to the language to accommodate one niche of development.
    I much prefer the approach of JetBrains' MPS, which should enable you to extend the language and implement the changes you're suggesting without complicating the existing language.
    One of Java's beauties is that the code is so clean and understandable.
    I think the addition of the features in Java 5 has got us up to, and probably just tipped us over, the Happy User Peak, and any more additions any time soon could well spell the death of the language.

    Graham.

    Posted by: grlea on July 20, 2005 at 07:20 PM

  • Thanks Kirill. Great to see that I'm not alone in my quest for native XML support in Java.

    Posted by: valoxo on July 20, 2005 at 10:37 PM

  • Oh yes, let's make it even bigger and more convoluted!

    Why the heck do people here think that "bigger is always better", want to add everything and the kitchen sink to the language?

    Posted by: jwenting on July 21, 2005 at 01:20 AM

  • have you read

    this?

    Posted by: asjf on July 21, 2005 at 05:39 AM


  • I think embedding XML into Java Language is a great idea - XML is much more expressive and well-structured than plain String's for a start, hence incorporating this into the language would provide much ease for program development ^_^


    As for the proposed syntax, I have a slightly nagging feeling that it would cause some troubles / incompatibilities; I guess we'll need to work a bit harder on that =P


    In fact I am considering the idea of writing a "pre-compiler" that would translate the new syntax so that we can experiment hence work out the scope of how this language feature is going to look like - do you reckon this is a good idea?

    Posted by: alexlamsl on July 21, 2005 at 11:25 AM

  • alexlamsl,
    You are more than welcome to point out the possible troubles / incompatibilities, so we'll be able to discuss them. About the pre-compiler - I'm not too thrilled about the idea of writing one myself, and a little bit skeptical of the advantages of using one. Remember the difficulties that Stroustrup was having with translating exceptions from C++ to C? Eventually he gave up and implemented them in Assembler. You'll have to make sure that the pre-compiler (whose expressive power will be bound by the current syntax) will be able to translate the proposed syntax. If not - you'll have a problem :(

    Posted by: kirillcool on July 21, 2005 at 12:52 PM


  • ok then, let's get started:


    1) The XML syntax doesn't seem to allow for streaming data.


    2) As for the extends keyword - it looks like it will effectively turn off the Java compile-time checks on the associated variable, which could lead to (potentially confusing) ???NotFoundException's at run-time; it feels like the strong type-checking of Java gets weakened there.


    3) The isSimple() etc. methods seem to make the syntax look less useful than the simple, straightforward auto-(un)boxing between String and Object. And I can't immediately see the use of them - since we can just use, in Java terms, instanceof to perform type-specific operations, now that XML data can be "serialised" into java.lang.Object


    For the moment I am considering working on a pair of methods to convert String (or InputBuffer) to Object (or self-defined XMLObject) and vice versa.


    The requirement for them would be to throw only unchecked Exception's - hence leaving the programmers a choice of whether to use a try-catch block. Once their implementations are completed, they can then be used as intrinsic methods for auto-(un)boxing.

    Posted by: alexlamsl on July 21, 2005 at 01:40 PM

  • Before I finished reading, I was on the same path as Coxcu. I don't see why this should be scoped solely to XML. It seems that some sort of interface should be defined that allows other existing and future formats to be included as well.


    Why should this be included at the language level rather than through the PersistenceManager, since this is really what we are talking about doing? I think persistence or I/O in general should be addressed. There are far too many API's already for how data is brought in, out and queried in a Java application. Wouldn't it be nice to iterate, query and persist data without the business logic having to know the protocol/format/channel (JMS, XML, JDBC, CSV, file, stream, http, socket, etc.) Give me an I/O manager that works with a configured DataSource and let me have at it. My business logic shouldn't have to deal with more than that. I should be able to configure a datasource that is an XML format that uses WebDAV to persist to a file/database. I should be able to reconfigure the datasource to be a webservice. I should be able to reconfigure to be a database. It really shouldn't matter to the application how or what I am connected to. The people out there with brilliant minds and the time to do so should try to approach I/O to mapped objects in general.

    If someone wants to deal with data that doesn't have a schema (untyped), it warrants using lower-level iteration methods and skipping object attribute mapping altogether. It seems counterintuitive to me to take a language that is by design strongly typed and start introducing ways to get around the strong typing, creating objects whose composition is unknown until runtime.

    Posted by: rcollette on July 22, 2005 at 08:01 AM

  • rcollette,
    So what should happen with unknown data coming from DataSource that you just wish to ignore? The proposed approach fits well with the DataSource implementation that you are looking for - it has no special XML syntax or constructs, effectively abstracting business layer from the data layer.

    Posted by: kirillcool on July 22, 2005 at 08:20 AM

  • Your first wish, implies that Strings being assigned to any object are not type checked at compile time:

    String xml = ...; // contains the above XML
    Customer cust = xml;

    this is only valid if String xml contains XML, but in general the compiler will have no way of knowing if this is true. The String xml would normally have been built up programmatically or read in.

    Posted by: m_r_atkinson on July 22, 2005 at 10:43 AM

  • While this overall approach towards working with Java and XML may sound like a nice idea at first sight, I would not be surprised to see it fall flat on its face very soon. Once you sit down and actually try to spec out in detail how the system is supposed to work and faithfully preserve the XML spec, a gazillion awkward issues will arise. Good luck.

    Better than defining yet another quite limited adhoc mapping mechanism (this time just on the Java syntax level) would be to support a "real" XML manipulation mechanism such as XQuery, and expose it via a *really* straightforward API (without any of the ridiculous JAXP complexity). For some interesting ideas, see the Java Nux project - http://dsd.lbl.gov/nux and Python Amara. Very easy, yet powerful.

    Finally, it's OK for Sun to provide various tools for Java and XML, but please refrain from bundling this into the J2SE core. Make it an external project with external jars, and have it succeed or fail on its own merits. There's already too much stuff in the JDK that, in hindsight, was a major mistake to include. Some years ago the Sun policy was to include only proven rock-solid stuff in the JDK. These days, the policy seems to be to include arbitrary immature sugar and bloat in response to perceived marketing pressure. In 5 years, will Java just be a pile of mess?

    Posted by: hoschek on July 22, 2005 at 11:11 PM

  • I think including native XML support is not a good idea. I like the "simple syntax, featureful API" philosophy better. You can do Java class - XML binding in many ways, depending on the target task:

    JAXB
    Apache XmlBeans
    XStream
    ... the list goes on...

    However, there'd be a reason to include one or more of these APIs in the Java API, like the Xerces parser.

    Posted by: syntern on July 27, 2005 at 09:32 AM

  • Any attempt to incorporate an alien data format as a native part of a language is often pointless and leads to stupid compromises, inflexibility and lock-in. It would be better to improve the Java persistence mechanism so that many import and export formats can be supported, not just XML. E.g. what happens when Binary XML (http://www.w3.org/XML/Binary/) replaces XML in many applications? More naive calls for first-class objects and lock-in!

    As for the suggestion that String should auto-box to and from a string, that is plain nuts; the toString() method is a major problem. If you want auto-boxing, better to use an intermediary object so that it is known that the data source/sink is for XML. Anyhow, this has already been adequately handled by the XMLDecoder and XMLEncoder classes in the java.beans package!

    I write interfaces for a living, to map from one data format to another, and I can tell you that it is nowhere near as easy as some people naively believe; there is always some mismatch you have to work around. All the arguments I see for not using XSDs and a mapping tool are really excuses not to learn how to use XSD and XSD/object mapping tools properly, and an unwillingness to adapt XSDs to properly map between XML and objects, e.g. even stupidly evolved XML/object inheritance mappings and duplicate named elements can be supported, if you know your xs:group etc.

    Posted by: infernoz on March 25, 2006 at 08:16 AM




