Eduardo Pelegri-Llopart

Eduardo Pelegri-Llopart's Blog

Simple, Fast, no-Loss binary XML - Fast Infoset

Posted by pelegri on June 04, 2004 at 08:13 AM | Comments (8)

XML has some very nice properties, but its textual encoding is verbose. That is not a problem in many applications, but it is a real issue in others, especially when large documents are transmitted across a slow communication link, or when many documents are sent. For instance, traditional Web Services (WS) messages are sent encoded as textual XML over HTTP; as WS are adopted more and more widely, I believe deployers will expect efficiencies comparable to RMI.

There have been a number of attempts to address this problem by using some sort of binary encoding of XML. But the benefits of these approaches have not been researched carefully, and the lack of a standard has hindered their adoption. I was involved in a group that investigated this problem some time ago; the explicit goal was to match RMI performance using WS interfaces, and we discovered that if we exploited the type (schema) information in the WSDL we got very close to that goal. We named the approach Fast Web Services, and it is being standardized in ISO/ITU-T.

Fast Web Services relies on Schema information, but in some applications the Schema is not available (and even when a Schema is available, it does not help wherever type Any appears), so a complementary solution is needed. Also, Fast Web Services does not preserve the XML infoset (think of "sending the content, not the form"), while in many applications preserving it is key. Both requirements are addressed by a technology we call Fast Infoset. The solution has good performance characteristics and is not hard to implement, and Fast Infoset is also being standardized at ISO/ITU-T.

One way to think of Fast Infoset is as GZIPed XML. It has the same property: you only need to know that the document is encoded in order to recover the original. The main difference is that Fast Infoset is customized for XML and leads to better encoding and decoding times. Check out the article by Paul, Alessandro and Santiago to get all the details.
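
To make the GZIP analogy concrete, here is a minimal sketch in Java using only the standard `java.util.zip` classes. It is not Fast Infoset and does not use any Fast Infoset API; it only illustrates the shared property that a receiver who knows the encoding can recover the original document exactly. The class name `GzipXmlRoundTrip` and the sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration only: plain GZIP over textual XML, not the Fast Infoset encoding.
class GzipXmlRoundTrip {

    // Compress the UTF-8 bytes of an XML document with GZIP.
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Recover the original text; the receiver only needs to know it is GZIP.
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical, deliberately repetitive document.
        StringBuilder xml = new StringBuilder("<orders>");
        for (int i = 0; i < 200; i++) {
            xml.append("<order><id>").append(i).append("</id></order>");
        }
        xml.append("</orders>");

        byte[] encoded = compress(xml.toString());
        String decoded = decompress(encoded);

        System.out.println("text bytes:    "
            + xml.toString().getBytes(StandardCharsets.UTF_8).length);
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("lossless:      " + decoded.equals(xml.toString()));
    }
}
```

The size saving here comes from generic compression; the post's claim is that an XML-specific encoding can do the same job with better encoding and decoding times.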

I believe that both Fast Infoset and Fast Web Services are useful; we will find out how much when the standards are finalized later this year and we start seeing implementations. There is also a W3C Working Group on XML Binary Characterization that will consider the role of this and other technologies.


Comments
Comments are listed in date ascending order (oldest first)

  • RMI is a dog
    You have got to be kidding me. If fast web services is going to have "efficiencies comparable to RMI" it can only be slower than real XML. RMI is a notorious performance dog. It is routinely outperformed by XML-RPC, SOAP, and other more flexible solutions.

    Posted by: elharo on June 05, 2004 at 05:21 AM

  • RMI is a dog
    I suspect you are referring to problems with the serialization of objects in early implementations.

    If you have comparative benchmarks of modern implementations, I would be interested in seeing them.

    Posted by: pelegri on June 05, 2004 at 07:49 AM

  • Compound Transactions,Documents,Streams,Proxies
    From the W3C public discussion forum for the topics of Web Applications and Compound Documents...
    Compound Transactions, Documents, Streams, Proxies: Some thoughts

    Separating Transactions From Content Delivery

    In comparison to content delivery over HTTP, FTP and P2P methods, web
    service transaction protocols are a less efficient means for delivering
    large or complex content. Internet connections are not wholly reliable
    and large file transfers regularly fail. In comparison to resuming an
    HTTP,FTP or P2P download, it is inefficient to repackage the content in
    a new SOAP transaction or expect the web service server to hold on to
    the transaction waiting for client reconnection. It is better to
    separate the transaction from the content delivered, by passing URIs
    and maybe decryption keys in the web service transaction, letting the
    client system fetch the content. Using straight HTTP also allows ISPs
    and organizations to take advantage of transparent proxy caching. Using P2P
    allows content providers to greatly reduce bandwidth costs.

    Compound Documents Revisited

    Embedding binary content inside text based XML is wasteful and even text
    based content embedded within XML requires some process to embed or
    extract the embedded content. Why not just keep content separate at the
    packaging level. One solution is to ship an archived directory of files
    in a binary format, for example a zipped packaged directory. This
    solution was suggested back in 1999.

    http://lists.xml.org/archives/xml-dev/199902/msg00101.html

    It is easy to "peer into" and "grab" the content of a zip file,
    Java classes and C libraries [and Apache modules] that can do this
    already exist.
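
As a rough sketch of what "peering into" a zipped compound document can look like in Java, the following uses only the standard `java.util.zip` classes. The class name `ZipPeek` and the entry names `index.html` and `data/payload.bin` are hypothetical; the archive is built in memory so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustration only: text and binary parts kept as separate zip entries.
class ZipPeek {

    // Bundle named parts into a single zip archive held in memory.
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    // Walk the archive and pull out every entry by name.
    static Map<String, byte[]> unpack(byte[] archive) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            byte[] chunk = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("index.html",
            "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8));
        parts.put("data/payload.bin", new byte[] {0, 1, 2, 3, (byte) 0xFF});

        Map<String, byte[]> recovered = unpack(pack(parts));
        for (Map.Entry<String, byte[]> part : recovered.entrySet()) {
            System.out.println(part.getKey() + " -> "
                + part.getValue().length + " bytes");
        }
    }
}
```

The binary entry is stored as raw bytes, with no recoding into text, which is the contrast with embedding binary content inside XML.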

    When the www-xml-packaging group formed in July 2000, after a little
    prompting ...

    http://lists.w3.org/Archives/Public/www-xml-packaging/2000Jul/0004.html

    ... a zip/jar type archive solution faced little real competition from
    similar schemes that recode and embed binary content in XML.

    In fact, the zipped/jarred compound document was the solution adopted
    for all of Sun's OpenOffice.org / StarOffice document formats.

    Compound Streams Introduced

    In cases where the embedded content is generated by the same process, or
    the content is better served in a timely manner, i.e. streamed
    audio/video, then why not create a multichannel binary stream-able
    format, like Vorbis's OGG format, to carry the content over HTTP or any
    other existing streaming protocol. Once delivered the resulting single
    file could be cached or saved on the client side.

    Implementing Compound Documents and Streams for Client Side Web Browsers

    Any introduction of a compound document format or compound stream format
    would require either modification of client side browser or the use of a
    proxy server which expands and separates the content and delivers it to
    the conventional client.

    For both the Jar'ed/Zip'ed compound document and the compound stream
    format, a client side HTTP proxy could download the archive or stream
    and expand the content delivering it to the user's existing web browser.

    To the web browser the expanded content looks as if it originates from
    "http://www.contentprovider.uri/basedirectory/compoundfilename.affix/"
    with meta info inside the archive/stream defining the base URI as
    "http://www.contentprovider.uri/basedirectory/"
    and another file "index.html" being the base HTML content. The other
    content would appear to be relative to the base URI. This means that the
    content can still link and applets can interoperate with the website
    with the same browser scripting security privileges.

    Using the proxy system also introduces the possibility of
    transparently including Peer to Peer systems to save the content
    provider bandwidth costs.

    Embedded web pages may contain a URI to the compound document, without
    the following slash,
    "http://www.contentprovider.uri/basedirectory/compoundfilename.affix",
    so the user may save the compound document to their file system.

    Using unique filename affixes and/or mime declarations, the desktop
    operating environment can "associate" the compound document formats with
    the client HTTP proxy server.

    Implementing Compound Documents and Streams on the Server Side

    As with the client side, any introduction of a compound document format
    or compound stream format would require either modification of the
    server or the use of a proxy system to gather the separate content and
    bundle it together.

    For Jar'ed/Zip'ed compound documents it should be possible for a proxy
    system to "request" the content from a conventional web server. The
    proxy would then just zip up the resulting directory of files and send
    it to the client. Compound streams could be served using the same method
    but multi-threaded, delivering the content in real time.

    Document Object Model Access

    The proxy system with the recent Load and Save recommendation

    http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html
    could be used to access and even change content embedded within
    compound documents and compound streams.

    David Mohring

    Posted by: nzheretic on June 05, 2004 at 08:01 AM

  • Hessian - Burlap
    I wonder how this compares to Hessian-Burlap.

    http://www.caucho.com/hessian/

    Posted by: ahmetaa on June 07, 2004 at 08:37 AM

  • RMI is a dog
    You're kidding, right? RMI has routinely outperformed SOAP every time I've been made to benchmark the two in a work environment. XML-RPC I haven't investigated before except for experiments over Jabber, but I would be interested to see if it's any better than SOAP.

    It's interesting to see someone who considers XML's considerable parsing and serialisation overhead as a performance feature. :-)

    Posted by: trejkaz on June 07, 2004 at 06:06 PM

  • RMI is a dog

    A whole load of crap. XML has its uses but it is pretty much guaranteed to be the slowest dog in the pack.

    1) XML is extremely verbose. Transferring the documents alone takes many times longer than binary coded messages.

    2) Parsing XML is extremely costly. You require a pretty darn sophisticated parser to do it fast and even then it is going to be orders of magnitude slower than parsing binary messages.

    There are many solutions to the slow dog problem of XML, fast infosets being one. However, to me it is like curing the symptoms but not the disease. If you need something fast, it is not XML, period. A vast majority of all this XML talk is hype and is not based on anything concrete. Time will tell.

    Posted by: tvaananen on June 08, 2004 at 05:06 AM

  • RMI is a dog
    I think that (most) developers are lazy (or efficient, if you wish) and will reuse concepts whenever they can. This is (part of) why Servlets/JSP have been used for so many applications, rather than EJBs (I said "part of" :-)). I believe the same will happen with the WS APIs: developers will use them in many cases instead of other, perhaps intrinsically more efficient, technologies.

    On the value of XML, the key observation to me is the role of XML in supporting the paradigm of "document is the truth". That paradigm is different from "data is the truth". In a "document is the truth" model, a participant in a process can act on the document even if it has only very partial information about that document, *without* altering the ability of other participants, with other knowledge, to act on the document. That makes a real difference in many cases. Just as one example, this is why some XML processing can be quite resilient to changes: an XPath expression describes some partial knowledge of the XML document, and will apply even if the document received has changed in ways unrelated to the XPath expression.
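
A small sketch of the XPath point, using the standard `javax.xml.xpath` API. The document shape (`order`, `total`, `customer`, `note`) and the class name `PartialKnowledgeXPath` are invented for illustration; the only thing this "participant" knows about the document is the path `/order/total`.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: a reader that knows a single path into the document.
class PartialKnowledgeXPath {

    // The participant's entire knowledge of the document (hypothetical path).
    static final String TOTAL_PATH = "/order/total";

    // Evaluate the path against the XML text and return the matched string value.
    static String readTotal(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(TOTAL_PATH, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        String original = "<order><total>42.00</total></order>";

        // Other participants added elements this reader knows nothing about.
        String extended = "<order><customer id=\"c7\"/>"
            + "<total>42.00</total><note>rush</note></order>";

        System.out.println(readTotal(original)); // prints 42.00
        System.out.println(readTotal(extended)); // prints 42.00
    }
}
```

The same expression keeps working on the extended document because the additions are unrelated to the path it selects. A change that moved or renamed `total` would of course break it.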

    We should do a technical forum on "document is truth" to have a longer discussion on this.

    Posted by: pelegri on June 08, 2004 at 07:26 AM

  • Hessian - Burlap
    From what I can see...

    Hessian is a "traditional" RPC mechanism, except that it does not support any IDL (interface definition language). In particular, Hessian does not do anything XMLy and it does not support WSDL. I think Hessian is not really about WS, at least by most definitions of WS.

    No idea how the performance would compare.

    But I just had a quick look at Hessian, so all disclaimers apply...

    Posted by: pelegri on June 10, 2004 at 05:21 PM




