The Sordid Tale of XML Catalogs

Posted by cayhorstmann on December 12, 2011 at 9:14 PM PST

I am finishing the code samples for my book “Scala for the Impatient”. (Yes, for those of you who are impatiently awaiting it—the end is near. Very near.)

In the XML chapter, I started an example with

val doc = XML.load("http://horstmann.com/index.html")
doc \ "body" \ "_" \ "li"

It took several minutes for the file to load. What gives? My network connection wasn't that slow. And neither is the Scala XML parser—it just calls the SAX parser that comes with the JDK.

The problem is DTD resolution. The file starts out with

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So, the parser feels compelled to fetch http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, and rightly so, because it needs to be able to resolve entities such as &auml; in the file.
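
To see the mechanism in isolation, here is a minimal sketch that runs without touching the network. It parses a small in-memory document with the same DOCTYPE, records what the parser asks the handler to resolve, and hands back a one-line stand-in DTD. The class name DtdRequestDemo, the sample text, and the stand-in DTD (which declares only &auml;) are assumptions made for illustration; they are not part of the original example.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DtdRequestDemo {
    // Hypothetical sample document with the same DOCTYPE as the page above
    static final String DOC =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
      + "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
      + "<html><body><p>M&auml;rz</p></body></html>";

    /** Parses DOC, records each resolution request, and returns the text content. */
    static String parse(final List<String> requested) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                requested.add(publicId + " -> " + systemId);
                // Stand-in for the real DTD (an assumption): declares only auml
                return new InputSource(new StringReader("<!ENTITY auml \"&#228;\">"));
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(DOC)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> requested = new java.util.ArrayList<>();
        System.out.println(parse(requested));
        System.out.println(requested);
    }
}
```

The request recorded in the list is the one that, without a resolver, turns into an HTTP fetch from www.w3.org.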

Except, the W3C hates it when people fetch that file, and rightly so—they shouldn't have to serve it up by the billions. It should be up to the platform to cache commonly used DTDs.

My platform, Ubuntu Linux, happens to have a perfectly good infrastructure for caching DTDs. Schema files too. There is a file /etc/xml/catalog that maps public ID prefixes to other catalog files. For example, the prefix "-//W3C//DTD XHTML 1.0" is mapped to /etc/xml/w3c-dtd-xhtml.xml, which maps "-//W3C//DTD XHTML 1.0 Strict//EN" to /usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which maps to the final destination, xhtml1-strict.dtd. I am pretty sure this is the same on other Linux systems too.

So, of course the JDK takes advantage of this infrastructure, right? No—or I wouldn't have had the problem that I described.  Here is what I had to do to make it work.

The JDK takes its SAX implementation from Apache, and Apache has a CatalogResolver class. The JDK has it too, well-hidden at com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver. Ok, let's use it and delegate to it in the regular SAX handler.

import java.net.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import com.sun.org.apache.xml.internal.resolver.tools.*;

public class SAXTest {
   public static void main(String[] args) throws Exception {
      final CatalogResolver catalogResolver = new CatalogResolver();
      DefaultHandler handler = new DefaultHandler() {
         // Look up external entities (including the DTD) in the local catalog
         @Override
         public InputSource resolveEntity(String publicId, String systemId) {
            return catalogResolver.resolveEntity(publicId, systemId);
         }
         @Override
         public void startElement(String namespaceURI, String lname, String qname,
            Attributes attrs) { // the stuff you'd normally do
            if (lname.equals("a") && attrs != null) {
               for (int i = 0; i < attrs.getLength(); i++) {
                  String aname = attrs.getLocalName(i);
                  if (aname.equals("href")) System.out.println(attrs.getValue(i));
               }
            }
         }
      };

      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      SAXParser saxParser = factory.newSAXParser();
      String url = args.length == 0 ? "http://horstmann.com/index.html" : args[0];
      saxParser.parse(new URL(url).openStream(), handler);
   }
}

Does it work? No. The compiler complains that there is no package com.sun.org.apache.xml.internal.resolver.tools. That's bull:

jar tvf /path/to/jdk1.7.0/jre/lib/rt.jar | grep /CatalogResolver
  6757 Mon Jun 27 00:45:14 PDT 2011 com/sun/org/apache/xml/internal/resolver/tools/CatalogResolver.class

Take this, Java:

javac -cp .:/path/to/jdk1.7.0/jre/lib/rt.jar SAXTest.java

It compiles. It runs. (As an aside, this is pretty weird. I didn't realize that the compiler excludes some classes from rt.jar.)

Does it work? No. But there is a useful warning: Cannot find CatalogManager.properties. That's the final missing step. Create a file CatalogManager.properties with the entry

catalogs=/etc/xml/catalog

and put it somewhere on the class path. (No, /path/to/jdk/jre/lib/ext doesn't work, which probably isn't a bad thing.) Or start your app with

java -Dxml.catalog.files=/etc/xml/catalog SAXTest

Did it work? No. It turns out that Linux isn't all that perfect in its XML catalog infrastructure. The catalog.xml file has itself a DTD, like this:

<!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
    "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">

globaltranscorp.org is no longer, so downloading the DTD is futile. But wait: don't we have a perfectly good mechanism for using the public ID and locating the cached copy? The Ubuntu folks put the blame on Apache, and I am inclined to agree with them.

Anyway, the fix is to replace the system ID with "/usr/share/xml/schema/xml-core/tr9401.dtd".

Now it works. But it's ugly. Why can't it work by default? Or at least by default when -Dxml.catalog.files is set?

BTW, I am aware that I can get a CatalogManager implementation from Apache, and that it will likely work fine when mixed with the Java XML implementation. I just feel that I shouldn't have to do that.
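
For readers on Java 9 or later, the JDK ships a public javax.xml.catalog package, so the internal com.sun class is no longer the only built-in option. The following is a minimal sketch under that assumption. The class name CatalogDemo and the helper resolverFor are made up for illustration, and the sketch does not establish whether this API copes with the Ubuntu catalog files and their defunct DTD reference.

```java
import java.net.URI;
import java.nio.file.Paths;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CatalogDemo {
    /** Builds a resolver backed by the given catalog file (hypothetical helper, Java 9+). */
    static CatalogResolver resolverFor(String catalogPath) {
        URI catalogUri = Paths.get(catalogPath).toAbsolutePath().toUri();
        // "continue" makes the resolver return null for unmatched IDs, so the
        // parser falls back to its default behavior instead of failing
        CatalogFeatures features = CatalogFeatures.builder()
            .with(CatalogFeatures.Feature.RESOLVE, "continue")
            .build();
        return CatalogManager.catalogResolver(features, catalogUri);
    }

    public static void main(String[] args) throws Exception {
        // Defaults mirror the paths and URL used in this post
        String catalog = args.length > 0 ? args[0] : "/etc/xml/catalog";
        String url = args.length > 1 ? args[1] : "http://horstmann.com/index.html";

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolverFor(catalog)); // CatalogResolver is an EntityResolver
        reader.setContentHandler(new DefaultHandler()); // replace with a real handler
        reader.parse(new InputSource(url));
    }
}
```

The resolver is installed on the XMLReader rather than delegated to from the handler, which keeps the catalog lookup separate from the content handling code.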

What about other platforms? On the Mac, I found a catalog file at /opt/local/etc/xml. It only had a few DocBook DTDs, not XHTML. I don't know how you add to it (except, of course, manually). In Ubuntu, it's sudo apt-get install w3c-dtd-xhtml. How about Windows? I hope that some of you can tell me.

In Scala, it's a little messier to use the catalog resolver since the parser installs its own SAX handler.  The following works:

import xml._
import java.net._

object Main extends App {
  System.setProperty("xml.catalog.files", "/etc/xml/catalog")

  val res = new com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver

  val loader = new factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter() {
      override def resolveEntity(publicId: String, systemId: String) = {
        res.resolveEntity(publicId, systemId)
      }
    }
  }

  val doc = loader.load(new URL("http://horstmann.com/index.html"))
  println(doc)
}

Don't ask. This doesn't use the documented API, just what I gleaned from reading the source.

Scala users have an alternative parser, ConstructingParser. Does it resolve entities? Nope. It replaces them with useless comments <!-- unknown entity auml; -->. Don't ask.

Overall, this is enough to make grown men cry. In my Google searches, I ran across a good number of apps that maintained their own catalog infrastructure. Caching these DTDs isn't something that every app should have to reinvent. The blame falls squarely on the Java platform here. (In Linux, there are C++ based tools that have no trouble with any of this.) Java should support the catalog infrastructure where it exists, and allow users to manually manage the catalogs and communicate the location with a global setting, not something on the classpath or the command line.

Comments

You say "The blame falls squarely on the Java platform ...

You say "The blame falls squarely on the Java platform here." While I surely agree that the situation in Java's platform (and everybody else's!) is a load of [censored], fundamentally, the W3C is at fault and I consider their overloaded servers to be part of the penance they owe us all for creating this mess. ;) In short: the doctype URI should have been a magnet: URI (or similar; XML predates the magnet: scheme itself by a few years, but the ideas are much, much older), containing 1) a mandatory cryptographic checksum of the DTD and 2) an optional set of URLs for download. The cryptographic checksum would have enabled not just content-addressing-storage but also inline verification that the downloaded DTD was actually what the document expected.

Just for future reference, this is how I tried to work ...

Just for future reference, this is how I tried to work my way out of the mess: blog.flotsam.nl/2011/12/dtd-resolution-be-gone.html

+1, and a very funny read. A few years ago, one of our most ...

+1, and a very funny read.

A few years ago, one of our most important production systems refused to start up after some maintenance work. It was a rock-solid, battle-tried, veteran workhorse, with downtimes per year measured in minutes (at least we thought so). It turned out that the app-server failed to validate one of those silly taglib-descriptors, because - yo man! - java.sun.com was down at the moment. The butterfly effect in chaos theory applied to the fragile world of XML.

Indeed, I wrote an XML app years ago that only after some ...

Indeed, I wrote an XML app years ago that only after some years of use started misbehaving with some users reporting it freezing for a while and throwing a strange error. I finally figured out that it was because the app was reaching out to the network to grab http://www.w3.org/2001/xml.xsd and failing. Wow. Like you, I had to write a whole bunch of boilerplate code to use CatalogManager and LSResourceResolver and stuff just to use a local copy of xml.xsd. Not a good situation.

This is crazy. Thanks for your post. I only realised that ...

This is crazy. Thanks for your post. I only realised that the pause in my app was due to this problem after reading about it here.

By the way, after some research I found this solution: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/#comment-376

SAXParserFactory factory = SAXParserFactory.newInstance();

factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Best regards,

Keith