A spoonful of Scala

Posted by cayhorstmann on August 11, 2009 at 9:22 PM PDT

I write my lecture slides in XHTML, using the marvelous HTML Slidy package. I just dump
the images into the same directory as the HTML files, which isn't so smart
because it makes it hard to copy a presentation from one directory to another.
I could change my habit, but hey, what is technology for? A couple of years ago
I decided to write a script that simply generates a list of all images in an
HTML file, so I can run

cp `images 01-intro.html` somewhere

Piece of cake, right? Just look for <img src="foo.jpg" .../>. Now I could just use a
regular expression. But, as Jamie Zawinski said, “Some people, when
confronted with a problem, think ‘I know, I'll use regular
expressions.’ Now they have two problems.”

Wise words indeed. I could spend a long time fussing with problems such as
<img (newline) src=. Of course, the right thing to
do is to use an XML parser, and the manly thing to do is to use XSLT. After
more pain than seemed warranted, I came up with this XSLT script.

<xsl:stylesheet version = '1.0'
      xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
      xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>
   <xsl:template match="html:img">
      <xsl:value-of select="@src"/>
      <xsl:text>&#10;</xsl:text>
   </xsl:template>
   <xsl:template match="@* | node()">
      <xsl:apply-templates select="@* | node()"/>
   </xsl:template>
</xsl:stylesheet>

Do not ask me about it. I do not want the pain to recur.

It worked fine for a couple of years, but this morning it broke.

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

WTH? I pasted the URL into my browser, and it worked just fine. Next, I
tried telnet.

$ telnet www.w3c.org 80
Trying 128.30.52.45...
Connected to dolph.w3.org.
Escape character is '^]'.
GET /TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.0

HTTP/1.1 503 Service Unavailable due to Unknown abuse from requesting IP
...
<h1>Forbidden due to abuse</h1>

<p>We are most interested in finding the source of this particular
abuse.  Please <a href="mailto:web-human+unknown-abuse@w3.org">contact
us</a> if you have any details as to the client software running
(browser, web crawler, other), what it was requesting, who your
provider is or are willing for us to follow up with you and try to get
details.</p>
...
Connection closed by foreign host.

Apparently, the W3C decided to crack down on programs that just fetch a DTD
from its server. If the user agent is Java, not Mozilla, you get an Error 503.
Of course, I don't actually need the DTD. No problem, I just use one of these
magic incantations for the parser, or maybe the parser factory, or the parser
factory configuration, or the parser factory configuration assembly. As the
designers of the SAX API demonstrate so vividly, any problem in computer
science can be amplified with another level of indirection.

Except I didn't write a program that used the SAX API. I am no Evel Knievel. I just
invoked Xalan on the command line. And I do not have the intestinal fortitude
for figuring out its command line options
(http://xml.apache.org/xalan-j/commandline.html).
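
For readers who do call the parser from Java, here is a rough sketch of one such incantation: an entity resolver that answers every DTD request with an empty document, so nothing is fetched from the W3C. The class name ImageLister and the method imageSources are made up for this illustration, and the sketch assumes the slides do not rely on entities that only the DTD declares.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class ImageLister {
    /** Returns the src attribute of every img element without fetching a DTD. */
    public static List<String> imageSources(InputSource xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML elements live in a namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request (such as the DTD) with an
        // empty document instead of going to the network.
        builder.setEntityResolver((publicId, systemId) ->
            new InputSource(new StringReader("")));
        Document doc = builder.parse(xhtml);
        // "*" matches img elements in any namespace.
        NodeList images = doc.getElementsByTagNameNS("*", "img");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < images.getLength(); i++) {
            result.add(((Element) images.item(i)).getAttribute("src"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String uri = new File(args[0]).toURI().toString();
        for (String src : imageSources(new InputSource(uri))) {
            System.out.println(src);
        }
    }
}
```

Run as `java ImageLister 01-intro.html`, this should print one image name per line, which is all the `cp` command needs.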

Then I remembered that Scala can process XML natively.

My first attempt failed miserably:

val x = scala.xml.XML.loadFile("01-intro.html")
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Ugh, the Scala library uses Xerces, just like Xalan does.

But they have another parser, and that one worked fine.

import java.io.File
import scala.xml.parsing.ConstructingParser
val doc = ConstructingParser.fromFile(new File("01-intro.html"), true).document
doc \\ "img" \\ "@src" foreach println

Two lines of Scala made the medicine go down...

So, what is the moral of all this?

  • The Scala interpreter is your friend. It took me a few minutes of futzing
    around to get this to work, and they were not painful minutes because I got
    positive feedback along the way, not just the familiar series of error
    messages that tell me “You suck”. Having an interpreter is
    great for experimenting with an unknown API.
  • XML is really bad for describing behavior. It should not take twelve
    lines of gobbledygook to express
    doc \\ "img" \\ "@src" foreach println.
  • Operator overloading is (gasp) not evil. I am vaguely familiar with
    XPath, and the Scala \\ looks like XPath //
    (which, for obvious reasons, they couldn't have taken verbatim :-)), so the
    learning curve was minimal.
  • Learning a new programming language is hard. You have to make yourself do
    it. I resolved to write all my little utility programs in Scala until my
    fingers know it as well as they know Java.
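
For comparison with the XPath remark in the list above, this is roughly what the same query looks like through the standard javax.xml.xpath API in Java. It is only a sketch: the class name XPathImages and the method imageSources are invented for the illustration, and local-name() is used to sidestep the XHTML namespace. The input is assumed to have no DOCTYPE, since evaluating against an InputSource goes through the default parser and would presumably trigger the same DTD fetch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class XPathImages {
    // The XPath counterpart of  doc \\ "img" \\ "@src"
    static final String QUERY = "//*[local-name()='img']/@src";

    /** Returns the value of every src attribute on an img element. */
    public static List<String> imageSources(InputSource in) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList attrs =
            (NodeList) xpath.evaluate(QUERY, in, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            result.add(attrs.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<img src=\"foo.jpg\"/></body></html>";
        for (String src : imageSources(new InputSource(new StringReader(xml)))) {
            System.out.println(src);
        }
    }
}
```

The query itself is one line, just as in Scala. The ceremony is all in the factory, the cast, and the loop around it.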
Comments

That's really great. XPath would have worked too. `xpath 01-intro.html '//img/@src'`

@Cay: I know that you know, of course :-) Mine was a general observation, as I see many systems around that simply rely on network availability.

Next tip: Add that catalog to your NetBeans catalog set! http://sinewalker.wordpress.com/2009/05/12/registering-local-dtds-or-xml... (Note: The latest NB releases allow you to add your XML Catalog file to the set of internal XML Catalog files.) OTOH, I think the W3C should offer downloads of the different DTDs and Schemas they support. After all, it's not very fair of them to complain about abuse and bandwidth consumption while not having a single page where you can download the DTDs and XML Schemas for all those different specifications.

@vieiro: Thanks for the tip! That worked great, even though ironically the Ubuntu package's catalog.xml has a system ID of "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd" that gave a DNS error since apparently that organization no longer exists. I fixed it to a local file (https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259). It is a little disconcerting that the command line to invoke Xalan is longer than the Scala program.

It seems this is not new (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic). You probably need a good XML Catalog somewhere (if you use Linux, then apt-get install w3c-dtd-xhtml; otherwise get a zip file with all those DTDs/XML Schemas and set it up). And then take a look at http://www.sagehill.net/docbookxsl/UseCatalog.html for details on how to set up your stuff to use that catalog. You'll save lots of time when validating your stuff. It really pays off.

Similar story: I had to do some maintenance on a very old Tomcat installation that had been up and running for years. A few seconds after the restart, Tomcat exploded in midair. It turned out that it was eagerly trying to validate some taglib descriptors, and java.sun.com was down at the moment. Really scary.

Fabrizio: I know the right solution is to cache the DTDs. Tell the Xalan folks! They should have an option for specifying a cache directory. As far as I can tell, the Xalan way for a local cache is to implement a resolver in Java, put it onto the class path, and pass the class name with the -URIRESOLVER command line option. That's when I decided to cut my losses and use Scala.
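
A minimal sketch of what such a resolver class might look like, under several assumptions: the class name CachedDtdResolver and the system property dtd.cache are invented for the illustration, and the cache directory is assumed to already hold copies of the DTD and the entity files it references, stored under their original file names. Since DTD lookups go through org.xml.sax.EntityResolver, that is the interface used here. Xalan's command line documentation lists an -ENTITYRESOLVER option alongside -URIRESOLVER, and I assume that is the one that would apply.

```java
import java.io.File;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that serves DTDs from a local directory.
public class CachedDtdResolver implements EntityResolver {
    private final File cacheDir;

    // No-argument constructor so the class can be instantiated by name.
    // The "dtd.cache" property is an assumption of this sketch.
    public CachedDtdResolver() {
        this(new File(System.getProperty("dtd.cache", "dtd-cache")));
    }

    public CachedDtdResolver(File cacheDir) {
        this.cacheDir = cacheDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null;
        }
        // Look up the last path segment, e.g. "xhtml1-strict.dtd".
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        File cached = new File(cacheDir, name);
        if (!name.isEmpty() && cached.isFile()) {
            return new InputSource(cached.toURI().toString());
        }
        // Not cached: returning null leaves the parser's default behavior.
        return null;
    }
}
```

Matching on the file name alone is crude. A catalog, as suggested in the other comments, maps public and system IDs properly.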

But couldn't you just install the other parser in Java too? Or is it made for Scala only? In any case, this is scary, and I'm not thinking only of Java. It's that we are getting used to architectures where everything is distributed and available on demand, while forgetting that the network can fail; so, we are introducing failure points everywhere. BTW, this kind of error could be easily induced by a DoS attack against the W3C. The proper solution is to have local caches.