A spoonful of Scala
Posted by cayhorstmann on August 11, 2009 at 9:22 PM PDT
I write my lecture slides in XHTML, using the marvelous HTML Slidy package. I just dump the images into the same directory as the HTML files, which isn't so smart because it makes it hard to copy a presentation from one directory to another. I could change my habit, but hey, what is technology for? A couple of years ago I decided to write a script that simply generates a list of all images in an HTML file, so I can run
cp `images 01-intro.html` somewhere
Piece of cake, right? Just look for Wise words indeed. I could spend a long time fussing with problems such as
<xsl:stylesheet version = '1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:html="http://www.w3.org/1999/xhtml">
<xsl:output method="text"/>
<xsl:template match="html:img">
<xsl:value-of select="@src"/>
<xsl:text>&#10;</xsl:text>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:apply-templates select="@* | node()"/>
</xsl:template>
</xsl:stylesheet>
Do not ask me about it. I do not want the pain to recur. It worked fine for a couple of years, but this morning it broke.
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
WTH? I pasted the URL into my browser, and it worked just fine. Next, I tried telnet.
$ telnet www.w3c.org 80
Trying 128.30.52.45...
Connected to dolph.w3.org.
Escape character is '^]'.
GET /TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.0
HTTP/1.1 503 Service Unavailable due to Unknown abuse from requesting IP
...
<h1>Forbidden due to abuse</h1>
<p>We are most interested in finding the source of this particular abuse. Please <a href="mailto:web-human+unknown-abuse@w3.org">contact us</a> if you have any details as to the client software running (browser, web crawler, other), what it was requesting, who your provider is or are willing for us to follow up with you and try to get details.</p>
...
Connection closed by foreign host.
Apparently, the W3C decided to crack down on programs that just fetch a DTD from its server. If the user agent is Java, not Mozilla, you get an Error 503. Of course, I don't actually need the DTD. No problem, I just use one of these magic incantations for the parser, or maybe the parser factory, or the parser factory configuration, or the parser factory configuration assembly. As the designers of the SAX API demonstrate so vividly, any problem in computer science can be amplified with another level of indirection.
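For a program that does talk to the parser API directly, one such incantation might look roughly like the sketch below. It is only an illustration, not the script described in this post: the class name ImageLister and the method imageSources are made up, and it assumes that answering every DTD request with an empty stream is acceptable because the DTD is not needed.

```java
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and structure are assumptions for illustration.
final class ImageLister {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the src attribute of every XHTML img element, in document order.
    static List<String> imageSources(InputSource in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request (including the DTD) with an
        // empty stream, so the parser never opens a network connection.
        builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
        Document doc = builder.parse(in);
        NodeList imgs = doc.getElementsByTagNameNS(XHTML_NS, "img");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < imgs.getLength(); i++) {
            result.add(((Element) imgs.item(i)).getAttribute("src"));
        }
        return result;
    }

    // Usage: java ImageLister.java 01-intro.html
    public static void main(String[] args) throws Exception {
        String systemId = Paths.get(args[0]).toUri().toString();
        for (String src : imageSources(new InputSource(systemId))) {
            System.out.println(src);
        }
    }
}
```

Since the DTD is replaced by an empty one, a page that relies on entities declared only in the DTD, such as &nbsp;, would need a real local copy instead.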
Except I didn't write a program that used the SAX API; I am no Evel Knievel. I just invoked Xalan on the command line. And I do not have the intestinal fortitude for figuring out its command line options. Then I remembered that Scala can process XML natively. My first attempt failed miserably:
val x = scala.xml.XML.loadFile("01-intro.html")
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
Ugh, the Scala library uses Xerces, just like Xalan does. But they have another parser, and that one worked fine.
val doc = scala.xml.parsing.ConstructingParser.fromFile(new java.io.File("01-intro.html"), true).document
doc \\ "img" \\ "@src" foreach println
Two lines of Scala made the medicine go down...
So, what is the moral of all this?
Comments are listed in date ascending order (oldest first)
Submitted by fabriziogiudici on Wed, 2009-08-12 01:23.
But couldn't you just install the other parser in Java too? Or is it made for Scala only?
In any case, this is scary and I'm not thinking of Java. It's that we are getting used to architectures where everything is distributed and available on demand, while forgetting that the network can fail; so, we are introducing failure points everywhere. BTW, this kind of error could be easily induced by a DoS attack against the W3C. The proper solution is to have local caches.
Submitted by cayhorstmann on Wed, 2009-08-12 05:27.
Fabrizio: I know the right solution is to cache the DTDs. Tell the Xalan folks! They should have an option for specifying a cache directory. As far as I can tell, the Xalan way for a local cache is to implement a resolver in Java, put it onto the class path, and pass the class name with the -URIRESOLVER command line option. That's when I decided to cut my losses and use Scala.
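A resolver of that kind might look roughly like the following sketch. It is written as a SAX EntityResolver, on the assumption that this is the hook through which the parser asks for DTDs; the class name LocalDtdResolver and the caller-supplied map from system IDs to local files are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical class name; the caller supplies the map of system IDs to local copies.
final class LocalDtdResolver implements EntityResolver {
    private final Map<String, Path> localCopies;

    LocalDtdResolver(Map<String, Path> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = localCopies.get(systemId);
        if (local == null) {
            // Unknown ID: returning null lets the parser use its default lookup.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        // Keep the original system ID so that relative references inside
        // the DTD are still resolved against it.
        source.setSystemId(systemId);
        return source;
    }
}
```

The XHTML DTDs pull in further entity files by relative reference, so the map would need an entry for each of those as well.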
Submitted by scotty69 on Wed, 2009-08-12 05:39.
Similar story: I had to do some maintenance on a very old Tomcat installation which had been up and running for years. A few seconds after the restart, Tomcat exploded in midair. It turned out that it was eagerly trying to validate some taglib descriptors, and java.sun.com was down at the moment. Really scary.
Submitted by vieiro on Wed, 2009-08-12 07:25.
It seems this is not new (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic)
You probably need a good XML Catalog somewhere (if you use Linux then apt-get install w3c-dtd-xhtml otherwise get a zip file with all those DTDs/XML Schemas and set it up).
And then take a look at http://www.sagehill.net/docbookxsl/UseCatalog.html for details on how to set up your stuff to use that catalog.
You'll save lots of time when validating your stuff. It really pays off the effort.
Submitted by cayhorstmann on Wed, 2009-08-12 10:09.
@vieiro: Thanks for the tip! That worked great, even though ironically the Ubuntu package's catalog.xml has a system ID of "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd" that gave a DNS error since apparently that organization no longer exists. I fixed it to a local file (https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259)
It is a little disconcerting that the command line to invoke Xalan is longer than the Scala program.
Submitted by vieiro on Wed, 2009-08-12 10:51.
Next tip: Add that catalog to your NetBeans catalog set!
http://sinewalker.wordpress.com/2009/05/12/registering-local-dtds-or-xml...
(Note: Latest NB releases allow you to add your XML Catalog file to the set of internal XML Catalog files).
OTOH: I think the W3C should offer downloads of the different DTDs and Schemas they support. After all, it's not very fair of them to complain about abuse and bandwidth consumption while not having a single page where you can download the DTDs and XML Schemas for all those different specifications.
Submitted by fabriziogiudici on Wed, 2009-08-12 22:55.
@Cay: I know that you know, of course :-) Mine was a generic consideration as I see around many systems just relying on the network availability.
Submitted by guymac on Mon, 2009-08-31 18:35.
That's really great. XPath would have worked too. `xpath 01-intro.html '//img/@src'`