XML Processing with Scala

Posted by cayhorstmann on May 16, 2010 at 8:31 AM PDT

A few months ago, I had one of those unpleasant format conversion jobs. I had about 1,000 multiple choice questions in RTF format and needed to import them into Moodle.

RTF is, as file formats go, somewhere between the good and the evil. It looks like one should be able to write a parser for it, but that seems like a dreary task. The miracle of open source came through for me, though, in the rtf2xml project. Paul Tremblay authored a converter that faithfully converts RTF to XML, where you can process it with your usual XML tool chain. I just love it when someone else's labor saves me many hours of drudgery. Thanks, Paul—if we ever meet, I will gladly buy you a beer :-)

My first inclination was to use XSLT to transform the result into Moodle XML format. But I quickly realized that I would have gone insane in the process.

The XML was a festering mess, because it truthfully reflected the festering mess in the RTF files. The RTF files were, of course, produced from a Microsoft Word document. Apparently, few people know how to use Microsoft Word in an intelligent way, with character and paragraph styles. The authors of my files were no exception—they treated Word as a glorified IBM Selectric typewriter.

Monospace text was expressed in four different ways, spaces inside code were styled as Times New Roman, and sequences of code lines were never grouped into anything resembling a “preformatted” entity.

I remembered that Scala has XML as a built-in type, and I figured anything is better than using org.w3c.dom (which always seemed to me like eating soup with a fork). So I built my converter with Scala, and I am glad I did, particularly when the conversion problems got messier than I had at first anticipated.

In Scala, you can express XML natively, like this:

val lineOfCode = <code>println("Hello, World!")</code>

More importantly, you can “interpolate” Scala expressions, using braces:

val command = "println(\"Hello, World\");"
val lineOfCode = <code>{command}</code>

In fact, since the Scala expression can again contain XML, you can go back and forth between Scala and XML a couple of times, which sounds weird, but is actually useful.
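As a minimal sketch of that back-and-forth, the snippet below nests a Scala expression inside an XML literal, and XML inside that expression again. The sample lines and the names NestingDemo, lines, and listing are made up for illustration, and the sketch assumes the scala-xml module is on the classpath (it has been a separate module since Scala 2.11).

```scala
import scala.xml.Elem

object NestingDemo {
  // Hypothetical sample data, not taken from the converter described here.
  val lines = List("val x = 1", "println(x)")

  // XML literal -> Scala expression in braces -> XML literal -> Scala expression.
  // The sequence of <code> elements is spliced in as children of <pre>.
  val listing: Elem =
    <pre>{ lines.map(line => <code>{ line }</code>) }</pre>
}
```

Evaluating NestingDemo.listing in the REPL should show <pre><code>val x = 1</code><code>println(x)</code></pre>.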

This page by Burak Emir, the author of the Scala XML library, has a nice overview that was a bit more comprehensive than what you can find in this or this otherwise admirable book. Here is what I needed to know:

  • You use overloaded \ and \\ for XPath expressions (since // is already used for comments :-))
  • I encountered NodeSeq items all the time. These are XML fragments such as <p>Behold this code:</p><pre>println("Hello, World!")</pre>. You can take them apart with for (s <- seq) in the usual way. To build them, I used code like this:
       for (p <- doc \ "para")
          yield cleanupPara(p, code) // cleanupPara returns a <p> element
  • Attribute handling was a bit of a hassle since attributes can contain entities. I just used tests of the form
    if (inline.attribute("italics").getOrElse("").toString() == "true")
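The three points above can be tried together in a small sketch. Everything here is an assumption for illustration: the input document, its element and attribute names, and a one-argument cleanupPara that merely stands in for the real (unshown) function. It again assumes scala-xml is on the classpath.

```scala
import scala.xml.{Elem, Node, NodeSeq}

object QueryDemo {
  // Hypothetical input; element and attribute names are made up.
  val doc: Elem =
    <doc>
      <para>Behold this <inline italics="true">code</inline>:</para>
      <para><inline>println("Hello, World!")</inline></para>
    </doc>

  // \ selects direct children, \\ selects descendants at any depth.
  val paras: NodeSeq = doc \ "para"
  val inlines: NodeSeq = doc \\ "inline"

  // The attribute test from the list above. attribute returns an Option,
  // so a missing attribute falls back to the empty string.
  def isItalic(node: Node): Boolean =
    node.attribute("italics").getOrElse("").toString() == "true"

  // Simplified stand-in for cleanupPara: wraps the paragraph text in <p>.
  def cleanupPara(p: Node): Elem = <p>{ p.text.trim }</p>

  // Building a new sequence of elements with for/yield.
  val cleaned = for (p <- paras) yield cleanupPara(p)
}
```

In the REPL, QueryDemo.inlines.length and QueryDemo.isItalic(QueryDemo.inlines.head) are quick ways to check that a query picks up what you expect before wiring it into a larger transform.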

250 lines of code (some of which admittedly look like random line noise) solved my problem. What made it simple and fun is the Scala REPL. I experimented with various queries and transforms in the REPL, and whenever one of them worked, I pasted it into my program.

I didn't think I would ever need this again, but a few months later, my publisher said that we really needed to get rid of FrameMaker for the next edition of Core Java. The Safari source of the book was surprisingly rational XML (unlike the bizarre XML dialect used by my other publisher). I could have edited it with something like XMLMind, but I don't think it's a good use of my time to fuss with a proprietary XML dialect. I suggested converting it to XHTML, using divs and styles as necessary to keep the structure. With XHTML, I have my choice of editors (my current favorite is Amaya), and I can use PrinceXML to make PDFs for reviewers. The final book production will be handled by an XML shop that knows how to turn just about any XML into a printed book.

My graduate student Swathi Vegesna happened to ask if I had some work for her, so I suggested that she write a Scala program for this translation. She had no guidance other than my other Scala program, Burak Emir's documentation, and, of course, Google. I was a little doubtful whether this was going to work out, but within a couple of weeks of work she produced about 450 lines of code that did the job.

Interestingly, she didn't use for (s <- seq) yield expr but the more functional seq.flatMap(s => expr), which she must have discovered independently or found through Google. Good thing she didn't Google for flatMap—the first hit is this article: “Coming straight from the menacing jungles of category theory and the perplexing wasteland of monads, flatMap is both intriguing and apparently useless”.
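To see the two styles side by side, here is a small sketch under made-up assumptions: the document, the cleanup function, and the object name are hypothetical and not taken from either converter. The flatMap form works with a function that returns a single node because in scala-xml a Node is itself a NodeSeq.

```scala
import scala.xml.{Elem, Node}

object FlatMapDemo {
  // Hypothetical input document.
  val doc: Elem = <doc><para>one</para><para>two</para></doc>

  // Hypothetical cleanup step: turn each <para> into a <p>.
  def cleanup(p: Node): Node = <p>{ p.text }</p>

  // The for/yield style.
  val viaFor = for (p <- doc \ "para") yield cleanup(p)

  // The flatMap style; each result node is treated as a one-element sequence.
  val viaFlatMap = (doc \ "para").flatMap(p => cleanup(p))
}
```

Both values should print the same two <p> elements. flatMap earns its keep when the function returns zero or several nodes per input node.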

If you ever need to convert one dialect of XML to another, check out the Scala XML library! Play around in the REPL. Suck your document in with

import java.io.File, scala.xml.parsing.ConstructingParser
val doc = ConstructingParser.fromFile(new File(filename), true).document()

Take it apart with some query (doc \\ "someElement"), write a simple function to clean up some pieces, and try that in the REPL. It's perplexing at first, but the syntax is surprisingly powerful and effective.
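A sketch of that workflow, small enough to paste into the REPL, might look like the following. To stay self-contained it parses a string with ConstructingParser.fromSource instead of reading a file; the input, the cleanup function, and the names ParseDemo, hits, and result are all hypothetical.

```scala
import scala.io.Source
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.parsing.ConstructingParser

object ParseDemo {
  // Hypothetical input, parsed from a string so no file is needed.
  val input = "<doc><someElement> println(1) </someElement><other/></doc>"

  // Same parser as above; the second argument (true) preserves whitespace.
  val doc = ConstructingParser.fromSource(Source.fromString(input), true).document()

  // Query for all matching descendants.
  val hits: NodeSeq = doc \\ "someElement"

  // Hypothetical cleanup: trim the text and wrap it in <pre>.
  def cleanup(n: Node): Elem = <pre>{ n.text.trim }</pre>

  val result = hits.map(n => cleanup(n))
}
```

Once a query and a cleanup function behave in the REPL, they can be moved into the conversion program as they are.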

Scala prides itself on being a great host language for specialized tasks, and I think it succeeds very well with XML processing. In fact, it is much better than XSLT, which was custom-built for the job. I felt a sense of great relief when I realized that I might never again have to write <xsl:apply-templates select="@* | node()"/>. With a smile, I moved my XSLT bible to the far end of my bookshelf.
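For comparison, here is one way the copy-everything-but-tweak-a-few-nodes pattern behind that XSLT line could be sketched in Scala. The rule (renaming <b> to <strong>), the input, and the names IdentityDemo and transform are invented for illustration; this is not code from either converter.

```scala
import scala.xml.{Elem, Node}

object IdentityDemo {
  // Copies the tree unchanged, except that <b> becomes <strong>.
  // Attributes are carried along by copy; non-element nodes pass through.
  def transform(n: Node): Node = n match {
    case e: Elem if e.label == "b" =>
      e.copy(label = "strong", child = e.child.map(transform))
    case e: Elem =>
      e.copy(child = e.child.map(transform))
    case other => other
  }

  // Hypothetical input.
  val input: Elem = <p class="x">Some <b>bold</b> text</p>
  val output: Node = transform(input)
}
```

IdentityDemo.output should print as <p class="x">Some <strong>bold</strong> text</p>. Adding another special case is one more case clause in the match.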

Comments

Hi, Just to mention that for/yield is just syntactic sugar that gets translated into calls to map and flatMap (among others). Cheers