Transforming an XML Tree with Scala Partial Functions

Posted by cayhorstmann on May 16, 2010 at 8:29 PM PDT

In my last blog, I outlined how I found the Scala XML library a pleasant solution for unpleasant XML format conversion jobs. In those jobs, I had to completely transform the document from one grammar to another.

When you need to make small tweaks to a document, the library is a bit more of a hassle. This page by Burak Emir, the author of the Scala XML library, states: “The Scala XML API takes a functional approach to representing data, eschewing imperative updates where possible. Since nodes as used by the library are immutable, updating an XML tree can be a bit verbose, as the XML tree has to be copied.” A verbose example follows.

Here is what I needed to do. Whenever I had a <div class="example"><p>Filename.java</p></div>, I had to replace it with the contents of the named file, with each line preceded by a line number.

That part is simple:

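import java.io.File
import scala.xml.Node
// (node \ "p").text yields the file name; toString would include the <p> tags
def getExample(node: Node) =
  <ol>{io.Source.fromFile(new File((node \ "p").text.trim)).getLines().map(
    w => <li><pre>{w} </pre></li>)}</ol>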

But how can you say “Do this for all <div class="example">, and leave the rest alone?”

In a functional program, you need to copy the tree, so I figured I should write a universal transformer method.

/**
* Transforms all descendants matching a predicate.
* n a node
* pred the predicate to match
* trans the transformation to apply to matching descendants
*/
def transformIf(n: Node, pred: (Node)=>Boolean, trans: (Node)=>Node): Node =
  if (pred(n)) trans(n) else
    n match {
      case e: Elem =>
        if (e.descendant.exists(pred))
          // Named argument: only the children change, and the call does not
          // depend on the exact parameter list of Elem.copy
          e.copy(child = e.child.map(transformIf(_, pred, trans)))
        else e
      case _ => n
    }

The if (e.descendant.exists(pred)) part isn't strictly necessary. I just wanted to reuse nodes when there was no need for rewriting.
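
To see the method at work, here is a small sketch. It assumes the scala-xml module is on the classpath (it has shipped separately from the standard library since Scala 2.11). The sample document and the isNote predicate are made up for illustration. The method is repeated so the sketch stands alone, and it calls copy(child = ...) so that it does not depend on the exact parameter list of Elem.copy in a given scala-xml version.

```scala
import scala.xml.{Elem, Node}

// Same logic as the method above, repeated so the sketch is self-contained.
def transformIf(n: Node, pred: Node => Boolean, trans: Node => Node): Node =
  if (pred(n)) trans(n) else
    n match {
      case e: Elem =>
        if (e.descendant.exists(pred))
          e.copy(child = e.child.map(transformIf(_, pred, trans)))
        else e
      case _ => n
    }

// Hypothetical input: turn every <note> into an <em>, leave the rest alone.
val doc = <body><p>Keep me</p><div><note>Change me</note></div></body>
val isNote = (n: Node) => n.label == "note"
val result = transformIf(doc, isNote, n => <em>{n.text}</em>)

println(result)
```

Because of the descendant check, the untouched <p> subtree is not copied: result.child.head eq doc.child.head holds.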

This solved my immediate problem.

It turned out that I needed to change some other nodes as well. I could have done two transforms, or rewritten my method to take a sequence of (predicate, transformer) pairs. But then I remembered something about partial functions in the actor library.

This blog brought me up to speed. A case expression { case ... => ...; case ... => ... } can be converted to a PartialFunction. There are methods for checking whether a value is covered by one of the cases, and for applying the function. In other words, I could trivially extend my previous method to partial functions:
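
A minimal sketch of those two methods, using only the standard library. The describe function and its cases are invented for illustration.

```scala
// A case block typed as a PartialFunction; only some Ints are covered.
val describe: PartialFunction[Int, String] = {
  case 0 => "zero"
  case n if n < 0 => "negative"
}

println(describe.isDefinedAt(0))  // is 0 covered by one of the cases?
println(describe.isDefinedAt(5))  // 5 matches no case
println(describe(-3))             // apply: runs the matching case

// Applying the function outside its domain throws a MatchError,
// so check first, or use lift / applyOrElse to supply a fallback.
println(describe.lift(5))
println(describe.applyOrElse(5, (_: Int) => "uncovered"))
```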

def transform(n: Node, pf: PartialFunction[Node, Node]): Node =
  transformIf(n, pf.isDefinedAt(_), pf.apply(_))

Burak Emir explains how one can write case statements that check conditions with attributes. This is what it looks like.

transform(doc.docElem, {
    case node @ <div>{_*}</div> if  
      node.attribute("class").getOrElse("").toString == "example" => getExample(node)
    // Other cases go here
    case ... => ...
  })

It reads quite nicely. When you have a div whose class attribute is example, call the getExample method.
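
Here is a self-contained sketch of the whole idea that runs without any files, again assuming scala-xml is on the classpath. Several things in it are assumptions of mine rather than the code I actually used: fakeExample is a hypothetical stand-in for getExample that returns canned lines, transform is a compact variant that skips the descendant check, the guard uses (node \ "@class").text as an alternative way to read the attribute, and the h1 case is an invented second rule.

```scala
import scala.xml.{Elem, Node}

// Compact variant: rewrite a node wherever pf is defined, otherwise recurse.
def transform(n: Node, pf: PartialFunction[Node, Node]): Node =
  if (pf.isDefinedAt(n)) pf(n) else n match {
    case e: Elem => e.copy(child = e.child.map(transform(_, pf)))
    case _ => n
  }

// Hypothetical stand-in for getExample: canned lines instead of file contents.
def fakeExample(node: Node): Node = {
  val name = (node \ "p").text
  <ol>{Seq("// " + name, "class Hello {}").map(w => <li><pre>{w}</pre></li>)}</ol>
}

val page =
  <body><div class="example"><p>Hello.java</p></div><h1>Title</h1></body>

val out = transform(page, {
  // The XML pattern ignores attributes, so the guard checks the class.
  case node @ <div>{_*}</div> if (node \ "@class").text == "example" =>
    fakeExample(node)
  // An invented second rule: demote h1 headings to h2.
  case e: Elem if e.label == "h1" => e.copy(label = "h2")
})

println(out)
```

Note that a matched node is replaced wholesale: the <p> inside the example div is never visited, because the div case fires first.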

Eat your heart out, Java!

There is a larger message here. Consider again the task described in this blog, i.e., replacing <div class="example"><p>Filename.java</p></div> with <ol><li><pre>each line in that file</pre></li></ol>. Yes, I could program it in Java, but the thought makes my skin crawl.

A while ago, I resolved to use Scala for all my little processing tasks so that I would get to know it over time. It was painful at first: tasks that I knew I could have completed easily in Java took some research and definitely took me out of my comfort zone. But over time, this has paid off. I can now easily do tasks in Scala that I would never have attempted in Java.

Comments

Slick.

I was wondering about using match statements and how they were actually implemented. I always enjoy your Scala posts. Scala does seem to have some very nice XML features. I wish I had more time to learn about it.

--Glenn J

"but the thought makes my skin crawl"

Really? Skin crawl? I'm sorry, but I think you are overthinking this. Your requirements are to replace one well-formed chunk of text with another. It really doesn't matter what the node structure of this XML/HTML happens to be. I can't seem to post the code here because the formatting gets messed up. You can look at it here if you are interested: https://docs.google.com/Doc?docid=0ATmm-LHfhXWbZGZkbWNzOHdfMmR0czlqYmhy&.... The important part, I think, is the searching code.

public String process(String source) {
    Pattern pattern = Pattern.compile(P_START_TAG + WHITE_SPACE + JAVA_FILE_NAME + WHITE_SPACE + P_END_TAG);
    Matcher matcher = pattern.matcher(source);

    StringBuffer output = new StringBuffer();
    while (matcher.find()) {
        matcher.appendReplacement(output, createReplacement(matcher.group()));
    }
    matcher.appendTail(output);

    return output.toString();
}

I know this code is not perfect but is this really skin crawl worthy? Even if I add I/O is it really that bad?

Collin

Code that "is not perfect"

Code that "is not perfect" doesn't really work out too well when you need to transform a book from one format to another without losing information. Can we make it perfect?

You aren't checking for <div class="example">, so you'd have to put that into your regex. Wait, that's <\s*div\s+class\s*=\s*('example'|"example")\s*>

Is it a problem if there is a line break anywhere in the pattern? I suppose we can read the entire file into a string.

Did I mention that the <p> can have attributes (which can be ignored)? Ok, that would be <p(\s+\w+\s*=\s*('[^']*'|"[^"]*"))*\s*> or some such thing.

So, yes, this particular transformation can be done with regex alone, but it wouldn't be pretty. Other scenarios that I also need (such as "a div of one class anywhere inside a div of another") would be impossible.
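
For comparison, here is a rough sketch of that nested-div query with the Scala XML library, assuming scala-xml is on the classpath. The class names outer and inner, the helper names, and the sample document are all made up for illustration. If outer divs can nest, the same inner div may be reported more than once.

```scala
import scala.xml.Node

// True if n is a <div> whose class attribute equals cls.
def hasClass(n: Node, cls: String): Boolean =
  n.label == "div" && (n \ "@class").text == cls

// All div.inner nodes that sit anywhere below a div.outer.
def innerWithinOuter(root: Node): Seq[Node] =
  root.descendant_or_self
    .filter(hasClass(_, "outer"))
    .flatMap(_.descendant)
    .filter(hasClass(_, "inner"))

val sample =
  <body><div class="inner">no</div><div class="outer"><p><div class="inner">yes</div></p></div></body>

println(innerWithinOuter(sample).map(_.text))
```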

At any rate, having been around this particular block a few times, I recall Jamie Zawinski's immortal words: "Some people, when confronted with a problem, think `I know, I'll use regular expressions.' Now they have two problems".

I would never consider using anything but an XML parser for processing XML. Between character sets and entities and comments and CDATA blocks, there are just too many things that will show up in your input XML file when you least expect it.