XPath-like Queries Against XML Documents in Scala

Scala comes with a powerful, purely functional XML library, which supports XML matching functionality inspired by XPath. At first glance, it seems quite restricted compared to XPath, as you can only query for tag and attribute names, without adding conditions on tag content or attribute values, and consequently several introductions to this functionality deplore its limited power. But this is misguided: all the matchers return NodeSeqs, which implement all the methods required for Scala's monadic for-comprehension syntax, and monads are a natural encoding for queries. (And, unlike XPath expressions embedded as strings in a host language, the resulting queries are not even stringly typed!)
If you think about it for a moment, you will realize that there is a direct syntactic correspondence between queries in any common query language and Scala's for-comprehensions. An SQL query like this:
SELECT foo FROM bar WHERE condition
corresponds directly to a for-comprehension that looks somewhat like this:
for (foo <- bar; if condition) yield { foo }
And in fact, Slick (née scala-query) uses this exact idea to construct JDBC queries against relational databases from for-comprehensions. The same idea is also the motivation behind Microsoft's LINQ technology. So we are in good company if we use the same concepts to query our XML documents.
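To make the correspondence concrete, here is a minimal sketch that runs the "query" against an ordinary in-memory list. The names Row, bar and published are made up for this illustration, and published stands in for the condition of the SQL query. The second value spells out roughly what the compiler turns the comprehension into.

```scala
object SelectDemo {
  // Hypothetical "table": each row has a column foo and a flag used as the condition.
  final case class Row(foo: String, published: Boolean)
  val bar = List(Row("a", true), Row("b", false), Row("c", true))

  // SELECT foo FROM bar WHERE published
  val viaComprehension = for (row <- bar; if row.published) yield row.foo

  // The same query written with the methods the comprehension is translated to.
  val desugared = bar.withFilter(row => row.published).map(row => row.foo)
}
```

Any type that offers withFilter, map and flatMap can be used in a comprehension this way, which is why NodeSeq qualifies.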

A Quick Recap Of Scala's XML Query Functionality

The NodeSeq class provides two query or projection functions, \ and \\, both of which take a string as argument. The string can either be an element name to match, the wildcard "_" that matches any element, or a string starting with @, which matches an attribute. The difference between \ and \\ is that the former matches only direct children of the nodes in the NodeSeq, while the latter also returns matches among all nested descendants, similar to the distinction between / and // in XPath.
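A small sketch of the two projections in action, assuming the scala-xml library is on the classpath. The sample document and the name doc are invented for this illustration.

```scala
import scala.xml.XML

object ProjectionDemo {
  // Made-up sample document, written without whitespace between the tags.
  val doc = XML.loadString(
    """<html><body><div id="main"><p><a href="/one">one</a></p></div><p><a href="/two">two</a></p></body></html>""")

  // \ only looks at direct children: html has no p child, so this is empty.
  val directParagraphs = doc \ "p"

  // \\ searches all descendants and finds both p elements.
  val allParagraphs = doc \\ "p"

  // The wildcard matches every child element of body: the div and the p.
  val bodyChildren = doc \ "body" \ "_"

  // A name starting with @ selects attributes instead of elements.
  val allHrefs = (doc \\ "@href").map(_.text).toList

  // Reading the attribute node by node gives the same values.
  val hrefPerLink = (doc \\ "a").map(a => (a \ "@href").text).toList
}
```

Asking each a element for its own attribute, as in the last line, keeps the attribute lookup on a single node, which is also the form the comprehensions further down rely on.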

So translating an XPath query that involves no conditions is a simple matter of flipping all slashes and quoting all names:

//div/p/a
translates to the equivalent Scala query (assuming the document is in the variable xml)
xml \\ "div" \ "p" \ "a"
or equivalently (but in this case unnecessarily)
for (node <- xml \\ "div" \ "p" \ "a") yield {
  node
}

Querying With Comprehensions

But what if we only want the links in the sidebar div? The XPath query for that is
//div[@id='sidebar']/p/a
Translating it to a comprehension is straightforward:
for {
  div <- xml \\ "div"
  if (div \ "@id").text == "sidebar"
  node <- div \ "p" \ "a"
} yield { node }
The first line collects all the div elements in the document, the second filters the result down to those that have an id attribute with the text "sidebar" (which should be only one in a valid document) and the final line collects all the direct a children of the direct p children of those elements. This encoding has two immediately obvious disadvantages relative to the XPath encoding: for one, it is quite a bit longer (but then it already contains all the code to apply the query to a document and extract the result, which is missing in the bare XPath query), and, more importantly, it forces the programmer to choose names for several of the intermediate steps of the search. But it also has some distinctive advantages. One is that the structure of the query is checked at compile time, while XPath queries are usually only checked once the program is run. And depending on the error and the XPath engine used, a syntax error in an XPath query may only manifest itself silently as an unexpectedly empty result set.
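To see the three lines at work, here is a sketch that runs the comprehension against a small made-up document, assuming scala-xml is on the classpath. The second value is an approximation of what the compiler generates from the comprehension, written out by hand.

```scala
import scala.xml.XML

object SidebarDemo {
  // Invented sample page with a sidebar and a content div.
  val xml = XML.loadString(
    """<html><body>
      |<div id="sidebar"><p><a href="/about">About</a> <a href="/archive">Archive</a></p></div>
      |<div id="content"><p><a href="/post">Post</a></p></div>
      |</body></html>""".stripMargin)

  // The query from the text: //div[@id='sidebar']/p/a
  val sidebarLinks = for {
    div <- xml \\ "div"
    if (div \ "@id").text == "sidebar"
    node <- div \ "p" \ "a"
  } yield node

  // Roughly the method calls the comprehension stands for.
  val desugared = (xml \\ "div")
    .withFilter(div => (div \ "@id").text == "sidebar")
    .flatMap(div => (div \ "p" \ "a").map(node => node))

  // The link targets found in the sidebar.
  val targets = sidebarLinks.map(a => (a \ "@href").text).toList
}
```

The condition is an ordinary Boolean expression, so anything that can be computed from div may appear there, not just attribute comparisons.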

Abstraction Over Queries

But the real advantage of the fact that the query syntax is directly embedded into a full programming language is that you have the full power of Scala available to formulate your queries and to abstract out parts of a query.

Consider for example a rather common problem: find all elements of a given type that have a certain class attribute. This is actually non-trivial; people have written whole articles just about this problem. The best XPath solution is

//div[contains(concat(' ',normalize-space(@class),' '),' foo ')]
which works by performing some whitespace manipulation that reproduces the tokenization rules for class attributes. But the purpose of that code is not immediately obvious, and if you want to test for another class somewhere else, you have to repeat the code with the appropriate changes. Your best bet is probably to write a helper function like
def hasClass(cls : String) = {
  s"[contains(concat(' ',normalize-space(@class),' '),' $cls ')]"
}
and then build your queries using the tried and true abstraction mechanism of string concatenation.
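For comparison, this is roughly what the string-based approach looks like when it is actually run. The sketch uses the XPath engine from the JDK (javax.xml.xpath) as one possible choice. The sample markup and the object name are assumptions made for this illustration.

```scala
import java.io.StringReader
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

object XPathStringDemo {
  // The helper from the text: it returns a predicate as a plain string.
  def hasClass(cls: String): String =
    s"[contains(concat(' ',normalize-space(@class),' '),' $cls ')]"

  // Made-up input: only the first and third div carry the class foo.
  val html =
    """<body><div class="foo bar">a</div><div class="foobar">b</div><div class=" baz  foo ">c</div></body>"""

  // The query is assembled by string concatenation and only parsed at run time.
  val query = "//div" + hasClass("foo")

  val hits = XPathFactory.newInstance().newXPath()
    .evaluate(query, new InputSource(new StringReader(html)), XPathConstants.NODESET)
    .asInstanceOf[NodeList]

  // Text content of the matching div elements.
  val texts = (0 until hits.getLength).map(i => hits.item(i).getTextContent).toList
}
```

Note that nothing stops a caller from passing a class name containing a quote character, which would produce a broken query string. This sketch does not guard against that.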

In Scala, you would also define a helper method (probably on an implicit helper class that collects all your relevant query methods) that implements the class check:

import scala.xml.Node

implicit class RichNode(val xml: Node) extends AnyVal {
  // True if cls is one of the whitespace-separated tokens of the class attribute.
  def hasClass(cls: String): Boolean = {
    (xml \ "@class").text.split("\\s+").contains(cls)
  }
}
With this in place, the query written in Scala does actually look nicer than the plain XPath query:
for {
  div <- xml \\ "div" if div hasClass "foo"
} yield { div }
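Put together as a self-contained sketch, assuming scala-xml is on the classpath. The sample markup and the names ClassQueryDemo and doc are made up here, and the helper class is placed inside an object so that the value class compiles in an ordinary source file.

```scala
import scala.xml.{Node, XML}

object ClassQueryDemo {
  implicit class RichNode(val xml: Node) extends AnyVal {
    // Split the class attribute on whitespace and look for an exact token match.
    def hasClass(cls: String): Boolean =
      (xml \ "@class").text.split("\\s+").contains(cls)
  }

  // Invented input: "foobar" must not match, a missing class attribute must not fail.
  val doc = XML.loadString(
    """<body><div class="foo bar">a</div><div class="foobar">b</div><div class=" baz  foo ">c</div><div>d</div></body>""")

  val fooDivs = for {
    div <- doc \\ "div" if div hasClass "foo"
  } yield div

  // Text content of the matching div elements.
  val texts = fooDivs.map(_.text).toList
}
```

Testing for another class somewhere else is now just another call to hasClass, with no string to assemble.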

Exercise

As said above, for-comprehensions can be seen as just another syntax to describe arbitrary queries. So use the same techniques used in scala-query to construct XPath queries from comprehensions, to save the user of your library from concatenating strings by hand.
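As a possible starting point, here is a deliberately naive sketch of such a builder. It is not how scala-query or Slick work internally, and all names in it (Path, Attr, Pred, root, the === operator) are invented for this illustration. The trick is only that a type with map, flatMap and withFilter can be used in a comprehension, whatever those methods actually do. Here they assemble a string.

```scala
object XPathBuilderSketch {
  // A condition that ends up inside an XPath predicate [...].
  final case class Pred(expr: String)

  // An attribute of the current node, relative to the predicate's context.
  final case class Attr(name: String) {
    // No escaping of quotes in value: a real library would have to handle that.
    def ===(value: String): Pred = Pred(s"@$name='$value'")
  }

  // The XPath expression built so far.
  final case class Path(xpath: String) {
    def \(name: String): Path = Path(s"$xpath/$name")
    def \\(name: String): Path = Path(s"$xpath//$name")
    def attr(name: String): Attr = Attr(name)

    // An "if" in a comprehension becomes a predicate on the current step.
    def withFilter(p: Path => Pred): Path = Path(s"$xpath[${p(this).expr}]")

    // Generators simply continue the path built so far.
    def map(f: Path => Path): Path = f(this)
    def flatMap(f: Path => Path): Path = f(this)
  }

  val root = Path("")

  // Same shape as the sidebar query, but the result is an XPath string.
  val query = for {
    div <- root \\ "div"
    if div.attr("id") === "sidebar"
    a <- div \ "p" \ "a"
  } yield a
}
```

With these definitions, XPathBuilderSketch.query.xpath is the string //div[@id='sidebar']/p/a. Functions like contains or normalize-space, and anything beyond a single attribute comparison, are left for the actual exercise.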
Florian Hars <florian@hars.de>, 2016-01-30 (orig: 2014-02-26)