for-comprehensions. An SQL query like this:
SELECT foo FROM bar WHERE condition
corresponds to a for-comprehension that looks somewhat like this:
for (foo <- bar; if condition) yield { foo }
for-comprehensions to construct JDBC queries against
relational databases. The same idea is also the motivation behind
Microsoft's LINQ technology. So we are in good company if we use the same
concepts to query our XML documents.«
A Quick Recap Of Scala's XML Query Functionality
The NodeSeq class provides two query or projection functions, \ and \\,
that both take a string as argument. The string can either be an element
name to match, the wildcard "_" that matches any element, or a string
starting with @, which matches an attribute. The difference between \ and
\\ is that the former matches only direct children of the nodes in the
NodeSeq, while the latter also returns matches in all nested subsequences,
similar to the distinction between / and // in XPath.
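So translating an XPath query that involves no conditions is a simple matter of flipping all slashes and quoting all names:
//div/p/a
becomes the following (assuming the document is bound to the name xml):
xml \\ "div" \ "p" \ "a"
or, written as a for-comprehension: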
for (node <- xml \\ "div" \ "p" \ "a") yield {
  node
}
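To make the two projection functions concrete, here is a small sketch that can be pasted into a REPL or script. It assumes the scala-xml module is on the classpath; the sample document and all value names (doc, directDivs, and so on) are invented for illustration.

```scala
import scala.xml.XML

// Hypothetical sample document, parsed from a string.
val doc = XML.loadString(
  """<html>
    |  <body>
    |    <div id="main"><p><a href="/one">one</a></p></div>
    |    <div id="sidebar"><p><a href="/two">two</a></p></div>
    |  </body>
    |</html>""".stripMargin)

// \ only looks at direct children: html has no div child, so this is empty.
val directDivs = doc \ "div"

// \\ searches all nested levels, like // in XPath.
val allDivs = doc \\ "div"

// The wildcard "_" matches any direct child element.
val childLabels = (doc \ "_").map(_.label)

// A leading @ selects an attribute instead of an element.
val divIds = allDivs.map(div => (div \ "@id").text)

// The translation of //div/p/a
val linkTexts = (doc \\ "div" \ "p" \ "a").map(_.text)
```

Note that element and attribute names are plain strings, so a misspelled name still compiles and simply yields an empty result.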
Querying With Comprehensions
But what if we only want the links in the sidebar div? The XPath query for that is
//div[@id='sidebar']/p/a
and a corresponding for-comprehension looks like this:
for {
  div <- xml \\ "div"
  if (div \ "@id").text == "sidebar"
  node <- div \ "p" \ "a"
} yield { node }
The first line selects all div elements in the document, the second
filters the result down to those that have an id attribute with the text
"sidebar" (which should be only one in a valid XML document), and the
final line collects all the direct a children of the direct p children of
those elements. This encoding has two immediately obvious disadvantages
relative to the XPath encoding: for one, it is quite a bit longer (but
then it already contains all the code to apply the query to a document
and extract the result, which is missing in the bare XPath query), and,
more importantly, it forces the programmer to choose names for several of
the intermediate steps of the search.
But it also has some distinct advantages. One is that the structure of
the query is checked at compile time, while XPath queries are usually
only checked once the program is run. And depending on the error and the
XPath engine used, a syntax error in an XPath query may only manifest
itself silently as an unexpectedly empty result set.
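As a runnable illustration of the sidebar query, the following sketch applies it to a small document and also spells out roughly what the compiler turns the comprehension into. It assumes scala-xml is on the classpath; the document and the names page, sidebarLinks and desugared are made up for this example.

```scala
import scala.xml.XML

// Hypothetical page with one sidebar div and one other div.
val page = XML.loadString(
  """<html><body>
    |  <div id="main"><p><a href="/one">one</a></p></div>
    |  <div id="sidebar"><p><a href="/two">two</a></p></div>
    |</body></html>""".stripMargin)

// The comprehension form of //div[@id='sidebar']/p/a
val sidebarLinks = for {
  div  <- page \\ "div"
  if (div \ "@id").text == "sidebar"
  node <- div \ "p" \ "a"
} yield node

// Roughly the same query written with the methods the comprehension expands to.
val desugared = (page \\ "div")
  .withFilter(div => (div \ "@id").text == "sidebar")
  .flatMap(div => div \ "p" \ "a")
```

Both values contain the single link from the sidebar. The intermediate names (div, node) are the price mentioned above, but they also make each step available to ordinary Scala code.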
Abstraction Over Queries
But the real advantage of embedding the query syntax directly into a full programming language is that you have the full power of Scala available to formulate your queries and to abstract out parts of a query.
Consider for example a rather common problem: find all elements of a given type that have a certain class attribute. This is actually non-trivial; people have written whole articles just about this problem. The best solution is
//div[contains(concat(' ',normalize-space(@class),' '),' foo ')]
With string-based XPath queries, you would probably wrap this in a helper function that builds the predicate:
def hasClass(cls: String): String = {
  s"[contains(concat(' ',normalize-space(@class),' '),' $cls ')]"
}
In Scala, you would also define a helper method (probably on an implicit helper class that collects all your relevant query methods) that implements the class check:
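import scala.xml.Node
// A value class (extends AnyVal) cannot be nested inside a class or method;
// put it in an object.
implicit class RichNode(val xml: Node) extends AnyVal {
  def hasClass(cls: String): Boolean = {
    // Split the class attribute on whitespace and look for an exact token match.
    (xml \ "@class").text.split("\\s+").contains(cls)
  }
}
With this helper in scope, the query becomes: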
for {
  div <- xml \\ "div" if div hasClass "foo"
} yield { div }
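To see the class check at work, and to show how a whole query can be abstracted into a function, here is a sketch in script form. It assumes scala-xml is on the classpath. The helper is a plain implicit class here (no AnyVal) so that it can be pasted anywhere; elementsWithClass and the sample document are hypothetical names invented for this example.

```scala
import scala.xml.{Node, XML}

// Same check as above, written as an ordinary implicit class.
implicit class RichNode(private val node: Node) {
  def hasClass(cls: String): Boolean =
    (node \ "@class").text.split("\\s+").contains(cls)
}

// A reusable query: all elements named `tag` below `root` carrying class `cls`.
def elementsWithClass(root: Node, tag: String, cls: String): Seq[Node] =
  for {
    el <- root \\ tag if el.hasClass(cls)
  } yield el

// Hypothetical document covering the tricky cases.
val sample = XML.loadString(
  """<body>
    |  <div id="a" class="foo bar"/>
    |  <div id="b" class="foobar"/>
    |  <div id="c" class=" baz  foo "/>
    |  <div id="d"/>
    |</body>""".stripMargin)

val fooIds = elementsWithClass(sample, "div", "foo").map(el => (el \ "@id").text)
```

Only the divs a and c match: "foobar" is rejected because the comparison is per token rather than by substring, and the missing attribute on d yields an empty string that matches nothing.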
Exercise
As said above, for-comprehensions can be seen as just another syntax to
describe arbitrary queries. So use the same techniques used in
scala-query to construct XPath queries from comprehensions, to save the
users of your library from concatenating strings by hand.
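One possible starting point for the exercise is sketched below. It is not how scala-query works internally, just a minimal illustration of the underlying trick: a for-comprehension is rewritten into calls to withFilter, flatMap and map, so a type that implements those methods symbolically can assemble an XPath string instead of running a query. All names here (Path, Attr, Pred, ===, hasClass) are hypothetical, and string values are not escaped.

```scala
// A predicate that goes inside [...] in the generated XPath.
final case class Pred(expr: String)

// An attribute of the current node, used to build predicates.
final case class Attr(name: String) {
  def ===(value: String): Pred = Pred(s"@$name='$value'")
}

// A symbolic XPath expression; nothing is evaluated, only text is built.
final case class Path(expr: String) {
  def /(name: String): Path = Path(s"$expr/$name")
  def attr(name: String): Attr = Attr(name)
  def hasClass(cls: String): Pred =
    Pred(s"contains(concat(' ',normalize-space(@class),' '),' $cls ')")

  // `if` guards in a comprehension become withFilter calls.
  def withFilter(p: Path => Pred): Path = Path(s"$expr[${p(this).expr}]")
  // Generators become flatMap/map; here they just continue the path.
  def flatMap(f: Path => Path): Path = f(this)
  def map(f: Path => Path): Path = f(this)
}

val sidebarLinks: Path = for {
  div <- Path("//div")
  if div.attr("id") === "sidebar"
  a   <- div / "p" / "a"
} yield a

val fooDivs: Path = for {
  div <- Path("//div") if div.hasClass("foo")
} yield div
```

With these definitions, sidebarLinks.expr is //div[@id='sidebar']/p/a and fooDivs.expr is the class-matching expression shown earlier. A fuller solution would need to handle escaping, predicates on nested paths, and yields that are more than a plain path.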