Using arbitrary SAX sources in scala 2.8
By default, scala reads XML files using the standard SAX parser that comes with your JRE. If you want to change this, you have to supply another one. Since scala 2.8, several of the methods of the scala XML framework accept a SAX parser to use, for example this method in FactoryAdapter:
def loadXML (source: InputSource, parser: SAXParser) : Node
The parser argument has to be a SAXParser, which is the wrapper returned by the builtin SAX parser factory that can be configured via command line parameters or global properties when starting the JVM. SAXParser is not the interface implemented by normal SAX parsers. That interface is called XMLReader (there is a cheap joke about cache invalidation and off-by-one errors lurking somewhere around here).
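To make the difference between the two types concrete, here is a minimal sketch that uses only the JDK. The object name ParserTypes and its two helper methods are made up for illustration. It asks the builtin JAXP factory for a SAXParser and then unwraps the XMLReader hidden inside it.

```scala
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import org.xml.sax.XMLReader

// Hypothetical helper object, not part of scala.xml or the JDK.
object ParserTypes {
  // The JAXP factory hands out the SAXParser wrapper ...
  def defaultParser(): SAXParser = {
    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newSAXParser()
  }

  // ... and the wrapper exposes the XMLReader it wraps.
  def readerOf(parser: SAXParser): XMLReader = parser.getXMLReader
}
```

Going from a SAXParser to its XMLReader is a single call. The other direction has no such shortcut, which is why a library that only offers an XMLReader needs different treatment.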
Some libraries give access to a SAXParser that you can use directly, like in
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
val source = new org.xml.sax.InputSource("http://www.scala-lang.org")
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
adapter.loadXML(source, parser)
With other libraries, you have to use the XMLReader you've got and ignore any SAXParser handling code added in 2.8, like this:
import org.xml.sax.InputSource
import scala.xml._
import parsing._

class HTML5Parser extends NoBindingFactoryAdapter {
  // the SAXParser handed in by scala 2.8 is ignored on purpose
  override def loadXML(source: InputSource, _p: SAXParser): Node = {
    loadXML(source)
  }

  // drive the XMLReader of the HTML5 parser directly
  def loadXML(source: InputSource): Node = {
    import nu.validator.htmlparser.{sax,common}
    import sax.HtmlParser
    import common.XmlViolationPolicy

    val reader = new HtmlParser
    reader.setXmlPolicy(XmlViolationPolicy.ALLOW)
    reader.setContentHandler(this)
    reader.parse(source)
    rootElem
  }
}
The corresponding sbt dependency is:
val html5parser = "nu.validator.htmlparser" % "htmlparser" % "1.2.1"
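The two calls that matter in HTML5Parser, setContentHandler and parse, are not specific to that parser. Any XMLReader is driven the same way. The following sketch shows the pattern with the JDK's default reader only. ReaderDriver, ElementCounter and parseWith are hypothetical names chosen for this example, and the counter merely stands in for the FactoryAdapter that would normally receive the events.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, ContentHandler, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical demo object, not part of scala.xml.
object ReaderDriver {
  // Stand-in for a FactoryAdapter: counts the start tags it is told about.
  class ElementCounter extends DefaultHandler {
    var count = 0
    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      count += 1
  }

  // The same two steps HTML5Parser performs on its HtmlParser.
  def parseWith(reader: XMLReader, handler: ContentHandler,
                source: InputSource): Unit = {
    reader.setContentHandler(handler)
    reader.parse(source)
  }

  def demo(): Int = {
    val reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader
    val counter = new ElementCounter
    parseWith(reader, counter, new InputSource(new StringReader("<a><b/><b/></a>")))
    counter.count
  }
}
```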
The old way to use a SAX parser for scala 2.7
In scala 2.7 you always had to supply a new FactoryAdapter that supplies an appropriate XMLReader. This can be done by implementing the abstract method getReader in the following trait.
import _root_.org.xml.sax.{XMLReader,InputSource}
import _root_.scala.xml.{Node,TopScope}

trait SAXFactoryAdapter extends NonBindingFactoryAdapter {
  /** The method [getReader] has to be implemented by
      concrete subclasses */
  def getReader() : XMLReader

  override def loadXML(source : InputSource) : Node = {
    val reader = getReader()
    reader.setContentHandler(this)
    scopeStack.push(TopScope)
    reader.parse(source)
    scopeStack.pop
    rootElem
  }
}
The NonBindingFactoryAdapter used here is a variant of scala's standard scala.xml.parsing.NoBindingFactoryAdapter that has been turned into a trait:
import _root_.scala.xml.parsing.FactoryAdapter
import _root_.scala.xml.factory.NodeFactory
import _root_.scala.xml.{Elem,MetaData,NamespaceBinding,
                         Node,Text,TopScope}

trait NonBindingFactoryAdapter extends FactoryAdapter
                               with NodeFactory[Elem] {
  def nodeContainsText(localName: String) = true

  // methods for NodeFactory[Elem]

  /** constructs an instance of scala.xml.Elem */
  protected def create(pre: String, label: String,
                       attrs: MetaData, scpe: NamespaceBinding,
                       children: Seq[Node]): Elem =
    Elem( pre, label, attrs, scpe, children:_* )

  // -- methods for FactoryAdapter

  def createNode(pre: String, label: String,
                 attrs: MetaData, scpe: NamespaceBinding,
                 children: List[Node] ): Elem =
    Elem( pre, label, attrs, scpe, children:_* )

  def createText(text: String) = Text(text)

  def createProcInstr(target: String, data: String) =
    makeProcInstr(target, data)
}
Using a DOM parser in scala 2.7
Using a DOM parser is a little bit more tricky, as scala assumes SAX input. We have to get rid of a method in FactoryAdapter by making it always throw an exception and replace it by an equivalent method that operates on a DOM node. Then we have to override all other load methods to call this method instead.
import _root_.java.io.{InputStream, InputStreamReader, Reader,
                       File, FileDescriptor, FileInputStream}
import _root_.org.apache.xalan.xsltc.trax.DOM2SAX
import _root_.org.xml.sax.InputSource
import _root_.scala.xml.{Node,TopScope}

trait DOMFactoryAdapter extends NonBindingFactoryAdapter {
  def getDOM(reader: Reader) : _root_.org.w3c.dom.Node

  /** loading from a SAX source is useless here */
  override def loadXML(unused : InputSource) : Node = {
    throw(new Exception("Not Implemented"))
  }

  def loadXML(dom: _root_.org.w3c.dom.Node) : Node = {
    val dom2sax = new DOM2SAX(dom)
    dom2sax.setContentHandler(this)
    scopeStack.push(TopScope)
    dom2sax.parse()
    scopeStack.pop
    return rootElem
  }

  /** loads XML from given file */
  override def loadFile(file: File): Node = {
    val is = new FileInputStream(file)
    val elem = load(is)
    is.close
    elem
  }

  /** loads XML from given file descriptor */
  override def loadFile(fileDesc: FileDescriptor): Node = {
    val is = new FileInputStream(fileDesc)
    val elem = load(is)
    is.close
    elem
  }

  /** loads XML from given file */
  override def loadFile(fileName: String): Node = {
    val is = new FileInputStream(fileName)
    val elem = load(is)
    is.close
    elem
  }

  /** loads XML from given InputStream */
  override def load(is: InputStream): Node =
    load(new InputStreamReader(is))

  /** loads XML from given Reader */
  override def load(reader: Reader): Node =
    loadXML(getDOM(reader))

  /** loads XML from given sysID */
  override def load(sysID: String): Node = {
    val is = new java.net.URL(sysID).openStream()
    val elem = load(is)
    is.close
    elem
  }
}
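The trait above depends on Xalan's DOM2SAX class. If Xalan is not on the classpath, the same idea (replaying a DOM tree as SAX events) can be sketched with the identity transformer that ships with the JDK. This is a different technique from the one used above, not a drop-in from the original code. DomToSax, NameCollector and demo are hypothetical names for this example.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.sax.SAXResult
import org.xml.sax.{Attributes, ContentHandler, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical demo object, not part of scala.xml or Xalan.
object DomToSax {
  /** Replays a DOM node as SAX events on the given handler,
      using an identity transformation. */
  def replay(dom: org.w3c.dom.Node, handler: ContentHandler): Unit = {
    val transformer = TransformerFactory.newInstance().newTransformer()
    transformer.transform(new DOMSource(dom), new SAXResult(handler))
  }

  /** Demo handler that records the qualified names of all start tags. */
  class NameCollector extends DefaultHandler {
    val names = scala.collection.mutable.ListBuffer[String]()
    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      names += qName
  }

  /** Builds a small DOM tree and replays it into a NameCollector. */
  def demo(): List[String] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val doc = factory.newDocumentBuilder()
      .parse(new InputSource(new StringReader("<a><b/><c/></a>")))
    val collector = new NameCollector
    replay(doc, collector)
    collector.names.toList
  }
}
```

Since FactoryAdapter is itself a SAX handler, a call like replay(dom, this) could presumably take the place of the three dom2sax lines in loadXML(dom), with the scopeStack handling left as it is. I have not tried that variant against scala 2.7.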
Reading HTML
To read HTML, we might as well tell scala that the empty HTML elements don't contain any text we might be interested in:
import _root_.scala.xml.parsing.FactoryAdapter

trait HTMLFactoryAdapter extends FactoryAdapter {
  val emptyElements = Set("area", "base", "br", "col", "hr", "img",
                          "input", "link", "meta", "param")

  def nodeContainsText(localName: String) =
    !(emptyElements contains localName)
}
What is left to do is to implement the getReader methods for the sanitizing HTML parsers we want to use. Only two of the sanitizers compared by Ben McCann a year ago support SAX, TagSoup and nekoHTML, so I present example code for these two. In addition, I present code for one of the DOM parsers, HTMLCleaner. Generalizing it to other parsers should be trivial.
TagSoup
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

class TagSoupFactoryAdapter extends SAXFactoryAdapter
                            with HTMLFactoryAdapter {
  private val parserFactory = new SAXFactoryImpl
  parserFactory.setNamespaceAware(true)

  def getReader() = parserFactory.newSAXParser().getXMLReader()
}
nekoHTML
import org.cyberneko.html.parsers.SAXParser

class NekoHTMLFactoryAdapter extends SAXFactoryAdapter
                             with HTMLFactoryAdapter {
  def getReader() = new SAXParser
}
HTMLCleaner
import _root_.java.io.Reader
import _root_.org.htmlcleaner.{HtmlCleaner,DomSerializer}

class HTMLCleanerFactoryAdapter extends DOMFactoryAdapter
                                with HTMLFactoryAdapter {
  private val cleaner = new HtmlCleaner
  private val props = cleaner.getProperties()
  private val serializer = new DomSerializer(props, true)

  def getDOM(reader: Reader) = {
    val node = cleaner.clean(reader)
    serializer.createDOM(node)
  }
}
Using it
Now you can put this code into a package (I chose to call mine de.hars.scalaxml) and use it to parse some HTML files.
Here is an example session
(sorry for the line length, but the important parts are at the beginning
of the lines):
$ scala -cp build/scalaxml.jar:/usr/share/java/tagsoup-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/nekohtml.jar:/usr/share/java/htmlcleaner2_1.jar:/usr/share/java/xalan2.jar
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import de.hars.scalaxml._
import de.hars.scalaxml._
scala> val url = "http://www.scala-lang.org"
url: java.lang.String = http://www.scala-lang.org
scala> new TagSoupFactoryAdapter load url
res0: scala.xml.Node =
<html xml:lang="en" lang="en">
<head>
<title>The Scala Programming Language</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
<link href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></link>
<link type="image/x-icon" href="/sites/default/files/favicon.gif" rel="shortcut icon"></link>
<link href="/sites/...
scala> new NekoHTMLFactoryAdapter load url
res1: scala.xml.Node =
<HTML xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
<TITLE>The Scala Programming Language</TITLE>
<META content="text/html; charset=utf-8" http-equiv="Content-Type"></META>
<LINK href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></LINK>
<LINK type="image/x-icon" href="/sites/default/files/favicon.gif"...
scala> new HTMLCleanerFactoryAdapter load url
res2: scala.xml.Node =
<html>
<head>
<title>The Scala Programming Language</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<link type="application/rss+xml" title="Front page feed" rel="alternate" href="/rss.xml"></link>
<link type="image/x-icon" rel="shortcut icon" href="/sites/default/files/favicon.gif"></link>
<link type="text/css" rel="stylesheet" medi...
One thing that is quite obvious is that the default configurations are problematic. TagSoup and HTMLCleaner seem to have some problems with namespaces, and nekoHTML turns every tag into uppercase. So all of them have problems with modern pages that are already XML. Loading a page like Sam Ruby's, which contains SVG, gives suboptimal results. Where the page contains
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100" viewBox="0 0 100 100">
<path d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z" fill="#47b"/>
<path d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20" fill="#47b"/>
</svg>
TagSoup returns
<svg viewbox="0 0 100 100" height="100" width="100">
<path fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z"></path>
<path fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20"></path>
</svg>
nekoHTML returns (the duplicated xmlns is a bug in scala's xml library)
<SVG viewbox="0 0 100 100" height="100" width="100" xmlns="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/1999/xhtml">
<PATH fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z"></PATH>
<PATH fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l2..."></PATH>
</SVG>
and HTMLCleaner returns
<svg width="100" viewbox="0 0 100 100" height="100">
<path fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z">
<path fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20">
</path></path></svg>
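At least the uppercase element names from nekoHTML look like a matter of configuration rather than a fundamental problem. NekoHTML documents a names/elems property that controls the case of element names. The sketch below sets it to "lower" before handing out the reader. NekoConfig and lowercaseReader are hypothetical names, and I have not checked the effect on the SVG example above, so treat it as a starting point.

```scala
import org.xml.sax.XMLReader
import org.cyberneko.html.parsers.SAXParser

// Hypothetical helper; needs nekohtml and xercesImpl on the classpath.
object NekoConfig {
  // Property name taken from the NekoHTML settings documentation.
  val ElementNames = "http://cyberneko.org/html/properties/names/elems"

  /** Returns a nekoHTML reader that reports element names in lowercase. */
  def lowercaseReader(): XMLReader = {
    val reader = new SAXParser
    reader.setProperty(ElementNames, "lower")
    reader
  }
}
```

A getReader implementation in a subclass of SAXFactoryAdapter could then simply return NekoConfig.lowercaseReader() instead of a fresh SAXParser.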
The code
Here is the source code for 2.7 as a tar.gz (you will probably have to change some paths in the build.xml).