CSS4J
Resolver

XML parsing in Java with DefaultEntityResolver

Overview

XML parsing should be done in a way that avoids XXE security vulnerabilities. For the Java™ language, the advice found on the Internet is generally based on applying at least one of the following (see for example OWASP's XML External Entity Prevention Cheat Sheet):

  1. Disabling the http://apache.org/xml/features/nonvalidating/load-external-dtd feature. This results in the loss of any XML character entities that the document contains, like "&eacute;". (Note: predefined entities like "&amp;" are not affected.)
  2. Enabling the feature http://apache.org/xml/features/disallow-doctype-decl, which throws an error if the parsed document contains a DOCTYPE declaration. Since many documents contain DOCTYPE declarations, this prevents a lot of documents from being parsed.
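
As a minimal sketch of how the two options above are usually applied, the following class sets each feature on a SAXParserFactory. The class and method names (XercesFeatureDemo, readerWithoutExternalDtd, readerWithoutDoctype, parses) are made up for this illustration; only the two feature URIs come from the text. It assumes a Xerces-based parser such as the one bundled with the JDK; other parsers may reject the features with a SAXNotRecognizedException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XercesFeatureDemo {

    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";
    static final String DISALLOW_DOCTYPE =
        "http://apache.org/xml/features/disallow-doctype-decl";

    /** Option 1: never fetch the external DTD (entities declared there are lost). */
    static XMLReader readerWithoutExternalDtd() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newSAXParser().getXMLReader();
    }

    /** Option 2: reject any document that contains a DOCTYPE declaration. */
    static XMLReader readerWithoutDoctype() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(DISALLOW_DOCTYPE, true);
        return factory.newSAXParser().getXMLReader();
    }

    /** Returns true if the reader parses the XML text without a SAX error. */
    static boolean parses(XMLReader reader, String xml) throws IOException {
        reader.setContentHandler(new DefaultHandler());
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

With option 2, a call such as parses(readerWithoutDoctype(), "<!DOCTYPE doc><doc/>") is expected to return false, which shows why that setting is unsuitable for documents that carry a DOCTYPE.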

Those two workarounds assume that your XML parser is based on Apache Xerces2, although other parsers are sometimes still in use (for example variants of the Ælfred XML Parser) in which case you cannot apply them.

Many web pages explain how to apply the above configurations, yet none warns about the very real possibility of data loss with (1.): the entire entity is silently dropped. If your use case involves a Xerces-based parser and you are completely sure that none of your documents contains XML entities, then you could apply (1.); and if you only care about documents without a DOCTYPE, you could use (2.). Otherwise, you may want to look for an alternative like the one described here.

DefaultEntityResolver

The xml-dtd project (a small code base that does not require the main CSS4J library) provides the DefaultEntityResolver class, which you can use to parse your document without losing your XML entities.

The resolver alone cannot protect your XML parser from XML entity expansion attacks, so, as shown later, you have to use a parser with FEATURE_SECURE_PROCESSING enabled. Once that is done, DefaultEntityResolver can filter other threats, such as access to local resources or jar: decompression bombs like:

<!DOCTYPE doc PUBLIC "-//W3C//DTD FOO 1.0//EN" "jar:http://www.example.com/evil.jar!/file.dtd">

By default, DefaultEntityResolver is configured not to attempt network connections and to use its own set of pre-loaded DTDs instead. If you are using a customized DTD from a specific host, you can whitelist that host so that connections to it are allowed (although even then, if the resolver decides that the connection does not appear to point to a legitimate DTD, it will disallow it). You can also subclass the resolver and allow loading specific DTD files from the classpath.
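
To make the filtering idea concrete, here is a rough sketch of a host allow-list built only on the standard org.xml.sax.EntityResolver interface. It is not the DefaultEntityResolver implementation and does not use its API: the class AllowListResolver and its isAllowed method are hypothetical, and the real resolver does considerably more (for example, serving pre-loaded DTDs). The sketch only shows how jar: and file: system identifiers, or hosts that were not whitelisted, could be refused before the parser opens a connection.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical allow-list resolver; NOT how DefaultEntityResolver is implemented. */
public class AllowListResolver implements EntityResolver {

    private final Set<String> allowedHosts;

    public AllowListResolver(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Decides whether a system identifier may be fetched over the network. */
    public boolean isAllowed(String systemId) {
        if (systemId == null) {
            return false;
        }
        URI uri;
        try {
            uri = new URI(systemId);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        // Only plain http(s) URLs qualify: file:, jar: and relative paths are refused
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false;
        }
        String host = uri.getHost();
        return host != null && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException {
        if (!isAllowed(systemId)) {
            throw new SAXException("Refusing to load external entity: " + systemId);
        }
        // Returning null lets the parser open the allowed URL itself
        return null;
    }
}
```

In this sketch the jar: URL from the DOCTYPE shown earlier is rejected because its scheme is neither http nor https, regardless of which hosts are on the list.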

Please read the DefaultEntityResolver javadoc for more information about using the resolver.


How to apply it

Before using the resolver, you must first protect your XML parser against DoS attacks based on entity expansion/recursion by setting the FEATURE_SECURE_PROCESSING feature (see SAXParserFactory.setFeature(String, boolean)), as the following example does:

import io.sf.carte.doc.xml.dtd.DefaultEntityResolver;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Obtain and configure a SAXParserFactory
SAXParserFactory parserFactory = SAXParserFactory.newInstance();
try {
    parserFactory.setFeature(javax.xml.XMLConstants.FEATURE_SECURE_PROCESSING, true);
} catch (SAXNotRecognizedException | SAXNotSupportedException e) {
    // Beware: old parsers do not recognize FEATURE_SECURE_PROCESSING!
    throw new IllegalStateException(e);
}

// Obtain the SAXParser and the XMLReader
javax.xml.parsers.SAXParser parser = parserFactory.newSAXParser();

org.xml.sax.XMLReader reader = parser.getXMLReader();

// Set the EntityResolver
DefaultEntityResolver resolver = new DefaultEntityResolver();
reader.setEntityResolver(resolver);

Then you can proceed and parse your document with that XMLReader.
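
As an illustration of that last step, the sketch below feeds a document to an already configured XMLReader and collects the element names it encounters. The class ReaderParseDemo and the method elementNames are hypothetical names chosen for this example; the handler is a plain org.xml.sax.helpers.DefaultHandler, and any ContentHandler of your own would be attached the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderParseDemo {

    /** Parses the XML text with the given reader; returns element names in document order. */
    static List<String> elementNames(XMLReader reader, String xml)
            throws IOException, SAXException {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes atts) {
                names.add(qName);
            }
        };
        // The same object receives content events and fatal parse errors
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try (StringReader in = new StringReader(xml)) {
            reader.parse(new InputSource(in));
        }
        return names;
    }
}
```

The reader passed in would be the one obtained above, with FEATURE_SECURE_PROCESSING enabled on its factory and the entity resolver already set.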


Usage with XMLDocumentBuilder

Using your own XMLReader to parse XML can be complicated; to simplify the process, you may want to use CSS4J's XMLDocumentBuilder. In that case, use the following snippet instead of the one above:

import io.sf.carte.doc.dom.XMLDocumentBuilder;
import io.sf.carte.doc.xml.dtd.DefaultEntityResolver;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.xml.sax.InputSource;

/*
 * Obtain and configure a DOMImplementation and a DocumentBuilder
 */
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementation domImpl = registry.getDOMImplementation("XML 3.0");
XMLDocumentBuilder builder = new XMLDocumentBuilder(domImpl);

// We generally do not want element content whitespace
builder.setIgnoreElementContentWhitespace(true);

// Set the EntityResolver
DefaultEntityResolver resolver = new DefaultEntityResolver();
builder.setEntityResolver(resolver);

// Parse the document
java.io.Reader re = ... [obtain the document]
InputSource source = new InputSource(re);
Document document = builder.parse(source);
re.close();

Note that XMLDocumentBuilder sets FEATURE_SECURE_PROCESSING by default.