Daffodil and the DFDL Infoset

Daffodil is an implementation of DFDL that can represent the DFDL Infoset in several ways, including various XML representations and JSON. These representations are not identical to the DFDL Infoset: Daffodil approximates the infoset using a subset of the features of XML or JSON. The tables below describe how Daffodil maps each DFDL Infoset item to the supported representations.

JDOM (org.jdom)

| Infoset Item | Representation |
|---|---|
| Document Information Item | org.jdom.Document |
| root | org.jdom.Element getRootElement() |
| dfdlVersion | not yet implemented |
| schema (reserved for future use) | not yet implemented |
| unicodeByteOrderMark | not yet implemented |
| Element Information Item | org.jdom.Element |
| namespace | org.jdom.Namespace getNamespace() |
| name | String getName() |
| document | org.jdom.Document getDocument() |
| datatype | not yet implemented |
| dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by String getText(). See XML Illegal Characters for xs:string types containing XML illegal characters. |
| nilled | The "nilled" attribute in the "xsi" namespace. |
| children | java.util.List<Element> getChildren() |
| parent | org.jdom.Parent getParent() |
| schema | not yet implemented |
| valid | not yet implemented |
| unionMemberSchema | not yet implemented |
| "No Value" | An org.jdom.Element with no children (not even Text nodes) is the representation of an element with "no value". |
| Augmented Infoset | not yet implemented |


W3C DOM (org.w3c.dom)

| Infoset Item | Representation |
|---|---|
| Document Information Item | org.w3c.dom.Document |
| root | org.w3c.dom.Node getFirstChild() |
| dfdlVersion | not yet implemented |
| schema (reserved for future use) | not yet implemented |
| unicodeByteOrderMark | not yet implemented |
| Element Information Item | org.w3c.dom.Element |
| namespace | String getNamespaceURI() |
| name | String getNodeName() if getNamespaceURI() == null, String getLocalName() otherwise |
| document | org.w3c.dom.Document getOwnerDocument() |
| datatype | not yet implemented |
| dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by String getWholeText(). See XML Illegal Characters for xs:string types containing XML illegal characters. |
| nilled | The "nilled" attribute in the "xsi" namespace. |
| children | org.w3c.dom.NodeList getChildNodes() |
| parent | org.w3c.dom.Node getParentNode() |
| schema | not yet implemented |
| valid | not yet implemented |
| unionMemberSchema | not yet implemented |
| "No Value" | An org.w3c.dom.Element with no children (not even Text nodes) is the representation of an element with "no value". |
| Augmented Infoset | not yet implemented |

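As a rough illustration of the org.w3c.dom mapping, the sketch below reads a few infoset items from a DOM tree using only JDK classes. The class name DomInfosetDemo and its helper methods are hypothetical, not part of Daffodil. It also assumes the nil flag is carried by the standard xsi:nil attribute with the value "true", and that the document has no comments or processing instructions ahead of the root element, so getFirstChild() returns the root.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not a Daffodil API.
final class DomInfosetDemo {
    // Assumption: the nil flag is the standard xsi:nil attribute.
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    private DomInfosetDemo() {}

    /** Parses XML text into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** root: the first child of the document (assumes no leading comments or PIs). */
    static Element root(Document doc) {
        return (Element) doc.getFirstChild();
    }

    /** name: getNodeName() when there is no namespace, getLocalName() otherwise. */
    static String name(Element e) {
        return e.getNamespaceURI() == null ? e.getNodeName() : e.getLocalName();
    }

    /** dataValue: whole text of the first Text child, or null if there is none. */
    static String dataValue(Element e) {
        Node first = e.getFirstChild();
        return (first instanceof Text) ? ((Text) first).getWholeText() : null;
    }

    /** nilled: true when xsi:nil="true" is present (assumed attribute name). */
    static boolean isNilled(Element e) {
        return "true".equals(e.getAttributeNS(XSI_NS, "nil"));
    }

    /** "No Value": an element with no child nodes at all, not even Text nodes. */
    static boolean hasNoValue(Element e) {
        return !e.hasChildNodes();
    }
}
```

Attributes are not child nodes in the DOM, so a nilled element written as an empty tag also reports "no value" under this check.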

Scala XML (scala.xml)

| Infoset Item | Representation |
|---|---|
| Document Information Item | The document is represented by the root element. There is no separate document item. |
| root | not supported |
| dfdlVersion | not yet implemented |
| schema (reserved for future use) | not yet implemented |
| unicodeByteOrderMark | not yet implemented |
| Element Information Item | scala.xml.Elem |
| namespace | def namespace: String |
| name | def name: String |
| document | not supported |
| datatype | not yet implemented |
| dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by def text: String. See XML Illegal Characters for xs:string types containing XML illegal characters. |
| nilled | The "nilled" attribute in the "xsi" namespace. |
| children | def child: Node* |
| parent | not supported |
| schema | not yet implemented |
| valid | not yet implemented |
| unionMemberSchema | not yet implemented |
| "No Value" | A scala.xml.Elem with no children. |
| Augmented Infoset | not yet implemented |


XML text

| Infoset Item | Representation |
|---|---|
| Document Information Item | The full text is the document. |
| root | The first XML tag in the document. |
| dfdlVersion | not yet implemented |
| schema (reserved for future use) | not yet implemented |
| unicodeByteOrderMark | not yet implemented |
| Element Information Item | An XML tag |
| namespace | Defined using standard XML namespacing (e.g. xmlns="..." and element prefixes) |
| name | XML tag name |
| document | The full text is the document. |
| datatype | not yet implemented |
| dataValue | For simple types other than xs:string, the canonical XML representation of the value inside the opening/closing XML tags. See XML Illegal Characters for xs:string types containing XML illegal characters. |
| nilled | The "nilled" attribute in the "xsi" namespace. |
| children | Child XML tags |
| parent | Parent XML tag |
| schema | not yet implemented |
| valid | not yet implemented |
| unionMemberSchema | not yet implemented |
| "No Value" | An XML tag with no content between the opening and closing tags. |
| Augmented Infoset | not yet implemented |


JSON text

| Infoset Item | Representation |
|---|---|
| Document Information Item | The full text is the document, containing a single JSON object. |
| root | The first (and only) JSON string in the document object. |
| dfdlVersion | not yet implemented |
| schema (reserved for future use) | not yet implemented |
| unicodeByteOrderMark | not yet implemented |
| Element Information Item | The first JSON string in an object. |
| namespace | not supported |
| name | The first JSON string in an object. |
| document | The full text is the document. |
| datatype | not yet implemented |
| dataValue | For simple types other than xs:string, the canonical XML representation of the value inside double quotes. For xs:string types, a JSON escaped string in double quotes. |
| nilled | The value of the element is null. |
| children | Child JSON objects |
| parent | Parent JSON object |
| schema | not yet implemented |
| valid | not yet implemented |
| unionMemberSchema | not yet implemented |
| "No Value" | The value of the element is empty double quotes. |
| Augmented Infoset | not yet implemented |

XML Illegal Characters

DFDL strings can contain characters that are not allowed in XML at all, so for the XML-based representations these characters are mapped into the Unicode Private Use Areas (PUA). This is similar to the scheme used by Microsoft Visio (see https://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all the XML 1.0 illegal characters, including those with 16-bit codepoint values. The mapping is bidirectional: illegal characters are replaced by their legal PUA counterparts when parsing, and the reverse transformation is applied when unparsing. This makes it possible to create data streams containing XML illegal characters from legal XML documents that contain only the corresponding PUA characters.

These are the legal XML characters (for XML v1.0):

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 
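The ranges above translate directly into a membership test. The following is a minimal sketch; the class name XmlLegalChars is hypothetical and is not taken from Daffodil.

```java
// Hypothetical helper, for illustration only; not a Daffodil API.
final class XmlLegalChars {
    private XmlLegalChars() {}

    /** Returns true if the code point is a legal XML 1.0 character. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

For example, XmlLegalChars.isLegal(0x9) is true, while XmlLegalChars.isLegal(0x0) and XmlLegalChars.isLegal(0xFFFE) are both false.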

All other characters are illegal. Illegal characters from #x00 to #x1F (every code point in that range except #x9, #xA, and #xD) are mapped to the PUA by adding #xE000 to their character code. Hence, the NUL (#x0) character code becomes #xE000.

Illegal characters from #xD800 to #xDFFF are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.

Illegal characters #xFFFE and #xFFFF are mapped to the PUA by subtracting #x0F00 from their character code, yielding #xF0FE and #xF0FF respectively.
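The three offset rules for illegal characters can be sketched as a single function over code points. This is an illustration under assumptions, not Daffodil's implementation: the class name PuaRemap is hypothetical, and the sketch walks the string by code point, so a properly paired surrogate is treated as one legal supplementary character and only an unpaired surrogate is remapped. Line-ending handling is not covered here.

```java
// Hypothetical helper, for illustration only; not a Daffodil API.
final class PuaRemap {
    private PuaRemap() {}

    /** Maps an XML 1.0 illegal code point into the PUA; legal ones pass through. */
    static int toPua(int cp) {
        // Control characters other than tab, LF, and CR: add 0xE000.
        if (cp >= 0x00 && cp <= 0x1F && cp != 0x9 && cp != 0xA && cp != 0xD) {
            return cp + 0xE000;
        }
        // Surrogate code points (unpaired, see the string overload): add 0x1000.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return cp + 0x1000;
        }
        // The two non-characters at the end of the BMP: subtract 0x0F00.
        if (cp == 0xFFFE || cp == 0xFFFF) {
            return cp - 0x0F00;
        }
        return cp;
    }

    /** Applies the mapping to every code point of a string. */
    static String toPua(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // codePoints() yields a lone surrogate as its own code point,
        // and a valid surrogate pair as one supplementary code point.
        s.codePoints().forEach(cp -> sb.appendCodePoint(toPua(cp)));
        return sb.toString();
    }
}
```

With this sketch, toPua(0x0) gives 0xE000, toPua(0xD800) gives 0xE800, and toPua(0xFFFF) gives 0xF0FF, matching the examples in the text.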

The legal character #xD (Carriage Return, or CR) is mapped to #xA (Line Feed, or LF). The CR character is allowed in the textual representation of XML documents, but it never survives into the XML Infoset: XML processors read it, then convert CRLF to a single LF and a lone CR to LF. Daffodil is in a sense a different ‘reader’ of data into the XML infoset, so to be consistent with XML it maps both CR and CRLF to LF.
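The line-ending rule amounts to two replacements applied in order. A minimal sketch, with a hypothetical class name:

```java
// Hypothetical helper, for illustration only; not a Daffodil API.
final class LineEndings {
    private LineEndings() {}

    /** Converts CRLF and lone CR to LF, as an XML processor would. */
    static String normalize(String s) {
        // Replace CRLF first so it collapses to one LF, then any remaining lone CR.
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

The order matters: replacing lone CR first would turn each CRLF into two LF characters.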

It is a processing error when parsing if the data stream contains characters in the parts of the PUA that this mapping uses for illegal XML code points. When unparsing, PUA characters such as #xE000 found in infoset string values are mapped back to the corresponding illegal code points (#xE000 becomes #x0, that is, NUL).
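The reverse direction can be sketched in the same style. The class name PuaUnmap is hypothetical. The sketch assumes that the "parts of the PUA used by this mapping" are exactly the images of the illegal characters: #xE000 to #xE01F except #xE009, #xE00A, and #xE00D, then #xE800 to #xEFFF, and #xF0FE and #xF0FF. Daffodil's actual reserved set may differ.

```java
// Hypothetical helper, for illustration only; not a Daffodil API.
final class PuaUnmap {
    private PuaUnmap() {}

    /** True if the code point is a PUA stand-in for an illegal XML character. */
    static boolean isMappedPua(int cp) {
        if (cp >= 0xE000 && cp <= 0xE01F) {
            int original = cp - 0xE000;
            // Tab, LF, and CR are legal, so they are assumed to have no PUA image.
            return original != 0x9 && original != 0xA && original != 0xD;
        }
        return (cp >= 0xE800 && cp <= 0xEFFF) || cp == 0xF0FE || cp == 0xF0FF;
    }

    /** Maps a PUA stand-in back to its illegal code point; others pass through. */
    static int fromPua(int cp) {
        if (!isMappedPua(cp)) {
            return cp;
        }
        if (cp <= 0xE01F) {
            return cp - 0xE000; // control characters
        }
        if (cp <= 0xEFFF) {
            return cp - 0x1000; // surrogate code points
        }
        return cp + 0x0F00;     // 0xFFFE and 0xFFFF
    }

    /** Applies the reverse mapping to every code point of a string. */
    static String fromPua(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(fromPua(cp)));
        return sb.toString();
    }
}
```

A parser-side check could call isMappedPua on incoming data and report a processing error on a match, since such a character could not be told apart from a mapped one after parsing.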

The XML for an infoset can conveniently embed #xE000, or any of the other PUA characters that stand in for illegal characters, by using an XML numeric character reference such as “&#xE000;”. This is turned into the #xE000 code point when the XML document is loaded. When unparsing, Daffodil then maps it to #x0 (NUL).
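To see the loading step in isolation, the sketch below parses a small XML document with the JDK's DOM parser and returns the text of the root element. The class name CharRefDemo and the sample element name are made up for illustration, and no Daffodil API is involved. The subsequent mapping to #x0 on unparse is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not a Daffodil API.
final class CharRefDemo {
    private CharRefDemo() {}

    /** Parses the XML text and returns the text content of its root element. */
    static String loadText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not load XML", e);
        }
    }
}
```

Calling CharRefDemo.loadText("<s>a&#xE000;b</s>") returns a three-character string whose middle character is the code point #xE000.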

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.