xs:anyURI and Property dfdlx:objectKindDaffodil has implemented a DFDL extension that allows data much larger than memory to be manipulated.
A variety of data formats, such as for image and video files, consist of fields of what is effectively metadata, surrounding large blocks of data containing compressed image or video data.
An important use case for DFDL is to expose this metadata for easy use, and to provide access to
the large data via a streaming mechanism akin to opening a file, thereby avoiding
large xs:hexBinary strings in the infoset.
In RDBMS systems, BLOB (Binary Large Object) is the type used when the data row returned from an SQL query will not contain the actual value data, but rather a handle that can be used to open/read/write/close the BLOB.
Daffodil has an analogous BLOB capability. This enables processing of images or video of arbitrary size without the need to ever hold all the data in memory.
This also bypasses the limitation on object size.
xs:anyURI and Property dfdlx:objectKindDFDL is extended to allow simple types to have the xs:anyURI type.
Elements with this type will be treated as BLOB objects.
The dfdlx:objectKind property is added to define what type of object it is.
The valid value for this property is only "bytes" specifying binary large objects.
All other values reserved for future extensions of this feature.
An example of this usage in a DFDL schema may look something like this:
<xs:schema
xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
<xs:element name="data" type="xs:anyURI"
dfdlx:objectKind="bytes"
dfdl:lengthUnits="bytes"
dfdl:length="1024" />
</xs:schema>
The resulting infoset (as XML) will look something like this:
<data>file:///path/to/blob/data</data>
With the 1024 bytes of data being written to a file at location /path/to/blob/data.
The BLOB URI must always use the file scheme and must be absolute.
API calls are used to specify where Daffodil should write the BLOB files.
Two functions are used on the Daffodil InfosetOutputter.
The first API function allows a way to set the properties used when creating BLOB files, including the output directory, and prefix/suffix for the BLOB file.
/**
* Set the attributes for how to create blob files.
*
* @param dir the Path the the directory to create files. If the directory
* does not exist, Daffodil will attempt to create it before
* writing a blob.
* @param prefix the prefix string to be used in generating a blob file name
* @param suffix the suffix string to be used in generating a blob file name
*/
final def setBlobAttributes(directory: Path, prefix: String, suffix: String)
The second API function allows a way for the API user to get a list of
all BLOB files that were created during parse().
/**
* Get the list of blob paths that were output in the infoset.
*
* This is the same as what would be found by iterating over the infoset.
*/
final def getBlobFiles(): Seq[Path]
Note that no changes to the unparse() API are required, since the BLOB URI provides
all the necessary information to retrieve files containing BLOB data.
BLOB files are not automatically deleted. It is the responsibility of the API user to determine when files are no longer needed and remove them.
Any expression access to the data of a BLOB element will result in a Schema Definition Error during schema compilation.
The length of a BLOB element is available since it is very common in
data formats to include both a BLOB payload and the length of that
payload. On unparse, we can calculate the length of the BLOB data so
that the value can be output in a length field in the data. This is
done using the regular dfdl:contentLength() and dfdl:valueLength()
functions.
The TDML language is extended to support the xsi:type="xs:anyURI" annotation on XML data elements.
For example:
<tdml:dfdlInfoset>
<data xsi:type="xs:anyURI">path/to/blob/data</data>
</tdml:dfdlInfoset>
The path provided as the URI value can be, and usually will be, a relative path within the
src/test/resources directory of the DFDL schema project.
During Infoset comparisons the TDML Runner will compare the contents of this file
with the BLOB file in the corresponding element (having type xs:anyURI) of the infoset.
BLOB files created when running the tests are deleted when the test completes.
The CLI supports ad-hoc testing of the use of BLOBs.
BLOBs are written to the directory given by the JVM System Property user.dir into
a subdirectory of it named daffodil-blobs.
If it does not exist, Daffodil will attempt to create the daffodil-blobs directory.
The CLI does not delete any BLOB files.