Runtime2 ToDos

Overview

We have built an initial DFDL-to-C backend and code generator for Apache Daffodil. Currently the C code generator can support binary boolean, integer, and real numbers, arrays of simple and complex elements, choice groups using dispatch/branch keys, validation of "fixed" attributes, and padding of explicit length complex elements with fill bytes. We plan to continue building out the C code generator until it supports a minimal subset of the DFDL 1.0 specification for embedded devices.

We are using this document to keep track of some changes requested by reviewers so we don’t forget to make these changes. If someone wants to help (which would be appreciated), please let the dev list know in order to avoid duplication.

Report hanging problem running sbt (really dev.dirs) from MSYS2 on Windows

We need to open an issue with a reproducible test case in the dev.dirs/directories-jvm project on GitHub. Note that dev.dirs exhibits the problem but may or may not be responsible for it. Its code, which tries to run a Windows PowerShell script using a Java subprocess call, hangs when run from MSYS2 on Windows although it works fine when run from CMD on Windows. Then we need to wait until the hanging problem is fixed in the directories library, coursier picks up the new directories version, sbt picks up the new coursier version, and daffodil picks up the new sbt version, before we can remove the "echo >> $GITHUB_ENV" lines from .github/workflows/main.yml.

Reporting data/schema locations in errors

We have replaced error message strings with error structs everywhere now. However, we may need to expand the error struct to include a pointer (pstate/ustate for data position) and another pointer (ERD or static context object for schema filename/line number).
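
As a rough illustration of that expansion, here is a minimal sketch. All type, field, and function names below (ErrorSketch, DataPositionSketch, SchemaContextSketch, format_error_sketch) are hypothetical and are not runtime2's actual definitions; the position struct merely stands in for a pointer to pstate/ustate, and the context struct for an ERD or static context object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical stand-in for the schema location an ERD or static
// context object could provide
typedef struct SchemaContextSketch
{
    const char *schema_file; // schema filename defining the element
    int         schema_line; // line number within that schema file
} SchemaContextSketch;

// Hypothetical stand-in for the data position held in pstate/ustate
typedef struct DataPositionSketch
{
    size_t bit_pos0b; // zero-based bit position in the data stream
} DataPositionSketch;

// Hypothetical error struct carrying a code plus the two pointers
typedef struct ErrorSketch
{
    int                        code;     // numeric error code
    const DataPositionSketch  *position; // where in the data (may be NULL)
    const SchemaContextSketch *context;  // where in the schema (may be NULL)
} ErrorSketch;

// Formats the error into buf and returns snprintf's result; missing
// pointers are shown as "?" / 0 instead of being dereferenced
int format_error_sketch(const ErrorSketch *error, char *buf, size_t size)
{
    assert(error != NULL && buf != NULL);
    const char *file = error->context ? error->context->schema_file : "?";
    int         line = error->context ? error->context->schema_line : 0;
    size_t      pos = error->position ? error->position->bit_pos0b : 0;
    return snprintf(buf, size, "error %d at bit %zu (%s:%d)", error->code, pos, file, line);
}
```

Keeping the two pointers optional would let existing call sites that know only the error code keep working while newer code fills in the locations.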

We also may want to implement error logging variants that both do and don’t humanize the errors, e.g., a hardware/FPGA-type implementation might just output numbers and an external tool might have to "humanize" these numbers using knowledge of the schema and runtime data objects, like an offline log processor does.
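
The two logging variants could look something like the sketch below. The error codes, message strings, and function names are all assumptions made for illustration; runtime2's real error codes and messages differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical error codes, invented for this sketch
enum ErrorCodeSketch
{
    ERR_SKETCH_EOF = 1,
    ERR_SKETCH_BOOL = 2,
    ERR_SKETCH_FIXED = 3
};

// Maps a numeric code to text; an offline log processor could hold
// this table instead of the embedded target
static const char *humanize_code_sketch(int code)
{
    switch (code)
    {
    case ERR_SKETCH_EOF:
        return "unexpected end of data";
    case ERR_SKETCH_BOOL:
        return "boolean value matches neither true_rep nor false_rep";
    case ERR_SKETCH_FIXED:
        return "value does not match fixed attribute";
    default:
        return "unknown error";
    }
}

// Raw variant: writes only numbers (code and one numeric argument)
int log_raw_sketch(char *buf, size_t size, int code, long arg)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%d %ld", code, arg);
}

// Humanized variant: writes the message text followed by the argument
int log_humanized_sketch(char *buf, size_t size, int code, long arg)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%s (%ld)", humanize_code_sketch(code), arg);
}
```

With this split, a constrained target would need only log_raw_sketch, and the string table could live entirely in the external tool.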

Recovering after errors

As we continue to build out runtime2, we may need to distinguish more types of errors and allow backtracking and retrying. Right now we handle only parse/unparse and validation errors in limited ways. Parse/unparse errors abort the parsing/unparsing and return to the caller immediately without resetting the stream’s position. Validation errors are collected in an array and printed after parsing or unparsing. The only places where there are calls to stop the program are in daffodil_main.c (top-level error handling) and stack.c (empty, overflow, underflow errors which should never happen).

Most of the parse functions set pstate->error only if they couldn’t read data into their buffer due to an I/O error or EOF, which doesn’t seem recoverable to me. Likewise, the unparse functions set ustate->error only if they couldn’t write data from their buffer due to an I/O error, which doesn’t seem recoverable either.

Only the parse_endian_bool functions set pstate->error if they read an integer which doesn’t match either true_rep or false_rep when an exact match to either is required. If we decide to implement backtracking and retrying, they should call fseek to reset the stream’s position back to where they started reading the integer before they return to their callers. Right now all parse calls are followed by if statements to check for error and return immediately. The code generator would have to generate code which can advance the stream’s position by some byte(s) and try the parse call again as an attempt to resynchronize with a correct data stream after a bunch of failures.
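
A minimal sketch of that reset-and-retry idea follows. It is simplified to a one-byte boolean and uses the stdio functions; parse_bool8_sketch and resync_bool8_sketch are hypothetical names, not the real parse_endian_bool functions, and the skip-one-byte policy is just one possible resynchronization strategy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical simplified parse of a one-byte boolean. Returns true on
// success; on a mismatch or short read it restores the stream position
// to where it started and returns false.
bool parse_bool8_sketch(FILE *stream, uint8_t true_rep, uint8_t false_rep, bool *value)
{
    assert(stream != NULL && value != NULL);
    long start = ftell(stream);
    if (start < 0)
    {
        return false; // stream is not seekable, so we cannot backtrack
    }
    uint8_t byte = 0;
    if (fread(&byte, 1, 1, stream) == 1)
    {
        if (byte == true_rep)
        {
            *value = true;
            return true;
        }
        if (byte == false_rep)
        {
            *value = false;
            return true;
        }
    }
    fseek(stream, start, SEEK_SET); // reset before returning the failure
    return false;
}

// Retries the parse, skipping one byte after each failure, for at most
// max_skip skipped bytes. Returns the number of bytes skipped before a
// successful parse, or -1 if no attempt succeeded.
long resync_bool8_sketch(FILE *stream, uint8_t true_rep, uint8_t false_rep, long max_skip,
                         bool *value)
{
    for (long skipped = 0; skipped <= max_skip; skipped++)
    {
        if (parse_bool8_sketch(stream, true_rep, false_rep, value))
        {
            return skipped;
        }
        if (fseek(stream, 1, SEEK_CUR) != 0)
        {
            break;
        }
    }
    return -1;
}
```

Because the failed parse always leaves the stream where it found it, the caller is free to choose between retrying at the same position with a different branch or skipping ahead as shown.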

Note that we actually run the generated code in an embedded processor and call our own fread/fwrite functions which replace the stdio fread/fwrite functions since the C code runs bare metal without OS functions. We can implement fseek but we should have a good use case.

Javadoc-like tool for C code

We should consider adopting one of the javadoc-like tools for C code and structuring our comments that way.
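
Doxygen is one such tool, and it accepts Javadoc-style comment blocks. The sketch below shows what a comment structured that way could look like; the function itself is a made-up example and is not taken from the runtime2 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/**
 * Reads a big-endian 32-bit unsigned integer from a byte buffer.
 *
 * @param bytes  buffer holding at least four bytes; must not be NULL
 * @return       the decoded value in host byte order
 */
uint32_t read_be_uint32_sketch(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
}
```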

Validate "fixed" values in runtime1 too

If we change runtime1 to validate "fixed" values like runtime2 does, then we can resolve DAFFODIL-117.

Improve TDML Runner

We want to improve the TDML Runner to make it easier to run TDML tests with both runtime1 and runtime2. We want to eliminate the need to configure a daf:tdmlImplementation tunable in the TDML test using 12 lines of code.

I had an initial idea which was that the TDML Runner could run both runtime1 and runtime2 automatically (in parallel or serially) if it sees a TDML root attribute saying defaultImplementations="daffodil daffodil-runtime2" or a parserTestCase/unparserTestCase attribute saying implementations="daffodil daffodil-runtime2". To make running the same test on runtime1/runtime2 easier we also could add an implementation attribute to tdml:errors/warnings elements saying which implementation they are for and tell the TDML Runner to check errors/warnings for runtime2 as well as runtime1.

Then I had another idea which might be easier to implement. If we could find a way to set Daffodil’s tdmlImplementation tunable using a command line option or environment variable or some other way to change TDML Runner’s behavior when running both "sbt test" and "daffodil test" then we could simply run "sbt test" or "daffodil test" twice (first using runtime1 and then using runtime2) in order to verify all the cross tests work on both. I think this way would be easier than making TDML Runner automatically run all the implementations it can find in parallel or serially when running cross tests.

If the second idea works as I hope it does, then we can start the process of adding "daffodil-runtime2" to some of the cross tests we have for daffodil and ibm. We also could change ibm’s ProcessFactory class to have a different name than daffodil’s ProcessFactory class and update TDML Runner’s match expression to use the new class name. Then some developers could add the ibmDFDLCrossTester plugin to their daffodil checkout permanently instead of having to do and undo that change each time they want to run daffodil/ibm cross tests.

C struct/field name collisions

To avoid possible name collisions, we should prepend struct names and field names with namespace prefixes if their infoset elements have non-null namespace prefixes. Alternatively, we may need to use enclosing elements' names as prefixes to avoid name collisions without namespaces.

Anonymous/multiple choice groups

We already handle elements having xs:choice complex types. In addition, we should support anonymous/multiple choice groups. We may need to refine the choice runtime structure in order to allow multiple choice groups to be inlined into parent elements. Here is an example schema and corresponding C code to demonstrate:

  <xs:complexType name="NestedUnionType">
    <xs:sequence>
      <xs:element name="first_tag" type="idl:int32"/>
      <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}">
        <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
        <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
      </xs:choice>
      <xs:element name="second_tag" type="idl:int32"/>
      <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}">
        <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
        <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>

typedef struct NestedUnion
{
    InfosetBase _base;
    int32_t     first_tag;
    size_t      _choice_1; // choice of which union field to use
    union
    {
        foo foo;
        bar bar;
    };
    int32_t     second_tag;
    size_t      _choice_2; // choice of which union field to use
    union
    {
        fie fie;
        fum fum;
    };
} NestedUnion;
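
To show how the _choice_1 field could be filled in for the schema above, here is a minimal sketch of a dispatch function for the first choice group. The function name and the NO_CHOICE_SKETCH sentinel are hypothetical; the mapping of keys 1 and 2 to foo and keys 3 and 4 to bar comes from the dfdl:choiceBranchKey attributes in the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sentinel meaning that no branch key matched
#define NO_CHOICE_SKETCH ((size_t)-1)

// Maps first_tag to the index of the union member to use:
// 0 selects foo (branch keys "1 2"), 1 selects bar (branch keys "3 4")
size_t init_choice_1_sketch(int32_t first_tag)
{
    switch (first_tag)
    {
    case 1:
    case 2:
        return 0; // foo
    case 3:
    case 4:
        return 1; // bar
    default:
        return NO_CHOICE_SKETCH; // caller would report a processing error
    }
}
```

The second choice group would get its own function of the same shape, keyed on second_tag and storing its result in _choice_2.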

Choice dispatch key expressions

We currently support only a very restricted and simple subset of choice dispatch key expressions. We would like to refactor the DPath expression compiler and make it generate C code in order to support arbitrary choice dispatch key expressions.

No match between choice dispatch key and choice branch keys

Right now c-daffodil is more strict than scala-daffodil when unparsing infoset XML files with no matches (or mismatches) between choice dispatch keys and branch keys. Perhaps c-daffodil should load such an XML file without a no match processing error and unparse the infoset to a binary data file without a no match processing error. We would have to code and call a choice branch resolver in C which peeks at the next XML element, figures out which branch that element indicates exists inside the choice group, and initializes the choice and element runtime data (_choice and childNode->erd member fields) accordingly. We probably would replace the initChoice() call in walkInfosetNode() with a call to that choice branch resolver and we might not need to call initChoice() in unparseSelf(). When I called initChoice() in all these parse, walk, and unparse places, I was pondering removing the _choice member field and calling initChoice() as a function to tell us which element to visit next, but we probably should have a mutable choice runtime data structure that applications can override if they want to.
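
The core lookup of such a resolver might be sketched as follows. The function name, its parameters, and the NO_BRANCH_SKETCH sentinel are hypothetical; a real resolver would get the element name by peeking at the XML reader and would then set _choice and childNode->erd from the returned index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical sentinel meaning the element belongs to no branch
#define NO_BRANCH_SKETCH ((size_t)-1)

// Given the name of the next XML element and the element names of the
// choice group's branches, returns the index of the matching branch or
// NO_BRANCH_SKETCH if no branch has that name
size_t resolve_choice_branch_sketch(const char *next_element, const char *const *branch_names,
                                    size_t num_branches)
{
    assert(next_element != NULL && branch_names != NULL);
    for (size_t i = 0; i < num_branches; i++)
    {
        if (strcmp(next_element, branch_names[i]) == 0)
        {
            return i;
        }
    }
    return NO_BRANCH_SKETCH;
}
```

Resolving the branch from the element name rather than from the dispatch key is what would let an infoset with mismatched keys still load and unparse.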

Floating point numbers

Right now runtime2 prints floating point numbers in XML infosets slightly differently than runtime1 does. This means we may need to use different XML infosets in TDML tests depending on the runtime implementation. In order to use the same XML infoset in TDML tests, we should make the TDML Runner compare floating point numbers numerically, not textually, as discussed in DAFFODIL-2402.
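
The numeric comparison could work along the lines of the sketch below. It is written in C only to illustrate the idea (the TDML Runner itself is not C), and the function name and the relative-tolerance parameter are assumptions; the comparison rule actually chosen for the TDML Runner may differ.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

// Absolute value without needing the math library
static double abs_sketch(double x)
{
    return x < 0 ? -x : x;
}

// Returns true if both texts parse completely as numbers and the two
// numbers are equal within the given relative tolerance
bool floats_equal_numerically_sketch(const char *left, const char *right, double rel_tol)
{
    assert(left != NULL && right != NULL);
    char  *end_left = NULL;
    char  *end_right = NULL;
    double a = strtod(left, &end_left);
    double b = strtod(right, &end_right);
    if (end_left == left || *end_left != '\0' || end_right == right || *end_right != '\0')
    {
        return false; // at least one text is not entirely a number
    }
    if (a != a || b != b)
    {
        return (a != a) && (b != b); // NaN equals only NaN in this sketch
    }
    if (a == b)
    {
        return true; // exact matches, including equal infinities
    }
    if (abs_sketch(a) > DBL_MAX || abs_sketch(b) > DBL_MAX)
    {
        return false; // an infinity never matches a different value
    }
    double scale = abs_sketch(a) > abs_sketch(b) ? abs_sketch(a) : abs_sketch(b);
    return abs_sketch(a - b) <= rel_tol * scale;
}
```

With a comparison like this, "1.0" and "1.00000" compare equal even though their text differs, so the same expected infoset could serve both runtimes.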

Arrays

Instead of expanding arrays inline within childrenERDs, we may want to store a single entry for an array in childrenERDs giving the array’s offset and size of all its elements. We would have to write code for special case treatment of array member fields versus scalar member fields but we could save space/memory in childrenERDs for use cases with very large arrays. An array element’s ERD should have minOccurs and maxOccurs where minOccurs is unsigned and maxOccurs is signed with -1 meaning "unbounded". The actual number of children in an array instance would have to be stored with the array instance in the C struct or the ERD. An array node has to be a different kind of infoset node with a place for this number of actual children to be stored. Probably all ERDs should just get minOccurs and maxOccurs: a scalar is just one with 1, 1 as those values, an optional element is 0, 1, and an array is all other legal combinations like N, -1 and N, M with N <= M. A restriction that minOccurs is 0, 1, or equal to maxOccurs (which is not -1) is acceptable. A restriction that maxOccurs is 1, -1, or equal to minOccurs is also fine (it means variable-length arrays always have an unbounded number of elements).
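
The scalar/optional/array distinction described above can be sketched as a small classification function. The struct, enum, and function names are hypothetical, and rejecting a maxOccurs of 0 is an assumption of this sketch rather than something decided above.

```c
#include <assert.h>
#include <stddef.h>

// -1 in maxOccurs means "unbounded"
#define UNBOUNDED_SKETCH (-1L)

// Hypothetical pair of occurrence bounds that every ERD would carry
typedef struct OccursSketch
{
    size_t minOccurs; // unsigned
    long   maxOccurs; // signed, UNBOUNDED_SKETCH for "unbounded"
} OccursSketch;

typedef enum OccursKindSketch
{
    KIND_INVALID_SKETCH,
    KIND_SCALAR_SKETCH,   // 1, 1
    KIND_OPTIONAL_SKETCH, // 0, 1
    KIND_ARRAY_SKETCH     // N, -1 or N, M with N <= M
} OccursKindSketch;

// Classifies an element by its occurrence bounds
OccursKindSketch classify_occurs_sketch(OccursSketch occurs)
{
    if (occurs.maxOccurs == UNBOUNDED_SKETCH)
    {
        return KIND_ARRAY_SKETCH; // N, -1
    }
    if (occurs.maxOccurs < 1)
    {
        return KIND_INVALID_SKETCH; // assumption: 0 and other negatives rejected
    }
    if (occurs.minOccurs > (size_t)occurs.maxOccurs)
    {
        return KIND_INVALID_SKETCH; // violates N <= M
    }
    if (occurs.maxOccurs == 1)
    {
        return occurs.minOccurs == 1 ? KIND_SCALAR_SKETCH : KIND_OPTIONAL_SKETCH;
    }
    return KIND_ARRAY_SKETCH; // N, M with N <= M and M > 1
}
```

Code walking childrenERDs could switch on this result to decide whether a member field needs the special array treatment or the ordinary scalar treatment.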

Daffodil module/subdirectory names

When Daffodil is ready to move from a 3.x to a 4.x release, rename the modules to have shorter and easier to understand names as discussed in DAFFODIL-2406.