Apache Daffodil | C Code Generator ToDos

Anonymous choice groups not allowed

We handle elements having xs:choice complex types. However, we don’t support anonymous choice groups (that is, an unnamed choice group in the middle, beginning, or end of a sequence which may contain other elements). A DFDL schema author may write a sequence like this:

  <xs:complexType name="NestedUnionType">
    <xs:sequence>
      <xs:element name="first_tag" type="idl:int32"/>
      <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}">
        <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
        <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
      </xs:choice>
      <xs:element name="second_tag" type="idl:int32"/>
      <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}">
        <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
        <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>

Daffodil will parse and unparse the above sequence fine, but the C code generator will not generate correct code (no _choice members or unions will be declared for the type). It might be possible to generate C code that looks like this:

typedef struct NestedUnion
{
    InfosetBase _base;
    int32_t     first_tag;
    size_t      _choice_1; // choice of which union field to use
    union
    {
        foo foo;
        bar bar;
    };
    int32_t     second_tag;
    size_t      _choice_2; // choice of which union field to use
    union
    {
        fie fie;
        fum fum;
    };
} NestedUnion;

However, the Daffodil devs have looked at DFDL integration for other systems like Apache Drill, NiFi, Avro, etc., and these systems generally do not allow anonymous choices. Hence, any DFDL schema having anonymous choices doesn’t integrate well with any of these systems unless we generate a child element with a generated name (which makes paths awkward, etc.). Hence, it seems better to say that codegen-c’s DFDL subset doesn’t allow anonymous choices and DFDL schema authors should write their schema like this:

  <xs:complexType name="NestedUnionType">
    <xs:sequence>
      <xs:element name="first_tag" type="idl:int32"/>
      <xs:element name="first_choice">
        <xs:complexType>
          <xs:choice dfdl:choiceDispatchKey="{xs:string(../first_tag)}">
            <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
            <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
          </xs:choice>
        </xs:complexType>
      </xs:element>
      <xs:element name="second_tag" type="idl:int32"/>
      <xs:element name="second_choice">
        <xs:complexType>
          <xs:choice dfdl:choiceDispatchKey="{xs:string(../second_tag)}">
            <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
            <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
          </xs:choice>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

The C code generator will generate _choice members and unions for the first_choice and second_choice elements, and such a schema will integrate better with other systems too.

Replace size_t with choice_t

It has been pointed out that it is actually not obvious whether _choice should be a signed or unsigned type. One thought had been that _choice should be unsigned to avoid cutting the usable range in half and it should be size_t because size_t is the maximum allowable length of any type of C array. However, there are equally compelling reasons why indices should be signed instead of unsigned as well (https://www.quora.com/Why-is-size_t-sometimes-used-instead-of-int-for-declaring-an-array-index-in-C-Is-there-any-difference). There appears to be no One Right Answer what type _choice should have, so defining a choice_t type in only one place will allow us to change our mind if we need to although we still would need to re-evaluate every use of _choice very carefully.

Arrays

We inline an array’s minOccurs and maxOccurs into the array’s parseSelf and unparseSelf functions and throw an error if the count isn’t within these bounds. The reason why we perform these checks is because we inline arrays' maximum size into C structs and we put C structs into static memory to avoid using heap memory which may not be available on all platforms.

In the normal case, Daffodil’s Scala backend parses and unparses unbounded or well-formed arrays without enforcing min/maxOccurs for forensic analysis and easier debugging. If we want the C backend to parse and unparse unbounded arrays and well-formed arrays, we still can inline min/maxOccurs into the generated C code without enforcing their bounds. If maxOccurs is unbounded or the largest possible array size (maxOccurs - minOccurs) is larger than a tunable, we would allocate storage for the array from the heap instead of inlining the array’s storage into the C struct. If the array is small enough to inline into the C struct but the array needs more space than its inlined space, we can switch that array to heap memory at runtime as long as we track its heap/inline status in another C struct field. We still should keep inlining finite bounded arrays into C structs since some embedded systems will not be able to allocate memory from a heap dynamically.

Making infosets more efficient

Right now all of our C structs (infoset nodes) store an ERD pointer within their first field. This makes it possible to take a pointer to any infoset node and interpret the infoset node correctly in all the ways we need (walk the infoset node, unparse the infoset node to XML, etc.) because we can indirect over to the ERD to get all the static info.

In most cases, the ERD needed for a child complex element is static information of the enclosing parent’s ERD, so could be stored only in the parent’s ERD. Inductively, most infoset nodes should not need ERD pointers since the ERD "nest" up to the root is all static information. Logically, we should be able to remove ERD pointers from the first field of most C structs (infoset nodes), avoiding taking up the first field’s space multiplied by however many infoset nodes the data contains.

We probably just need to find all the places in the code where we pass a pointer to an infoset node and make these places pass both a pointer to an infoset node and a separate pointer to the infoset node’s ERD at the same time. Then we can remove the infoset node’s pointer to the same ERD since it would already be passed into all the places needed.

Remove workaround for problem running sbt (really dev.dirs) from MSYS2 on Windows

We need to open a issue with a reproducible test case in the dev.dirs/directories-jvm project on GitHub. Note that dev.dirs exhibits the problem but they may or may not be responsible for it. Their code which tries to run a Windows PowerShell script using a Java subprocess call hangs when run from MSYS2 on Windows although it works fine when run from CMD on Windows. Then we need to wait until the hanging problem is fixed in the directories library, coursier picks up the new directories version, sbt picks up the new coursier version, and daffodil picks up the new sbt version, before we can remove the "echo >> $GITHUB_ENV" lines from .github/workflows/main.yml which prevent the sbt hanging problem.

Reporting data/schema locations in errors

We have replaced error message strings with error structs everywhere now. However, we may need to expand the error struct to include a pointer (pstate/ustate for data position) and another pointer (ERD or static context object for schema filename/line number).

We also may want to implement error logging variants that both do and don’t humanize the errors, e.g., a hardware/FPGA-type implementation might just output numbers and an external tool might have to "humanize" these numbers using knowledge of the schema and runtime data objects, like an offline log processor does.

Recovering after errors

As we continue to build out codegen-c, we may need to distinguish more types of errors and allow backtracking and retrying. Right now we handle only parse/unparse and validation errors in limited ways. Parse/unparse errors abort the parsing/unparsing and return to the caller immediately without resetting the stream’s position. Validation errors are collected in an array and printed after parsing or unparsing. The only places where there are calls to stop the program are in daffodil_main.c (top-level error handling) and stack.c (empty, overflow, underflow errors which should never happen).

Most of the parse functions set pstate→error only if they couldn’t read data into their buffer due to an I/O error or EOF, which doesn’t seem recoverable to me. Likewise, the unparse functions set ustate→error only if they couldn’t write data from their buffer due to an I/O error, which doesn’t seem recoverable to me.

Only the parse_endian_bool functions set pstate→error if they read an integer which doesn’t match either true_rep or false_rep when an exact match to either is required. If we decide to implement backtracking and retrying, they should call fseek to reset the stream’s position back to where they started reading the integer before they return to their callers. Right now all parse calls are followed by if statements to check for error and return immediately. The code generator would have to generate code which can advance the stream’s position by some byte(s) and try the parse call again as an attempt to resynchronize with a correct data stream after a bunch of failures.

Note that we sometimes run the generated code in an embedded processor and call our own fread/frwrite functions which replace the stdio fread/fwrite functions since the C code runs bare metal without OS functions. We can implement the fseek function on the embedded processor too but we would need a good use case requiring recovering after errors.

No match between choice dispatch key and choice branch keys

Right now c/daffodil is more strict than daffodil when unparsing infoset XML files with no matches (or mismatches) between choice dispatch keys and branch keys. Such a situation always makes c/daffodil exit with an error, which is too strict. We should make c/daffodil load such an XML file without a no match processing error and unparse the infoset to a binary data file without a no match processing error, even if the choiceDispatchKey is invalid. The choiceDispatchKey should not be evaluated at unparse time, only at parse time. If the schema writer wants to enforce that the choiceDispatchKey is the right one matching the unparsed choice branch, the writer must write an explicit dfdl:outputValueCalc expression to replace the choiceDispatchKey even though supporting dfdl:outputValueCalc in codegen-c is likely a distant goal.

C Code Generator ToDos

Anonymous choice groups not allowed

Replace size_t with choice_t

Arrays

Making infosets more efficient

Javadoc-like tool for C code

Choice dispatch key expressions

Daffodil module/subdirectory names

Remove workaround for problem running sbt (really dev.dirs) from MSYS2 on Windows

Reporting data/schema locations in errors

Recovering after errors

Validate "fixed" values in runtime1 too

No match between choice dispatch key and choice branch keys