XML Schema Inference

XML is a flexible markup language often used for data exchange. Deephaven can infer table schemas from XML files, enabling structured ingestion from a variety of XML data sources.

XML by itself does not provide or guarantee a particular layout to the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. More complex XML data feeds, including those whose data must be imported to multiple tables in a single operation, require custom importers specific to the data format encapsulated within the XML (e.g., FIXML).
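
For illustration, a simple record-per-element layout that the generic importer handles well might look like this (names and values are made up):

<Records>
  <Record Sym="AAPL" Price="10.25" Qty="100" />
  <Record Sym="MSFT" Price="31.5" Qty="200" />
</Records>

By contrast, a feed in which records nest child records under parent elements (for example, orders grouped under customers) is hierarchical and typically calls for a custom importer.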

Although XML documents can include metadata, there is no standard for how column data types and other such information should be represented. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files: to find column data types, the schema generation tool analyzes all of the data it finds in a sample data file. If named (rather than positional) values are used, the schema generator also scans all importable elements to find the column names needed for the table.
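
Applied to the illustrative sample above, the generator would scan every Record element, take the union of the attribute names as column names, and choose a type wide enough for all observed values. A plausible (hypothetical) result:

Sym   -> String
Price -> double
Qty   -> long (or a smaller integer type when --bestFit is used, depending on the observed range)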

When to use XML schema inference

Use XML schema inference when:

  • Ingesting tabular or record-based data from XML files.
  • Working with simple XML structures (one element per record).
  • Automating schema creation for new XML data sources.

For complex, hierarchical XML, consider a custom importer instead.

Input requirements

  • XML file should contain one element per record for best results.
  • Use element values or attributes for data fields (see the --useElementValues and --useAttributeValues arguments below).
  • For complex or hierarchical XML, custom importers may be required.
  • Ensure your XML is well-formed and UTF-8 encoded.

Example

The following command generates a schema from an XML file, creating XMLExampleNamespace.XMLExampleTableName.schema in the /tmp directory:

iris_exec xml_schema_creator -- --namespace XMLExampleNamespace --tableName XMLExampleTableName --sourceFile /data/sample.xml --elementType Record --useAttributeValues --schemaPath /tmp

Where:

  • --namespace: The namespace for the new schema.
  • --tableName: The name of the table to create.
  • --sourceFile: Path to your XML file.
  • --elementType: The XML element type that represents a record.
  • --useAttributeValues: Take column values from the attributes of each record element.
  • --schemaPath: Output directory for the generated schema file.
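
For reference, a /data/sample.xml compatible with the flags above might look like the following (contents are hypothetical):

<Root>
  <Record ID="XYZ" Price="10.25" />
  <Record ID="ABC" Price="11.75" />
</Root>

Because --elementType is Record and --useAttributeValues is set, the attribute names ID and Price would become the inferred column names.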

Command reference

iris_exec xml_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the XML schema creator:

  • -ns or --namespace <namespace>: (Required) The namespace to use for the new schema.
  • -tn or --tableName <name>: (Required) The table name to use for the new schema.
  • -sp or --schemaPath <path>: An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
  • -sf or --sourceFile <file name or file path and name>: (Required) The name of the XML file to read.
  • -xi or --startIndex: Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found.
  • -xd or --startDepth: Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import.
  • -xm or --maxDepth: Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name.
  • -xt or --elementType: The name or path of the element that will contain data elements. This is case-sensitive.
  • -ev or --useElementValues: Indicates that field values will be taken from element values; e.g., <Price>10.25</Price>.
  • -av or --useAttributeValues: Indicates that field values will be taken from attribute values; e.g., <Record ID="XYZ" Price="10.25" />.
  • -pv or --namedValues: Positional values. When omitted, field values within the document will be named; e.g., a value called Price might be contained in an element named Price, or an attribute named Price. When this option is included, field names (column names) will be taken from the table schema, and the data values will be parsed into them by matching the position of the value with the position of the column in the schema.
  • -pc or --partitionColumn: Optional name for the partitioning column if the schema is being generated. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
  • -gc or --groupingColumn: Optional column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
  • -spc or --sourcePartitionColumn: Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name).
  • -sn or --sourceName <name for ImportSource block>: Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisXML" will be used.
  • -sl or --skipHeaderLines <integer value>: Optional number of lines to skip from the beginning of the file before expecting the header row. If not provided, the first line is used as the header row.
  • -fl or --setSkipFooterLines <integer value>: Optional number of footer lines to skip from the end of the file.
  • -lp or --logProgress: If present, additional informational logging will be provided with progress updates during the parsing process.
  • -bf or --bestFit: If present, the schema creator will attempt to use the smallest numeric types that fit the data in the XML. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the type will move to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating point columns.
  • -om or --outputMode: Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to REPLACE, a pre-existing file will be overwritten.
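
To make the value-source options concrete, the same record could be represented in any of the following shapes (all illustrative).

With -av / useAttributeValues, values are read from attributes:

<Record ID="XYZ" Price="10.25" />

With -ev / useElementValues, values are read from child element values:

<Record>
  <ID>XYZ</ID>
  <Price>10.25</Price>
</Record>

With -pv / namedValues, element and attribute names no longer matter; values are matched to schema columns by position. One possible (hypothetical) positional layout:

<Record>
  <Value>XYZ</Value>
  <Value>10.25</Value>
</Record>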

Troubleshooting

  • Malformed XML: Ensure your XML file is well-formed and valid. Use an XML validator if needed.
  • No records detected: Check that the --elementType argument matches the record element name in your XML structure; note that it is case-sensitive.
  • Missing columns: Values may live in element values, attribute values, or both. Review your XML structure and set --useElementValues and/or --useAttributeValues accordingly.
  • Encoding issues: Ensure your XML file is UTF-8 encoded.
  • Complex hierarchy: For deeply nested or multi-table XML, consider writing a custom importer or preprocessing the XML to a simpler structure.
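
As an illustration of the last point, a nested feed can often be flattened to one element per record before import (the structure below is hypothetical):

<Orders>
  <Customer ID="C1">
    <Order Sym="AAPL" Price="10.25" />
    <Order Sym="MSFT" Price="31.5" />
  </Customer>
</Orders>

flattened to:

<Orders>
  <Order Customer="C1" Sym="AAPL" Price="10.25" />
  <Order Customer="C1" Sym="MSFT" Price="31.5" />
</Orders>

After flattening, running the schema creator with --elementType Order and --useAttributeValues would infer a single flat table.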