Import XML Files

This guide demonstrates how to generate a schema from an XML data file, deploy it, and import a file from the command line. These commands assume a typical Deephaven installation and a sample file located at /data/sample.xml with the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Record Name="PI" Value="3.14159"/>
  <Record Name="e" Value="2.71828"/>
  <Record Name="phi" Value="1.61803"/>
</Response>

Deephaven offers robust tools for inferring table schemas from sample data and importing XML files. Unlike simpler formats like CSV, XML's ability to represent nested and hierarchical data introduces additional complexity when mapping to Deephaven tables. The XML importer outlined in this guide supports several common scenarios, including extracting values from attributes or element text and handling varying levels of nesting. However, for more intricate XML structures, a custom importer may be necessary to achieve the desired mapping.

Generate a schema

See the Schema inference page for a guide on generating a schema from an XML file like the one above.
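
The exact schema creator arguments are covered on that page. As a rough sketch only (the xml_schema_creator tool name and flags shown here mirror the importer arguments documented below, and should be verified against the Schema inference page), a typical invocation for the sample file might look like:

sudo -u irisadmin /usr/illumon/latest/bin/iris_exec xml_schema_creator -- \
  -ns XMLExampleNamespace \
  -tn XMLExampleTableName \
  -sf /data/sample.xml \
  -xt Record \
  -av

For the sample file, this would infer a String column (Name) and a double column (Value) from the attributes of the Record elements, and write a schema file such as the /tmp/XMLExampleNamespace.XMLExampleTableName.schema file used in the next step.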

Deploy the schema

sudo -u irisadmin /usr/illumon/latest/bin/dhconfig schema import --file /tmp/XMLExampleNamespace.XMLExampleTableName.schema

See Deephaven Schemas for further details.

Import a single data file

The following command imports the sample XML file into a specified intraday partition. Note that the dhconfig schema import command above deploys the schema; the data import itself uses the iris_exec tool. Run the import as a user with write access to the intraday database (commonly dbmerge in a typical installation), and treat the date used as the partitioning value as a placeholder:

sudo -u dbmerge /usr/illumon/latest/bin/iris_exec xml_import -- \
  -ns XMLExampleNamespace \
  -tn XMLExampleTableName \
  -sf /data/sample.xml \
  -xt Record \
  -av \
  -dp localhost/2018-01-01

Here, -xt Record indicates that each Record element contains one row of data, and -av reads the field values from each element's attributes, matching the sample file above. These and the remaining arguments are described in detail below.

In general, XML imports are performed directly from the command line using the iris_exec tool:

iris_exec xml_import <launch args> -- <xml import args>

XML Import Arguments

The XML importer takes the following arguments:

  • -dd or --destinationDirectory <path>
  • -dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
  • -pc or --intradayPartition <partition column name>

Either a destination directory, a specific partition, or an internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s):
  1. Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  2. Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  3. Internal partition and partition column, for multi-partition imports (e.g., -dp localhost -pc Date); in this mode, each row's partitioning value is taken from the named column's data, so a single source file can populate multiple partitions.
-ns or --namespace <namespace>
  (Required) Namespace in which to find the target table.

-tn or --tableName <name>
  (Required) Name of the target table.

-fd or --delimiter <delimiter character>
  (Optional) Field delimiter. Allows specification of a character to be used when parsing string representations of long or double arrays.

-xi or --startIndex
  Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found.

-xd or --startDepth
  Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import.

-xm or --maxDepth
  Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name.

-xt or --elementType
  The name or path of the element that will contain data elements.

-ev or --useElementValues
  Indicates that field values will be taken from element values, e.g., <Price>10.25</Price>.

-av or --useAttributeValues
  Indicates that field values will be taken from attribute values, e.g., <Record ID="XYZ" Price="10.25" />.

-pv or --positionValues
  When omitted, field values within the document are expected to be named; e.g., a value called Price might be contained in an element named Price or an attribute named Price. When this option is included, field names (column names) are taken from the table schema, and the data values are parsed into them by matching the position of the value with the position of the column in the schema.
-om or --outputMode <import behavior>
  (Optional) Defaults to SAFE.
  • SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted.
  • REPLACE overwrites any existing data for the partitioning value. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort.
  • APPEND adds the imported data to any existing data. It should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc or --relaxedChecking <TRUE or FALSE>
  (Optional) Defaults to FALSE. If TRUE, allows target columns that are missing from the source XML to remain null, and allows the import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.
  • -sd or --sourceDirectory <path>
  • -sf or --sourceFile <exact file name>
  • -sg or --sourceGlob <file name pattern>

sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import. Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used. An example command combining these options appears after this list.

-sn or --sourceName <ImportSource name>
  Specific ImportSource to use. If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV/XML/JSON/JDBC).
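
To tie several of these options together, here is a sketch of a directory import; the directory path, glob, and date are placeholders to adjust for your environment. It imports every .xml file from /data/xmlfiles into one partition, using REPLACE mode so the command can be rerun safely while the process is being developed:

sudo -u dbmerge /usr/illumon/latest/bin/iris_exec xml_import -- \
  -ns XMLExampleNamespace \
  -tn XMLExampleTableName \
  -sd /data/xmlfiles \
  -sg "*.xml" \
  -xt Record \
  -av \
  -om REPLACE \
  -dp localhost/2018-01-01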

One special case when importing XML data is columns in the target table of type Boolean.

  • By default, the XML importer will attempt to interpret string data from the source file, with 0, F, f, or any case of false being written as a Boolean false.
  • Similarly, 1, T, t, or any case of true will be written as a Boolean true.
  • Any other value will result in a conversion failure, which may be continuable if, for instance, a default is set for the column.
  • To convert from other representations of true/false, for example from foreign-language strings, a formula or transform is needed, along with sourceType="String" to ensure the reading of the data from the XML value is handled correctly; a sketch follows below.
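
As an illustration only, the following schema fragment sketches what such an entry might look like. The column name, the source values (German ja/nein), and the formula are hypothetical, and the exact ImportColumn attributes should be checked against the Deephaven Schemas documentation:

<ImportSource name="IrisXML" type="XML">
  <!-- Hypothetical mapping: read the raw value as a String (sourceType="String"),
       then convert German "ja"/"nein" to Boolean true/false, leaving anything else null. -->
  <ImportColumn name="Active"
                sourceName="Active"
                sourceType="String"
                formula="Active.equalsIgnoreCase(&quot;ja&quot;) ? Boolean.TRUE : (Active.equalsIgnoreCase(&quot;nein&quot;) ? Boolean.FALSE : null)" />
</ImportSource>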