Import XML Files
This guide demonstrates how to generate a schema from an XML data file, deploy it, and import the file from the command line. These commands assume a typical Deephaven installation and a sample file located at `/data/sample.xml`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Record Name="PI" Value="3.14159"/>
  <Record Name="e" Value="2.71828"/>
  <Record Name="phi" Value="1.61803"/>
</Response>
```
Deephaven offers robust tools for inferring table schemas from sample data and importing XML files. Unlike simpler formats such as CSV, XML can represent nested and hierarchical data, which introduces additional complexity when mapping to Deephaven tables. The XML importer described in this guide supports several common scenarios, including extracting values from attributes or element text and handling varying levels of nesting. For more intricate XML structures, however, a custom importer may be necessary to achieve the desired mapping.
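For instance, the sample file above stores its values in attributes, but the same data could equally arrive as element text; the importer's `-av` and `-ev` options (described below) select between these layouts. A hypothetical element-text variant of the same record:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Record>
    <Name>PI</Name>
    <Value>3.14159</Value>
  </Record>
</Response>
```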
Generate a schema
See the Schema inference page for a guide on generating a schema from an XML file like the one above.
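For reference, the schema generated for the sample file above might resemble the following sketch. The column names, types, and partitioning column here are assumptions; the actual output depends on the options passed to the schema creator.

```xml
<Table name="XMLExampleTableName" namespace="XMLExampleNamespace" storageType="NestedPartitionedOnDisk">
  <!-- A partitioning column is assumed; Deephaven intraday tables require one. -->
  <Partitions keyFormula="${autobalance_single}"/>
  <Column name="Date" dataType="String" columnType="Partitioning" />
  <Column name="Name" dataType="String" />
  <Column name="Value" dataType="double" />
</Table>
```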
Deploy the schema

Deploy the generated schema file with `dhconfig`:

```bash
sudo -u irisadmin /usr/illumon/latest/bin/dhconfig schema import --file /tmp/XMLExampleNamespace.XMLExampleTableName.schema
```

Here, the `--file` argument is the fully qualified name of the schema file to deploy. See Deephaven Schemas for further details.
Import a single data file

The following command imports an XML file into the specified intraday partition. XML imports can be performed directly from the command line using the `iris_exec` tool:

```bash
iris_exec xml_import <launch args> -- <xml import args>
```
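For example, a single-file import of the sample above might look like the sketch below. The `dbmerge` account, the destination partition value, and the positioning flags (`-xi`, `-xd`) are assumptions for this particular document layout; consult the argument table below to tailor them.

```bash
# Hedged example: import /data/sample.xml into an intraday partition.
# -av reads field values from attributes (Name and Value in the sample);
# the destination partition value below is illustrative.
sudo -u dbmerge /usr/illumon/latest/bin/iris_exec xml_import -- \
  -ns XMLExampleNamespace \
  -tn XMLExampleTableName \
  -sf /data/sample.xml \
  -xi 1 -xd 1 \
  -av \
  -dp localhost/2019-01-01
```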
XML Import Arguments
The XML importer takes the following arguments:
| Argument | Description |
| --- | --- |
| `-dd` or `--destinationDirectory <path>`<br>`-dp` or `--destinationPartition <internal partition name / partitioning value>`<br>`-pc` or `--intradayPartition <partition column>` | Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s): (1) a destination directory; (2) a full partition (internal partition name plus partitioning value); or (3) an internal partition name plus a partition column, which allows the partitioning value to be taken from the data itself. |
| `-ns` or `--namespace <namespace>` | (Required) Namespace in which to find the target table. |
| `-tn` or `--tableName <name>` | (Required) Name of the target table. |
| `-fd` or `--delimiter <delimiter character>` | (Optional) Field delimiter. Allows specification of a character to be used when parsing string representations of long or double arrays. |
| `-xi` or `--startIndex` | Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found. |
| `-xd` or `--startDepth` | Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import. |
| `-xm` or `--maxDepth` | Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name. |
| `-xt` or `--elementType` | The name or path of the element that will contain data elements. |
| `-ev` or `--useElementValues` | Indicates that field values will be taken from element values, e.g., `<Price>10.25</Price>`. |
| `-av` or `--useAttributeValues` | Indicates that field values will be taken from attribute values, e.g., `<Record ID="XYZ" Price="10.25" />`. |
| `-pv` or `--positionValues` | When omitted, field values within the document will be named; e.g., a value called Price might be contained in an element named `Price` or an attribute named `Price`. When this option is included, field names (column names) will be taken from the table schema, and the data values will be parsed into them by matching the position of the value with the position of the column in the schema. |
| `-om` or `--outputMode <import behavior>` | (Optional) `SAFE` (default), `REPLACE`, or `APPEND`. `SAFE` checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. `REPLACE` removes any existing data matching the partitioning value before importing. `APPEND` adds the imported data to any existing data for the partitioning value. |
| `-rc` or `--relaxedChecking <TRUE or FALSE>` | (Optional) Defaults to `FALSE`. If `TRUE`, allows target columns that are missing from the source XML to remain null, and allows the import to continue when data conversions fail. In most cases, this should be set to `TRUE` only when developing the import process for a new data source. |
| `-sd` or `--sourceDirectory <path>`<br>`-sf` or `--sourceFile <exact file name>`<br>`-sg` or `--sourceGlob <file name glob>` | `sourceDirectory`, `sourceFile`, and `sourceGlob` are all optional. If none of these are provided, the system will attempt to do a multi-file import. Otherwise, `sourceDirectory` will be used in conjunction with `sourceFile` or `sourceGlob`. If `sourceDirectory` is not provided, but `sourceFile` is, then `sourceFile` will be used as a fully qualified file name. If `sourceDirectory` is not provided, but `sourceGlob` is, then `sourceDirectory` will default to the configured log file directory from the prop file being used. |
| `-sn` or `--sourceName <ImportSource name>` | Specific `ImportSource` to use. If not specified, the importer will use the first `ImportSource` block that it finds that matches the type of the import (CSV/XML/JSON/JDBC). |
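For context, an `ImportSource` block is defined inside the table's schema file and can customize how source fields map to columns. The following is a hedged sketch of that structure, extending the schema sketched earlier; the source name `XML` and the column mappings are assumptions for the sample table.

```xml
<Table name="XMLExampleTableName" namespace="XMLExampleNamespace" storageType="NestedPartitionedOnDisk">
  <!-- ... Partitions and Column definitions as shown earlier ... -->
  <ImportSource name="XML" type="XML">
    <!-- Read the "Name" attribute as a String into the Name column. -->
    <ImportColumn name="Name" sourceName="Name" sourceType="String" />
    <!-- Parse the "Value" attribute into the double Value column. -->
    <ImportColumn name="Value" sourceName="Value" sourceType="double" />
  </ImportSource>
</Table>
```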
One special case when importing XML data is columns in the target table with a type of `Boolean`:

- By default, the XML importer will attempt to interpret string data from the source, with `0`, `F`, `f`, or any case of `false` being written as a Boolean `false`.
- Similarly, `1`, `T`, `t`, or any case of `true` would be a Boolean `true`.
- Any other value would result in a conversion failure that may be continuable if, for instance, a default is set for the column.
- To convert from other representations of `true`/`false`, for example from foreign-language strings, a formula or transform would be needed, along with a `sourceType="String"` to ensure the reading of the data from the XML value is handled correctly. A sketch of such a mapping follows this list.
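As an illustration of that last point, the following `ImportColumn` sketch maps hypothetical German-language strings to a Boolean column. The column name, source field, and string values are assumptions, and the formula references the source value by the column's name, following the `ImportColumn` conventions shown above.

```xml
<!-- Hypothetical mapping: "wahr" -> true, "falsch" -> false, anything else -> null. -->
<ImportColumn name="Active" sourceName="Aktiv" sourceType="String"
    formula="Active.equalsIgnoreCase(&quot;wahr&quot;) ? Boolean.TRUE : (Active.equalsIgnoreCase(&quot;falsch&quot;) ? Boolean.FALSE : null)" />
```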