Import XML using Builder

Deephaven provides tools for inferring a table schema from sample data and importing XML files. Because XML can represent nested or hierarchical data in many different ways, mapping XML to Deephaven tables is more complex than for a simple format like CSV. The XML importer described here handles a few common variations, namely extraction of values from either attributes or element text and different levels of nesting, but some XML formats may require a custom importer.
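
To make the two value styles concrete, the sketch below reads the same records stored once as attributes and once as child-element text. It uses Python's standard xml.etree.ElementTree module rather than the Deephaven importer, so it only illustrates the idea. The Record, ID, and Price names and both helper functions are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative documents; the element and field names are assumptions.
ATTRIBUTE_STYLE = """
<Records>
  <Record ID="XYZ" Price="10.25" />
  <Record ID="ABC" Price="11.50" />
</Records>
""".strip()

ELEMENT_STYLE = """
<Records>
  <Record><ID>XYZ</ID><Price>10.25</Price></Record>
  <Record><ID>ABC</ID><Price>11.50</Price></Record>
</Records>
""".strip()


def rows_from_attributes(xml_text, element_type):
    """Build one dict per matching element from its attributes."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iter(element_type)]


def rows_from_element_text(xml_text, element_type):
    """Build one dict per matching element from the text of its children."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in el} for el in root.iter(element_type)]


attribute_rows = rows_from_attributes(ATTRIBUTE_STYLE, "Record")
element_rows = rows_from_element_text(ELEMENT_STYLE, "Record")

# Both layouts carry the same field names and values.
print(attribute_rows == element_rows)
```

Both helpers return plain string values; any type conversion would be a separate step driven by the table schema.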

Example

The following script imports a single XML file to a specified partition. This import uses options consistent with the XML Quickstart example.

import com.illumon.iris.importers.util.XmlImport
import com.illumon.iris.importers.ImportOutputMode

rows = new XmlImport.Builder("Test","Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.xml")
    .setDestinationPartitions("localhost/2018-04-01")
    .setElementType("Record")
    .setUseAttributeValues(true)
    .setUseElementValues(false)
    .setStartDepth(0)
    .setOutputMode(ImportOutputMode.REPLACE)
    .build()
    .run()

println "Imported " + rows + " rows."

from deephaven import *

rows = (
    XmlImport.builder("Test", "Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.xml")
    .setDestinationPartitions("localhost/2018-04-01")
    .setElementType("Record")
    .setUseAttributeValues(True)
    .setUseElementValues(False)
    .setStartDepth(0)
    .setOutputMode("REPLACE")
    .build()
    .run()
)

print("Imported {} rows.".format(rows))

Import API Reference

The XML import class provides a static builder method, which produces an object used to set parameters for the import. The builder returns an import object from the build() method. Imports are executed via the run() method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern when scripting an import is:

nRows = XmlImport.builder(<namespace>,<table>)
    .set<option>(<option value>)

    .build()
    .run()

XML Import Options

| Setter Method | Type | Req? | Default | Description |
| --- | --- | --- | --- | --- |
| setSourceDirectory | String | No* | N/A | Directory from which to read source file(s). |
| setSourceFile | String | No* | N/A | Source file name (either full path on server filesystem or relative to specified source directory). |
| setSourceGlob | String | No* | N/A | Source file(s) wildcard expression. |
| setDelimiter | char | No | , | Allows specification of a character when parsing string representations of long or double arrays. |
| setElementType | String | Yes | N/A | The name or path of the element that will contain data elements. This will be the name of the element which holds your data. |
| setStartIndex | int | No | 0 | Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found. |
| setStartDepth | int | No | 1 | Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import. |
| setMaxDepth | int | No | 1 | Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name. |
| setUseAttributeValues | boolean | No | false | Indicates that field values will be taken from attribute values; e.g., <Record ID="XYZ" Price="10.25" /> |
| setUseElementValues | boolean | No | true | Indicates that field values will be taken from element values; e.g., <Price>10.25</Price> |
| setPositionValues | boolean | No | false | When false, field values within the document will be named; e.g., a value called Price might be contained in an element named Price, or an attribute named Price. When this option is included, field names (column names) will be taken from the table schema, and the data values will be parsed into them by matching the position of the value with the position of the column in the schema. |
| setConstantColumnValue | String | No | N/A | A String to materialize as the source column when an ImportColumn is defined with a sourceType of CONSTANT. |
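
The positional matching described for setPositionValues can be sketched as follows. This is an illustration built on Python's standard xml.etree.ElementTree module, not the importer's own code. The column list, the sample document, the positional_rows helper, and the decision to reject records with the wrong number of values are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical column order; in a real import it would come from the table schema.
SCHEMA_COLUMNS = ["Sym", "Price", "Size"]

# Illustrative record whose child element names carry no column information.
XML_TEXT = "<Records><Record><v>XYZ</v><v>10.25</v><v>300</v></Record></Records>"


def positional_rows(xml_text, element_type, columns):
    """Pair each child value with the schema column at the same position."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.iter(element_type):
        values = [child.text for child in el]
        # Assumption: a count mismatch is treated as an error in this sketch.
        if len(values) != len(columns):
            raise ValueError(
                "expected %d values, found %d" % (len(columns), len(values))
            )
        rows.append(dict(zip(columns, values)))
    return rows


rows = positional_rows(XML_TEXT, "Record", SCHEMA_COLUMNS)
print(rows)
```

Because names in the document are ignored in this mode, the order of values in each record has to match the order of columns in the schema.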

* The sourceDirectory parameter will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.
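
The resolution rules in the note above can be written out as a small function. This is a sketch of the stated rules only, not the importer's implementation. The resolve_source helper is hypothetical, the default_directory value is a placeholder for the configured log file directory, and giving sourceFile precedence when both a file and a glob are set is an assumption.

```python
import posixpath


def resolve_source(source_directory=None, source_file=None, source_glob=None,
                   default_directory="/hypothetical/log/dir"):
    """Return the path or wildcard pattern the stated rules would select.

    default_directory is a placeholder for the configured log file
    directory; it is not a real Deephaven default.
    """
    if source_file is not None:
        if source_directory is None:
            # No directory given: the file name is used as a fully qualified name.
            return source_file
        # posixpath.join keeps an absolute source_file unchanged.
        return posixpath.join(source_directory, source_file)
    if source_glob is not None:
        # No directory given: fall back to the configured default directory.
        directory = source_directory if source_directory is not None else default_directory
        return posixpath.join(directory, source_glob)
    raise ValueError("set either a source file or a source glob")


print(resolve_source(source_file="/db/TempFiles/dbquery/staging/data1.xml"))
print(resolve_source(source_directory="/db/TempFiles/dbquery/staging", source_file="data1.xml"))
print(resolve_source(source_glob="*.xml"))
```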