Import CSV files using Builder

Deephaven provides tools for inferring a table schema from sample data and importing CSV files. CSV imports may be performed from inside a Deephaven Groovy or Python script. Compared to other methods, this permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec run_local_script tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

Example

The following script imports a single CSV file to a specified partition:

import com.illumon.iris.importers.util.CsvImport
import com.illumon.iris.importers.ImportOutputMode

rows = new CsvImport.Builder("Test","Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.csv")
    .setDestinationPartitions("localhost/2018-04-01")
    .setOutputMode(ImportOutputMode.REPLACE)
    .build()
    .run()

println "Imported " + rows + " rows."

from deephaven import *

rows = (
    CsvImport.Builder("Test", "Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.csv")
    .setDestinationPartitions("localhost/2018-04-01")
    .setOutputMode("REPLACE")
    .build()
    .run()
)

print("Imported {} rows.".format(rows))

Import API Reference

The CSV import class provides a static builder method, which produces an object used to set parameters for the import. The builder returns an import object from the build() method. Imports are executed via the run() method and if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern when scripting an import is:

nRows = CsvImport.builder(<namespace>,<table>)
    .set<option>(<option value>)
    ...
    .build()
    .run()

CSV Import Options

Setter Method	Type	Req?	Default	Description
`setDestinationDirectory`	File \| String	No*	N/A	The destination directory. e.g., `/tmp/mylocation`.
`setOutputMode`	ImportOutputMode \| String	No	`SAFE`	Enumeration with the following options: `ImportOutputMode.SAFE \| ImportOutputMode.REPLACE \| ImportOutputMode.APPEND`. May also be specified as String (e.g.,`"SAFE"/"REPLACE"/"APPEND"`).
`setStrict`	boolean	No	`true`	If `true`, will allow target columns that are missing from the source to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.
`setPartitionColumn`	String	No	N/A	Column name to use to choose which partition to import each source row.
`setSourceName`	String	No	N/A	Specific `ImportSource` to use (from the table schema). If not specified, the importer will use the first `ImportSource` block that it finds that matches the type of the import (CSV/XML/JSON/JDBC).
`setSourceDirectory`	String	No*	N/A	Directory from which to read source file(s).
`setSourceFile`	String	No*	N/A	Source file name (either full path on server filesystem or relative to specified source directory).
`setSourceGlob`	String	No*	N/A	Source file(s) wildcard expression.
`setFileFormat`	String	No	`DEFAULT`	The Apache Commons CSV parser is used to parse the file itself. Five common formats are supported: `DEFAULT` – default format if none is specified; comma-separated field values, newline row terminators, double-quotes around field values that contain embedded commas, newline characters, or double-quotes. `TRIM` - Similar to `DEFAULT`, but will trim all white space around values. `EXCEL` – The Microsoft Excel CSV format. `MYSQL` – The MySQL CSV format. `RFC4180` – IETF RFC 4180 MIME text/csv format. `TDF` – Tab-delimited format.
`setDelimiter`	char	No	`,`	Allows specification of a character other than the file format default as the field delimiter. If delimiter is specified, `fileFormat` is ignored. This must be a single character.
`setSkipLines`	int	No	0	Number of lines to skip before beginning parse (before header line, if any).
`setSkipFooterLines`	int	No	0	Number of footer lines to be skipped at the end of files being read.
`setTrim`	boolean	No	`false`	Whether to trim whitespace from field values.
`setConstantColumnValue`	String	No	N/A	A String to materialize as the source column when an `ImportColumn` is defined with a `sourceType` of `CONSTANT`.

* The sourceDirectory parameter will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.