Import CSV files using Builder

Deephaven provides tools for inferring a table schema from sample data and importing CSV files. CSV imports may be performed from inside a Deephaven Groovy or Python script. Compared to other methods, this permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec run_local_script tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

Example

The following script imports a single CSV file to a specified partition:

import com.illumon.iris.importers.util.CsvImport
import com.illumon.iris.importers.ImportOutputMode

rows = new CsvImport.Builder("Test","Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.csv")
    .setDestinationPartitions("localhost/2018-04-01")
    .setOutputMode(ImportOutputMode.REPLACE)
    .build()
    .run()

println "Imported " + rows + " rows."
from deephaven import *

rows = (
    CsvImport.Builder("Test", "Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.csv")
    .setDestinationPartitions("localhost/2018-04-01")
    .setOutputMode("REPLACE")
    .build()
    .run()
)

print("Imported {} rows.".format(rows))

Import API Reference

The CSV import class provides a static builder method, which produces an object used to set parameters for the import. The builder returns an import object from the build() method. Imports are executed via the run() method and if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern when scripting an import is:

nRows = CsvImport.builder(<namespace>,<table>)
    .set<option>(<option value>)
    ...
    .build()
    .run()

CSV Import Options

Setter MethodTypeReq?DefaultDescription
setDestinationDirectoryFile | StringNo*N/AThe destination directory. e.g., /tmp/mylocation.
setOutputModeImportOutputMode | StringNoSAFEEnumeration with the following options: ImportOutputMode.SAFE | ImportOutputMode.REPLACE | ImportOutputMode.APPEND. May also be specified as String (e.g.,"SAFE"/"REPLACE"/"APPEND").
setStrictbooleanNotrueIf true, will allow target columns that are missing from the source to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.
setPartitionColumnStringNoN/AColumn name to use to choose which partition to import each source row.
setSourceNameStringNoN/ASpecific ImportSource to use (from the table schema). If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV/XML/JSON/JDBC).
setSourceDirectoryStringNo*N/ADirectory from which to read source file(s).
setSourceFileStringNo*N/ASource file name (either full path on server filesystem or relative to specified source directory).
setSourceGlobStringNo*N/ASource file(s) wildcard expression.
setFileFormatStringNoDEFAULTThe Apache Commons CSV parser is used to parse the file itself. Five common formats are supported:
  • DEFAULT – default format if none is specified; comma-separated field values, newline row terminators, double-quotes around field values that contain embedded commas, newline characters, or double-quotes.
  • TRIM - Similar to DEFAULT, but will trim all white space around values.
  • EXCEL – The Microsoft Excel CSV format.
  • MYSQL – The MySQL CSV format.
  • RFC4180 – IETF RFC 4180 MIME text/csv format.
  • TDF – Tab-delimited format.
setDelimitercharNo,Allows specification of a character other than the file format default as the field delimiter. If delimiter is specified, fileFormat is ignored. This must be a single character.
setSkipLinesintNo0Number of lines to skip before beginning parse (before header line, if any).
setSkipFooterLinesintNo0Number of footer lines to be skipped at the end of files being read.
setTrimbooleanNofalseWhether to trim whitespace from field values.
setConstantColumnValueStringNoN/AA String to materialize as the source column when an ImportColumn is defined with a sourceType of CONSTANT.

* The sourceDirectory parameter will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.