Importing from JDBC

Deephaven provides tools for inferring a table schema from JDBC and importing data directly through a JDBC interface. Note that both the schema inference and imports operate on a JDBC ResultSet, not directly on the source table. However, JDBC drivers generally provide rich metadata with the ResultSet, so most standard SQL types can be mapped directly to the appropriate Deephaven column type.

JDBC Data Source Support

When JDBC data sources are used through command line tools or data import classes, any driver class string and connection URL can be specified. The schema discovery and import processes use fairly basic JDBC functionality, so most drivers should work. For schema discovery, though, the driver needs to be able to provide ResultSetMetaData about the query result set; some drivers may not provide this functionality.
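
To check whether a particular driver returns usable metadata, the ResultSetMetaData can be inspected directly with plain JDBC. Below is a minimal Groovy sketch (the driver, URL, and query are placeholders borrowed from the quickstart below):

import java.sql.DriverManager

// Placeholder connection details; substitute your own driver, URL, and query.
def conn = DriverManager.getConnection("jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>")
def rs = conn.createStatement().executeQuery("SELECT TOP 10 * FROM trade")
def md = rs.getMetaData()  // a driver that returns null or sparse metadata here
                           // may not work for schema discovery
(1..md.columnCount).each { i ->
    println "${md.getColumnLabel(i)} -> ${md.getColumnTypeName(i)} (java.sql.Types ${md.getColumnType(i)})"
}
conn.close()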

For JDBC use in the Schema Editor UI, a limited set of data sources is currently available:

  • MySQL/MariaDB
  • Amazon Redshift
  • Microsoft SQL Server
  • PostgreSQL

These data sources have been tested and their typical drivers are known to support features needed for schema discovery and data import.

In either case, it may be necessary to review product documentation for the data source for details about how to connect to it using a JDBC driver. For example, the data source may need to be reconfigured to allow a username and password connection from the Deephaven merge server on which the import process or Schema Editor is being run. Some drivers also have specific requirements for the connection URL: e.g., Postgres requires the database name in the connection URL, while MySQL does not.
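
To illustrate the URL differences, here are hedged sample URLs (host, port, and database names are placeholders, and exact syntax varies by driver version):

// PostgreSQL: the database name is a required part of the URL path.
String postgresUrl = "jdbc:postgresql://<host>:5432/<dbname>"
// MySQL/MariaDB: the database name is optional and may be omitted.
String mysqlUrl = "jdbc:mysql://<host>:3306"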

Quickstart

Here is an example of how to generate a schema from an SQL query ResultSet, deploy it, and perform a JDBC import from the command line. These commands assume a typical Deephaven installation and a SQL Server source database.

Given an SQL Server table created as follows:

create table trade (
    id int primary key,
    trade_date date,
    sym varchar(50),
    qty int,
    px double precision,
    comment varchar(150)
);

Generate a schema from a JDBC query

Note that we limit the size of the result to avoid pulling the entire table into memory.

iris_exec jdbc_schema_creator -- \
-ns ExNamespace -tn Trade \
-cu "jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>" \
-jd "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
-jq "SELECT TOP 1000 * FROM trade"

The schema file will be generated in the ExNamespace directory.

ls -l ExNamespace/*.schema

Deploy the schema

sudo -u irisadmin /usr/illumon/latest/bin/dhconfig schema import --file <fully qualified schema file name>

Import data into a specified Intraday partition

sudo -u dbmerge /usr/illumon/latest/bin/iris_exec jdbc_import -- \
-ns ExNamespace -tn Trade \
-cu "jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>" \
-jd "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
-jq "SELECT * FROM trade" \
-dp localhost/2018-09-26

This example generates a table schema from the SQL Server trade table, using the namespace ExNamespace and the table name Trade. The schema creator creates a directory for the namespace in the current directory and places the schema file inside it. Once the schema is deployed, imports from the table may be scheduled.

Schema Inference

JDBC schema inference generates a table schema from a JDBC query ResultSet. A table column is generated for each column in the query result, and the process maps the SQL type to an appropriate Deephaven type. Column names are used as-is unless a systematic conversion is specified (the --sourceCasing and --destCasing arguments). While the presence of metadata makes the type mapping less error-prone for most import types, the generated schema should still be inspected before deployment.
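
The legal casing values appear to mirror the names in Guava's CaseFormat enum, so you can preview what a conversion would produce. A small Groovy sketch, assuming --sourceCasing LOWER_UNDERSCORE and --destCasing UPPER_CAMEL (column names taken from the quickstart table):

import com.google.common.base.CaseFormat

// Preview a LOWER_UNDERSCORE -> UPPER_CAMEL rename like the one the
// schema creator applies when both casing arguments are supplied.
["trade_date", "sym", "qty"].each { name ->
    println CaseFormat.LOWER_UNDERSCORE.to(CaseFormat.UPPER_CAMEL, name)
}
// prints TradeDate, Sym, Qty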

Command Reference

iris_exec jdbc_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the JDBC schema creator:

-ns or --namespace <namespace>
  (Required) The namespace to use for the new schema.

-tn or --tableName <name>
  (Required) The table name to use for the new schema.

-sp or --schemaPath <path>
  An optional path to which the schema file will be written. If not specified, this defaults to the current working directory, creating or using a subdirectory that matches the namespace.

-jd or --driverClass <fully qualified class name of the JDBC driver>
  The fully-qualified class name of the driver to load to connect to the data source. In most cases, the corresponding jar file will need to be available in the classpath of the schema creator process.

-cu or --connectionUrl <data source specific connection URL>
  Information needed to find and connect to the data source. See below for more details.

-jq or --jdbcQuery <SQL query to execute against the data source>
  The full query string to execute against the data source. In most cases, this will need to be delimited with double quotes.

-am or --assessmentMethod <META_ONLY, DATA_ONLY, or BOTH>
  How to evaluate column types from the JDBC data source.
  • META_ONLY (default) uses the metadata of the result set provided by the JDBC driver, mapping JDBC data types to Deephaven data types.
  • DATA_ONLY scans the data from the result set much as the CSV schema creator does. This is useful for data sources that support few data types, that do not provide metadata, or that store data non-optimally (e.g., date/time values in string columns).
  • BOTH uses metadata first, then validates and refines the results by scanning the actual data.

-pc or --partitionColumn
  Optional name for the partitioning column if the schema is being generated. If not provided, the importer defaults to Date for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".

-gc or --groupingColumn
  Optional column name to mark as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.

-spc or --sourcePartitionColumn
  Optional column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column source_date, specify "source_date" with this option. (Note: this is the column name in the data source, not the Deephaven column name.)

-sn or --sourceName <name for ImportSource block>
  Optional name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisJDBC" is used.

-lp or --logProgress
  If present, additional informational logging provides progress updates during the parsing process.

-bf or --bestFit
  If present, the schema creator attempts to use the smallest numeric types that fit the data. For example, for integer values, short is used if all values are in the range -32768 to 32767; if larger values are seen, the creator moves to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating-point columns.

-om or --outputMode
  Either SAFE (default) or REPLACE. When SAFE, the schema creator exits with an error rather than overwrite an existing file. When set to REPLACE, a pre-existing file is overwritten.

-is or --importStyle
  Either BATCH (default) or CDC_REPLICATION. CDC_REPLICATION indicates the schema should be configured to use a SQLReplicationLogger as the logger interface and take an SQLReplicationRow object as input data. Reasonable defaults for converting SQL types to database types are also generated for each column.

-op or --outPackage
  The output package name (e.g., "com.yourcompany.iris.db.gen") when generating loggers and listeners. Required if generating loggers and listeners.

-lf or --logFormat
  (Optional) Default unspecified. Logger format number. Used to version log files if the table schema is changed.

-un or --useNanos
  (Optional) Defaults to true. If true, each DBDateTime column will indicate timePrecision="Nanos", and the logger will expect a long value with nanos. If false, the loggers and listeners will expect millisecond values.

-sc or --sourceCasing
  Optional unless destination casing is given. Specifies the casing to expect for source column names. If both this and destination casing are specified, column names are converted accordingly. Legal values are: UPPER_UNDERSCORE, LOWER_UNDERSCORE, UPPER_CAMEL, LOWER_CAMEL, and LOWER_HYPHEN.

-dc or --destCasing
  Optional unless source casing is given. Determines the Deephaven column name format. If both this and source casing are specified, column names are converted accordingly. Legal values are: UPPER_UNDERSCORE, LOWER_UNDERSCORE, UPPER_CAMEL, LOWER_CAMEL, and LOWER_HYPHEN.

Import Query

Note

The "Import-JDBC" option is only available in Deephaven Classic.

When Import - JDBC is selected, the Persistent Query Configuration Editor window shows the following options:


  • To proceed with creating a query to import data via JDBC, you will need to select a DB Server and enter the desired value for Memory (Heap) Usage (GB).

  • Options available in the Show Advanced Options section of the panel are typically not used when importing or merging data. To learn more about this section, please refer to the Persistent Query Configuration Viewer/Editor.

  • The Access Control tab presents a panel with the same options as all other configuration types, and gives the query owner the ability to authorize Admin and Viewer Groups for this query. For more information, please refer to Access Control.

  • Clicking the Scheduling tab presents a panel with the same scheduling options as all other configuration types. For more information, please refer to Scheduling.

  • Clicking the JdbcImport Settings tab presents a panel with relevant options:


JdbcImport Settings

  • Namespace: This is the namespace into which you want to import the data.
  • Table: This is the table into which you want to import the data.
  • Output Mode: This determines what happens if data is found in the fully-specified partition for the data. The fully-specified partition includes both the internal partition (unique for the import job) and the column partition (usually the date).
    • Safe - if existing data is found in the fully-specified partition, the import job will fail.
    • Append - if existing data is found in the fully-specified partition, data will be appended to it.
    • Replace - if existing data is found in the fully-specified partition, it will be replaced. This does not replace all data for a column partition value, just the data in the fully-specified partition.
  • Import Source: This is the import source section of the associated schema file that specifies how source data columns will be set as Deephaven columns.
    • Strict Import will fail if a file being imported has missing column values (nulls), unless those columns allow or replace the null values using a default attribute in the ImportColumn definition.
  • Single/Multi Partition: This controls the import mode. In single-partition, all of the data is imported into a single Intraday partition. In multi-partition mode, you must specify a column in the source data that will control to which partition each row is imported.
    • Single-partition configuration:
      • Partition Formula: This is the formula needed to partition the data being imported. If a specific partition value is used, it must be surrounded by quotes. For example: currentDateNy(), or "2017-01-01" for a fixed date.
      • Partition Substitution: This is a token used to substitute the determined column partition value in the source directory, source file, or source glob, to allow the dynamic determination of these fields. For example, if the partition substitution is "PARTITION_SUB", and the source directory includes "PARTITION_SUB" in its value, that PARTITION_SUB will be replaced with the partition value determined from the partition formula.
      • Substitution Date Format: This is the date format that will be used when a Partition Substitution is used. The standard Deephaven date partition format is yyyy-MM-dd (e.g., 2018-05-30), but this field allows substitution in another format. For example, if the filename includes the date in yyyyddMM format instead (e.g., 20183005), that pattern could be entered here. All the patterns from the Java DateTimeFormatter class are allowed (see the sketch after this list).
    • Multi-partition configuration:
      • Import Partition Column: This is the name of the database column used to choose the target partition for each row (typically "Date"). There must be a corresponding ImportColumn present in the schema, which indicates how to get this value from the source data.
  • Driver: This is the name of the driver needed to access the database. It defaults to the MySQL driver.
  • URL: This is the URL for the JDBC connection.
  • Query: This is the query detailing what data you want to pull from the database.
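
To confirm a Substitution Date Format pattern before using it, the same pattern letters can be exercised with java.time directly. A quick Groovy sketch using the yyyyddMM example from above:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The Substitution Date Format field accepts DateTimeFormatter patterns.
def date = LocalDate.of(2018, 5, 30)
println date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))  // 2018-05-30 (standard partition format)
println date.format(DateTimeFormatter.ofPattern("yyyyddMM"))    // 20183005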

Importing Using Builder

JDBC imports may be performed from inside a Deephaven Groovy or Python script. Compared to other methods, this permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec run_local_script tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

Examples

The following script imports all the records from a SQL Server table into a single partition:

import com.illumon.iris.importers.util.JdbcImport
import com.illumon.iris.importers.ImportOutputMode

rows = JdbcImport.builder("ExNamespace", "Trade")
    .setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .setConnectionUrl("jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>")
    .setQuery("SELECT * FROM trade")
    .setDestinationPartitions("localhost/2018-05-01")
    .setStrict(false)
    .setOutputMode(ImportOutputMode.SAFE)
    .build()
    .run()

println "Imported " + rows + " rows."
from deephaven import *

rows = JdbcImport.builder("ExNamespace", "Trade")\
    .setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver")\
    .setConnectionUrl("jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>")\
    .setQuery("SELECT * FROM trade")\
    .setDestinationPartitions("localhost/2018-05-01")\
    .setStrict(False)\
    .setOutputMode("SAFE")\
    .build()\
    .run()

print("Imported {} rows.".format(rows))

The following script imports all records into partitions based on the "Date" column, with the user and password values passed in via the iris_exec command line. This is a useful idiom when importing large datasets. Note that the input query includes a sort clause on the column being used for partitioning; this is important for optimal performance when performing multi-partition imports.

import com.illumon.iris.importers.util.JdbcImport
import com.illumon.iris.importers.ImportOutputMode

rows = JdbcImport.builder("ExNamespace", "Trade")
    .setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .setConnectionUrl("jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>")
    .setUser(user)
    .setPassword(password)
    .setSourceTimeZone("UTC")
    .setQuery("SELECT * FROM table1 ORDER BY date")
    .setDestinationPartitions("localhost")
    .setPartitionColumn("Date")
    .build()
    .run()

println "Imported " + rows + " rows."
from deephaven import *

rows = JdbcImport.builder("ExNamespace", "Trade")\
    .setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver")\
    .setConnectionUrl("jdbc:sqlserver://<host>;database=<db>;user=<user>;password=<pwd>")\
    .setUser(user)\
    .setPassword(password)\
    .setSourceTimeZone("UTC")\
    .setQuery("SELECT * FROM table1 ORDER BY date")\
    .setDestinationPartitions("localhost")\
    .setPartitionColumn("Date")\
    .build()\
    .run()

print("Imported {} rows.".format(rows))

Import API Reference

The JDBC import class provides a static builder method, which produces an object used to set parameters for the import. The builder returns an import object from the build() method. Imports are executed via the run() method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern when scripting an import is:

nRows = JdbcImport.builder(<namespace>, <table>)
    .set<option>(<option value>)
    ...
    .build()
    .run()

JDBC Import Options

setDestinationPartitions (String; no default)
  The destination partition(s); e.g., "localhost/2018-05-01" for a single-partition import, or "localhost" for a multi-partition import. Either this or setDestinationDirectory must be specified.

setDestinationDirectory (File | String; no default)
  The destination directory; e.g., /tmp/mylocation. Either this or setDestinationPartitions must be specified.

setOutputMode (ImportOutputMode | String; optional; default SAFE)
  Enumeration with the following options: ImportOutputMode.SAFE, ImportOutputMode.REPLACE, or ImportOutputMode.APPEND. May also be specified as a String (e.g., "SAFE"/"REPLACE"/"APPEND").

setStrict (boolean; optional; default true)
  If false, allows target columns that are missing from the source to remain null, and allows the import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.

setPartitionColumn (String; optional; no default)
  Column name used to choose the partition into which each source row is imported.

setSourceName (String; optional; no default)
  Specific ImportSource to use (from the table schema). If not specified, the importer uses the first ImportSource block that matches the type of the import (CSV/XML/JSON/JDBC).

setDriver (String; required)
  JDBC driver class; e.g., com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server.

setConnectionUrl (String; required)
  JDBC connection string.

setUser (String; optional)
  JDBC user (if not specified in the connection string).

setPassword (String; optional)
  JDBC password (if not specified in the connection string).

setQuery (String; required)
  SQL query to execute for the import.

setSourceTimeZone (TimeZone; optional; default: server time zone*)
  Time zone to be used when interpreting time and date values that do not have explicit offsets.

* Unless the source time zone is specified, the JDBC import task attempts to read the server time zone/UTC offset prior to executing the query. The source time zone is used to interpret SQL "datetime", "date", and "time-of-day" values, unless those types explicitly contain an offset. Since Deephaven presently has no "date" type, date values are interpreted as a datetime at midnight in the source time zone.
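
As an illustration of that interpretation (plain java.time, not the importer's internals): a SQL DATE carries no offset, so the same calendar date anchors to different instants depending on the time zone used for midnight.

import java.time.LocalDate
import java.time.ZoneId

// Midnight for the same date in two zones yields two different instants.
def date = LocalDate.of(2018, 5, 1)
println date.atStartOfDay(ZoneId.of("UTC")).toInstant()               // 2018-05-01T00:00:00Z
println date.atStartOfDay(ZoneId.of("America/New_York")).toInstant()  // 2018-05-01T04:00:00Z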

Import from Command Line

JDBC imports can be performed directly from the command line, using the iris_exec tool.

Command Reference

iris_exec jdbc_import <launch args> -- <jdbc import args>

JDBC Import Arguments

The JDBC importer takes the following arguments:

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
  Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s):
  1. Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  2. Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  3. Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns or --namespace <namespace>
  (Required) Namespace in which to find the target table.

-tn or --tableName <name>
  (Required) Name of the target table.

-om or --outputMode <import behavior>
  (Optional):
  • SAFE (default) checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted.
  • When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort.
  • APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, where the resultant data needs to be kept together as part of one logical partition.

-rc or --relaxedChecking <TRUE or FALSE>
  (Optional) Defaults to FALSE. If TRUE, allows target columns that are missing from the JDBC source to remain null, and allows the import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sn or --sourceName <ImportSource name>
  Specific ImportSource to use. If not specified, the importer uses the first ImportSource block that matches the type of the import (CSV/XML/JSON/JDBC).

-jd or --jdbcDriver <class name>
  Class name of the driver to load to connect to the data source. See below for more details.

-cu or --connectionUrl <JDBC URL>
  Information needed to find and connect to the data source. See below for more details.

-jq or --jdbcQuery <query string>
  The full query string to execute against the data source. In most cases, this will need to be delimited with double quotes.

The schema of the target table will be used to select default setters for transformation of incoming data if there are no ImportColumns to provide other instructions. For example, if the source column is a string, but the Deephaven table column is designated as integer, the importer will attempt to cast the data as integer.
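
The following is a conceptual Groovy sketch only, not the importer's actual code: it shows the sort of string-to-int conversion a default setter applies, with nulls passed through (whether nulls are then permitted depends on the strict/relaxed-checking setting).

// Hypothetical helper illustrating a default string-to-int conversion.
Integer toInt(String s) { s == null ? null : Integer.parseInt(s.trim()) }

assert toInt("42") == 42
assert toInt(null) == null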

  • The driver will need to be available from the classpath of the importer.
    • Examples of driver classes are:
      • com.microsoft.sqlserver.jdbc.SQLServerDriver for Microsoft SQL Server
      • com.mysql.jdbc.Driver for MySQL
  • The connection URL will include things like server name, port, authentication, etc., which vary depending on the type of data source being used.
  • An example of a SQL Server URL for a local, default instance, using integrated Windows security, might be something like the following:
    • jdbc:sqlserver://localhost:1433;databaseName=Sample;integratedSecurity=true
  • Other sources may be a bit different. For example, a sample MySQL connectionURL follows:
    • jdbc:mysql://localhost:3306/Sample?user=root&password=secret

The SQL query format will vary depending on the specific language capabilities and expectations of the data source. However, the resultant columns should match the column names in the target table. Most SQL dialects allow columns to be calculated or labeled as part of the query, and the importer will match columns based on the resultant labels. Rather than directly matching result set column names to target table column names, another option is to add transformation metadata to the Deephaven schema. This allows the user to specify options such as a different source column name, a different source column data type, whether a column may be null, and even custom transform functions to use during the import. See the general importer topic for a sample schema for an import target table.
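
For example, if the source table uses lower_underscore names while the Deephaven schema uses UpperCamel column names, the query itself can do the renaming. A hedged Groovy sketch (the aliases are illustrative; match them to your actual schema):

// Result set labels are matched to Deephaven column names, so SQL aliases
// can stand in for per-column rename metadata in the schema.
def query = """
    SELECT trade_date AS TradeDate,
           sym        AS Sym,
           qty        AS Qty,
           px         AS Px
    FROM trade
"""

This string can then be passed unchanged to setQuery or the -jq argument.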

In addition to the column translation features of the CSV importer, the JDBC importer performs special handling when the source column is a date type (JDBC DateTime, Date, or Timestamp) and the destination is a DBDateTime. This handling attempts a direct conversion between date values.

Note

DBDateTime is restricted to dates in the range of 09/25/1677 to 04/11/2262.
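
This limit stems from representing time as nanoseconds since the Unix epoch in a signed 64-bit long. The raw extremes of that representation (the documented DBDateTime bounds sit just inside them) can be computed with java.time:

import java.time.Instant

// Min and max instants representable as nanos-since-epoch in a signed long.
println Instant.EPOCH.plusNanos(Long.MIN_VALUE)  // 1677-09-21T00:12:43.145224192Z
println Instant.EPOCH.plusNanos(Long.MAX_VALUE)  // 2262-04-11T23:47:16.854775807Z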