Downsample Imports

Downsample imports are used to import a downsampled version of data from an existing time-series table into an intraday partition. The target table must already exist. Unlike other styles of import, Deephaven provides no "schema creator" tool for downsample imports, because the target table typically has a very similar or identical schema to the source table. The downsampler reads data from the source table (typically a historical partition) and writes the downsampled result to the target table's intraday partition.

Quickstart

The following is an example of how to run a downsample import from the command line (this document also describes how to import via a script or a Deephaven UI Import Job). This example assumes a typical Deephaven installation and that the target table already exists. Note that this process imports from an existing historical partition.

sudo -u dbmerge iris_exec downsample_import -- \
-ns TargetNamespace -tn TargetQuoteTable \
-sn SourceNamespace -st SourceQuoteTable -sp 2018-09-26 \
-tc Timestamp -p 00:00:10 -kc Sym -last Bid,Ask,BidSize,AskSize \
-dp localhost/2018-09-26

This example downsamples the 2018-09-26 partition of SourceNamespace.SourceQuoteTable and stores the result in the target table (TargetNamespace.TargetQuoteTable). The Timestamp column is used to generate 10-second time bins, "Sym" is the key column, and the last values of the Bid, Ask, BidSize, and AskSize columns within each time bin are selected for each key.

Import Query

Note

The "Import - Downsample" option is only available in Deephaven Classic.

When Import - Downsample is selected, the Persistent Query Configuration Editor window shows the following options:


  • To proceed with creating a query to import downsampled data, you will need to select a DB Server and enter the desired value for Memory (Heap) Usage (GB).

  • Options available in the Show Advanced Options section of the panel are typically not used when importing or merging data. To learn more about this section, please refer to the Persistent Query Configuration Viewer/Editor.

  • The Access Control tab presents a panel with the same options as all other configuration types, and gives the query owner the ability to authorize Admin and Viewer Groups for this query. For more information, please refer to Access Control.

  • Clicking the Scheduling tab presents a panel with the same scheduling options as all other configuration types. For more information, please refer to Scheduling.

  • Clicking the DownsampleImport Settings tab presents a panel with the relevant options, described below.


Downsample Import Settings

  • Target Namespace: This is the namespace into which you want to import the file.
  • Target Table: This is the table into which you want to import the data. The schema of this table must be consistent with the output of the downsampling process (as specified by the other options on this panel).
  • Output Mode: This determines what happens if data is already present in the fully-specified partition. The fully-specified partition includes both the internal partition (unique for the import job) and the column partition (usually the date).
    • Safe - if existing data is found in the fully-specified partition, the import job will fail.
    • Append - if existing data is found in the fully-specified partition, data will be appended to it.
    • Replace - if existing data is found in the fully-specified partition, it will be replaced. This does not replace all data for a column partition value, just the data in the fully-specified partition.
    • Strict Import - has no meaning (at present) for this type of import.
  • Downsample Parameters:
    • By Strategy - The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
    • AJ Strategy - The Deephaven join strategy to use when joining source data to time bins (applicable only when setAllBins(true)).
    • Natural Join Strategy - The Deephaven join strategy to use when filtering time bins.
    • Number Threads - the number of threads to use when downsampling. For large downsampling jobs where "generate all time bins" is true, this can improve performance by parallelizing the downsample time-join on a per-key basis.
    • Partition Formula: Specifies which source column partition to import. This style of import supports only single-partition imports. A typical value for this might be currentDateNy().
    • Source Namespace: Namespace of the table that will be downsampled.
    • Source Table: Name of the table to downsample. This table must contain the timestamp and key columns specified in this panel. For performance reasons, it is recommended that large source tables be grouped by the key column(s).
    • Timestamp Column: Name of the column in the source table to use for generating time bins. Must be of type DBDateTime.
    • Period: Interval to use for generating time bins in HH:MM:SS format. Examples: 00:05:00 for 5 minute bins, 00:01:00 for 1 minute bins.
    • Calendar: The name of a calendar to use for filtering time stamps.
    • Time Bin Mode: How to assign timestamps in the source data to time bins. Given the time of 01:32:34 and a period of 5 minutes, UPPER would assign this time to the 01:35:00 bin, while LOWER would assign to the 01:30:00 bin. UPPER is a more typical choice.
    • Generate All Time Bins: Whether or not to generate time bins for the entire time range spanned by the underlying data, even if no data falls into the bin. Setting this to true requires one or more joins and hence will cause the downsampling process to be significantly slower.
    • Key Columns: Optional columns to use for grouping the source data. The downsampling will produce a row for each (time bin, key) combination. For financial series, the instrument identifier is a typical choice for this.
    • Aggregate Columns: Comma-separated list of aggregate/data column names desired in the downsampler output. These may match columns in the source table, or rename them using the OutputColumn=OriginalColumn format. The latter is typically used for additional aggregates of source columns (for example, Volume=Size used in combination with a sum aggregate on trade data).
    • Aggregates: Comma-separated list of aggregates to use for each column specified in the Aggregate Columns field. The number of aggregates must exactly match the number of aggregate columns. Legal values for aggregates are: Last, First, Sum, Min, Max, Avg, Std, Var, and Array (Count is not available). See the example after this list.
    • Maintain State Columns: Comma-separated list of columns for which to maintain state from one time bin to the next if a bin contains no data when generating all time bins. Typically used for Last aggregate columns where the value is "good until changed" (such as Bid and Ask in quote data).
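
For example, to keep the last quote values in each bin and also produce a summed, renamed size column, the two fields might pair up as follows (the column names are illustrative, not required by the importer):

    Aggregate Columns: Bid,Ask,Volume=Size
    Aggregates: Last,Last,Sum

Here Bid and Ask take the last value observed in each time bin, while Volume is the sum of the source Size column over the bin.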

Importing Using Builder

Downsample imports may be performed from inside a Deephaven Groovy or Python script. This permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec run_local_script tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

The DownsampleImport class provides a static builder method, which produces an object used to set parameters for the import. The builder returns a DownsampleImport object from the build() method. Imports are executed via the run() method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern for a downsample import is:

nRows = DownsampleImport.builder(<db>, <namespace>, <table>, <timestamp column>, <period>, <key column>...)
  .set<option>(<option value>)
  ...
  .build()
  .run()
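
As a concrete illustration, the following Groovy sketch mirrors the Quickstart example; the namespace, table, and column names are the same illustrative ones used there, and retrieving the source partition with db.t and a Date filter is an assumption about how the surrounding script obtains its source table.

// Read the source column partition to be downsampled.
sourceData = db.t("SourceNamespace", "SourceQuoteTable").where("Date=`2018-09-26`")

// Downsample into 10-second bins per Sym, keeping the last quote values in each bin,
// and write the result to the target table's intraday partition.
nRows = DownsampleImport.builder(db, "TargetNamespace", "TargetQuoteTable", "Timestamp", "00:00:10", "Sym")
  .setSourceTable(sourceData)
  .setDestinationPartitions("localhost/2018-09-26")
  .addLastColumns("Bid", "Ask", "BidSize", "AskSize")
  .build()
  .run()

println "Downsampled " + nRows + " rows"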

Required Builder Parameters

All of the following builder parameters are required and have no default values:

  • db (Database): The Deephaven database (typically "db" in a Python/Groovy script).
  • namespace (String): The target table namespace.
  • table (String): The target table name.
  • timestamp column (String): Name of the time column in the source table used for time sampling. The output time bin column will be the same as this unless it is renamed by specifying this as a pair (i.e., "TimeBin=Timestamp", where Timestamp is the input column and TimeBin is the output column).
  • period (String): Time interval for sampling in HH:MM:SS format.
  • key column(s) (String...): Zero or more "key columns", used to group the data. There will be one row in the output for each distinct key value per time interval.

Import Options

  • setDestinationPartitions (String, optional*): The destination partition(s); e.g., "localhost/2018-05-01" for a single-partition import or "localhost" for a multi-partition import.
  • setDestinationDirectory (File | String, optional*): The destination directory; e.g., /tmp/mylocation.
  • setOutputMode (ImportOutputMode | String, optional; default SAFE): Enumeration with the following options:
    • ImportOutputMode.SAFE
    • ImportOutputMode.REPLACE
    • ImportOutputMode.APPEND
    May also be specified as a String: "SAFE", "REPLACE", or "APPEND".
  • setStrict (boolean, optional; default true): If true, will allow target columns that are missing from the source to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.
  • setPartitionColumn (String, optional): Column name used to choose the partition into which each source row is imported.
  • setSourceTable (Table, required): Source table to downsample.
  • setTimeBinMode (String | Downsampler.TimeBinMode, optional; default UPPER): Must be LOWER or UPPER. This indicates how to generate the time bins from source timestamps. LOWER maps a timestamp to the nearest previous time that falls on an even time interval; UPPER maps it to the next.
  • setAllBins (boolean, optional; default false): Indicates whether or not to generate all time bins over the interval covered by the source table. If false, only intervals for which source data exists will be generated.
  • setMaintainStateColumns (String..., optional): Indicates a subset of one or more output columns that should "maintain state", i.e., keep the same value from one time interval to the next even if no new data is observed. Typically used with setAllBins(true).
  • setByStrategy (String | Table.ByStrategy, optional; default DEFAULT): The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
  • setAjStrategy (String | Table.JoinStrategy, optional; default DEFAULT): The Deephaven join strategy to use when joining source data to time bins (applicable only when setAllBins(true)).
  • setNaturalJoinStrategy (String | Table.JoinStrategy, optional; default DEFAULT): The Deephaven join strategy to use when filtering time bins.
  • setCalendar (String, optional): A Deephaven calendar to use for filtering output time bins.
  • setNumThreads (int, optional; default 1): Maximum number of threads to use when downsampling. Downsampling can be parallelized to some extent if key columns are present.
  • setLogger (Logger, optional; defaults to a default logger): Deephaven logger object for logging progress.
  • addAggregates (ComboAggregateFactory.ComboBy..., optional): Add the given aggregates to the downsampling process for generating output data columns. Typically it is easier to apply the methods for specific operations (addLastColumns, etc.).
  • addAggregate (AggType, String; optional): Add the given aggregate to the downsampling process for generating output data columns. Typically it is easier to apply the methods for specific operations (addLastColumns, etc.).
  • addLastColumns (String..., optional): Add the given columns to the downsample output, to be generated by a lastBy operation from the source data.
  • addFirstColumns (String..., optional): Add the given columns to the downsample output, to be generated by a firstBy operation from the source data.
  • addMinColumns (String..., optional): Add the given columns to the downsample output, to be generated by a minBy operation from the source data.
  • addMaxColumns (String..., optional): Add the given columns to the downsample output, to be generated by a maxBy operation from the source data.
  • addSumColumns (String..., optional): Add the given columns to the downsample output, to be generated by a sumBy operation from the source data.
  • addStdColumns (String..., optional): Add the given columns to the downsample output, to be generated by a stdBy operation from the source data.
  • addVarColumns (String..., optional): Add the given columns to the downsample output, to be generated by a varBy operation from the source data.
  • addAvgColumns (String..., optional): Add the given columns to the downsample output, to be generated by an avgBy operation from the source data.
  • addArrayColumns (String..., optional): Add the given columns to the downsample output, to be generated by an arrayBy operation from the source data.

* Either a destination directory or destination partition(s) must be specified.
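
To show how several of these options combine, the following Groovy sketch (with hypothetical trade-table and column names) downsamples trades into one-minute bars, generating a row for every bin and carrying the last price forward through bins with no trades:

// Hypothetical source trade table; adjust namespace, table, and columns to your own data.
sourceTrades = db.t("ExampleNamespace", "EquityTradeTable").where("Date=`2018-09-26`")

nRows = DownsampleImport.builder(db, "ExampleNamespace", "EquityTradeBars", "Timestamp", "00:01:00", "Sym")
  .setSourceTable(sourceTrades)
  .setDestinationPartitions("localhost/2018-09-26")
  .setOutputMode("REPLACE")             // overwrite any data from earlier attempts in this partition
  .setAllBins(true)                     // emit a row for every (time bin, Sym), even when no trades occurred
  .addLastColumns("Price")              // last trade price in each bin
  .addSumColumns("Size")                // total size traded in each bin
  .setMaintainStateColumns("Price")     // carry the last price forward through empty bins
  .build()
  .run()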

Import from Command Line

Downsample imports from the command line work in a very similar way to the scripted imports. The primary difference is that since the source table is specified by name (instead of a Table object), you must also supply a source partition (i.e., Date) from which to import.

The syntax for running downsample imports from the command line is described below. You may also issue the import command without arguments to see them described in the console.

Command Reference

iris_exec downsample_import <launch args> -- <downsample import args>

Downsample Import Arguments

  • -dd or --destinationDirectory <path>
  • -dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
  • -pc or --intradayPartition <partition column name>
Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s):
  1. Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  2. Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  3. Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)
  • -ns or --namespace <namespace>: (Required) Namespace in which to find the target table.
  • -tn or --tableName <name>: (Required) Name of the target table.
  • -om or --outputMode <import behavior>: (Optional)
    • SAFE (default) - checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted.
    • REPLACE - should be used when developing an import process, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort.
    • APPEND - should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.
  • -rc or --relaxedChecking <TRUE or FALSE>: (Optional) Defaults to FALSE. Presently ignored by the downsampler.
  • -sn or --sourceNamespace <source namespace>: (Required) Namespace in which to find the source table.
  • -st or --sourceTable <source table>: (Required) Name of the source table.
  • -tc or --timestampColumn <timestamp column>: Name of the column in the source table to use for time sampling. The output time bin column will be the same as this unless it is renamed by specifying this as a pair (i.e., "TimeBin=Timestamp", where Timestamp is the input column and TimeBin is the output column).
  • -p or --period <period>: (Required) Time interval for sampling, specified in HH:MM:SS format.
  • -kc or --keyColumns <key columns>: (Optional) One or more comma-delimited "key columns", used to group the data. There will be one row in the output for each distinct key value per time interval.
  • -sp or --sourcePartition <source partition>: (Optional) Source column partition to downsample, typically a date. This is optional but highly recommended for partitioned tables.
  • -timeBinMode <LOWER or UPPER>: (Optional) Must be LOWER or UPPER; the default is UPPER. This indicates how to generate the time bins from source timestamps. LOWER maps a timestamp to the nearest previous time that falls on an even time interval; UPPER maps it to the next.
  • -allBins <true or false>: (Optional) Must be true or false; the default is false. Indicates whether or not to generate all time bins over the interval covered by the source table. If false, only intervals for which source data exists will be generated.
  • -maintainStateColumns <maintain state columns>: (Optional) A comma-delimited subset of one or more output columns that should "maintain state", i.e., keep the same value from one time interval to the next even if no new data is observed. Typically used with -allBins true.
  • -byStrategy <by strategy>: (Optional) The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
  • -ajStrategy <aj strategy>: (Optional) The Deephaven join strategy to use when joining source data to time bins (applicable only when -allBins is true).
  • -naturalJoinStrategy <natural join strategy>: (Optional) The Deephaven join strategy to use when filtering time bins.
  • -calendar <calendar>: (Optional) A Deephaven calendar name to use for filtering output time bins.
  • -last <last columns>: (Optional) One or more comma-delimited column names, to be aggregated by a lastBy operation from the source data.
  • -first <first columns>: (Optional) One or more comma-delimited column names, to be aggregated by a firstBy operation from the source data.
  • -sum <sum columns>: (Optional) One or more comma-delimited column names, to be aggregated by a sumBy operation from the source data.
  • -min <min columns>: (Optional) One or more comma-delimited column names, to be aggregated by a minBy operation from the source data.
  • -max <max columns>: (Optional) One or more comma-delimited column names, to be aggregated by a maxBy operation from the source data.
  • -std <standard deviation columns>: (Optional) One or more comma-delimited column names, to be aggregated by a stdBy operation from the source data.
  • -var <variance columns>: (Optional) One or more comma-delimited column names, to be aggregated by a varBy operation from the source data.
  • -avg <average columns>: (Optional) One or more comma-delimited column names, to be aggregated by an avgBy operation from the source data.
  • -array <array columns>: (Optional) One or more comma-delimited column names, to be aggregated by an arrayBy operation from the source data.
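
The command-line equivalent of the one-minute trade-bar sketch from the builder section might look like the following (again with hypothetical namespace, table, and column names):

sudo -u dbmerge iris_exec downsample_import -- \
-ns ExampleNamespace -tn EquityTradeBars \
-sn ExampleNamespace -st EquityTradeTable -sp 2018-09-26 \
-tc Timestamp -p 00:01:00 -kc Sym \
-last Price -sum Size \
-allBins true -maintainStateColumns Price \
-om REPLACE \
-dp localhost/2018-09-26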