Downsample Imports
Downsample imports are used to import a downsampled version of data from an existing time-series table into an intraday partition. The target table must already exist. Unlike other styles of import, Deephaven provides no "schema creator" tool for this one, as the target table typically takes a form very similar or identical to the source table. The downsampler query reads data from the source table and writes the downsampled result into the target intraday partition.
Import Query
Note
The "Import - Downsample" option is only available in Deephaven Classic.
When Import - Downsample is selected, the Persistent Query Configuration Editor window shows the following options:
- To proceed with creating a query to import downsampled data, you will need to select a DB Server and enter the desired value for Memory (Heap) Usage (GB).
- Options available in the Show Advanced Options section of the panel are typically not used when importing or merging data. To learn more about this section, please refer to the Persistent Query Configuration Viewer/Editor.
- The Access Control tab presents a panel with the same options as all other configuration types, and gives the query owner the ability to authorize Admin and Viewer Groups for this query. For more information, please refer to Access Control.
- Clicking the Scheduling tab presents a panel with the same scheduling options as all other configuration types. For more information, please refer to Scheduling.
- Clicking the Downsample Import Settings tab presents a panel with the relevant options:
Downsample Import Settings
- Target Namespace: This is the namespace into which you want to import the data.
- Target Table: This is the table into which you want to import the data. The schema of this table must be consistent with the output of the downsampling process (as specified by the other options on this panel).
- Output Mode: This determines what happens if data already exists in the fully-specified partition. The fully-specified partition includes both the internal partition (unique for the import job) and the column partition (usually the date).
  - Safe - if existing data is found in the fully-specified partition, the import job will fail.
  - Append - if existing data is found in the fully-specified partition, new data will be appended to it.
  - Replace - if existing data is found in the fully-specified partition, it will be replaced. This does not replace all data for a column partition value, just the data in the fully-specified partition.
  - Strict Import - has no meaning (at present) for this type of import.
- Downsample Parameters:
  - By Strategy - The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
  - AJ Strategy - The Deephaven join strategy to use when joining source data to time bins (applicable only when `setAllBins(true)`).
  - Natural Join Strategy - The Deephaven join strategy to use when filtering time bins.
  - Number Threads - The number of threads to use when downsampling. For large downsampling jobs where "generate all time bins" is true, this can improve performance by parallelizing the downsample time-join on a per-key basis.
- Partition Formula: Specifies which source column partition to import. This style of import supports only single-partition imports. A typical value for this might be `currentDateNy()`.
- Source Namespace: Namespace of the table that will be downsampled.
- Source Table: Name of the table to downsample. This table must contain the timestamp and key columns specified in this panel. For performance reasons, it is recommended that large source tables be grouped by the key column(s).
- Timestamp Column: Name of the column in the source table to use for generating time bins. Must be of type `DBDateTime`.
- Period: Interval to use for generating time bins, in HH:MM:SS format. Examples: 00:05:00 for 5-minute bins, 00:01:00 for 1-minute bins.
- Calendar: The name of a calendar to use for filtering time stamps.
- Time Bin Mode: How to assign timestamps in the source data to time bins. Given the time 01:32:34 and a period of 5 minutes, `UPPER` would assign this time to the 01:35:00 bin, while `LOWER` would assign it to the 01:30:00 bin. `UPPER` is the more typical choice (see the sketch following this list).
- Generate All Time Bins: Whether or not to generate time bins for the entire time range spanned by the underlying data, even if no data falls into a bin. Setting this to `true` requires one or more joins and hence will cause the downsampling process to be significantly slower.
- Key Columns: Optional columns to use for grouping the source data. The downsampling will produce a row for each (time bin, key) combination. For financial series, the instrument identifier is a typical choice.
- Aggregate Columns: Comma-separated list of aggregate/data column names desired in the downsampler output. These may match columns in the source table, or rename them using the `OutputColumn=OriginalColumn` format. Typically the latter would be used for additional aggregates of source columns (for example, `Volume=Size` used in combination with a Sum aggregate on trade data).
- Aggregates: Comma-separated list of aggregates to use for each column specified in the Aggregate Columns field. The number of aggregates must exactly match the number of aggregate columns. Legal values for aggregates are: Last, First, Sum, Min, Max, Avg, Std, Var, and Array (Count is not available).
- Maintain State Columns: Comma-separated list of columns for which to maintain state from one time bin to the next if no new data arrives when generating all time bins. Typically this would be used for Last aggregate columns where the value is "good until changed" (such as Bid and Ask in quote data).
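To make the `UPPER`/`LOWER` distinction above concrete, here is a small sketch of the bin arithmetic in plain Groovy (not the Deephaven API), using seconds since midnight for simplicity; how exact-boundary timestamps are assigned is an assumption of this sketch:

```groovy
// Plain-Groovy illustration of time-bin assignment (not the Deephaven API).
int period = 5 * 60                                 // 00:05:00 -> 300 seconds
int t = 1 * 3600 + 32 * 60 + 34                     // 01:32:34 -> 5554 seconds
int lower = t.intdiv(period) * period               // 5400 -> 01:30:00 (LOWER)
int upper = (t % period == 0) ? t : lower + period  // 5700 -> 01:35:00 (UPPER)
```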
Import using Builder
Downsample imports may be performed from inside a Deephaven Groovy or Python script. This permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the `iris_exec run_local_script` tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).
The `DownsampleImport` class provides a static builder method, which produces an object used to set parameters for the import. The builder returns a `DownsampleImport` object from the `build()` method. Imports are executed via the `run()` method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The pattern when executing a downsample import follows:
```
nRows = DownsampleImport.builder(<db>, <namespace>, <table>, <timestamp column>, <period>, <key column>...)
    .set<option>(<option value>)
    ...
    .build()
    .run()
```
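For example, the following is a minimal sketch of a downsample import in Groovy. The `MarketUs` namespace, the `Trade` source table, the `TradeBars5Min` target table, and all column names are hypothetical; the target table must already exist with a schema matching the downsampler output.

```groovy
// Minimal sketch (hypothetical namespace, tables, and columns).
// Downsample one day of trades into 5-minute bars keyed by Sym.
source = db.t("MarketUs", "Trade").where("Date=`2018-05-01`")

nRows = DownsampleImport.builder(db, "MarketUs", "TradeBars5Min", "Timestamp", "00:05:00", "Sym")
    .setSourceTable(source)
    .setDestinationPartitions("localhost/2018-05-01")
    .addLastColumns("Last=Price")   // last trade price per bin, renamed to Last
    .addSumColumns("Volume=Size")   // total traded size per bin, renamed to Volume
    .build()
    .run()

println("Imported " + nRows + " rows")
```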
Required Builder Parameters
| Parameter | Type | Req? | Default | Description |
|---|---|---|---|---|
| `db` | Database | Yes | N/A | The Deephaven database (typically `db` in a Python/Groovy script). |
| `namespace` | String | Yes | N/A | The target table namespace. |
| `table` | String | Yes | N/A | The target table name. |
| timestamp column | String | Yes | N/A | Name of the time column in the source table used for time sampling. The output time bin column will have the same name unless it is renamed by specifying this parameter as a pair (e.g., `"TimeBin=Timestamp"`, where `Timestamp` is the input column and `TimeBin` is the output column). |
| `period` | String | Yes | N/A | Time interval for sampling, in HH:MM:SS format. |
| key column(s) | String(s) | Yes | N/A | Zero or more key columns, used to group the data. There will be one row in the output for each distinct key value per time interval. |
Import Options
| Option Setter | Type | Req? | Default | Description |
|---|---|---|---|---|
| `setDestinationPartitions` | String | No* | N/A | The destination partition(s); e.g., `"localhost/2018-05-01"` for a single-partition import or `"localhost"` for a multi-partition import. |
| `setDestinationDirectory` | File or String | No* | N/A | The destination directory; e.g., `/tmp/mylocation`. |
| `setOutputMode` | ImportOutputMode or String | No | SAFE | Enumeration with the following options: `SAFE`, `REPLACE`, `APPEND` (see Output Mode above). |
| `setStrict` | boolean | No | true | If `false`, will allow target columns that are missing from the source to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to `false` only when developing the import process for a new data source. |
| `setPartitionColumn` | String | No | N/A | Column name used to choose the partition into which each source row is imported. |
| `setSourceTable` | Table | Yes | N/A | Source table to downsample. |
| `setTimeBinMode` | String or Downsampler.TimeBinMode | No | UPPER | Must be `LOWER` or `UPPER`. This indicates how to generate the time bins from source timestamps. `LOWER` indicates to map a timestamp to the first previous time that falls on an even time interval, `UPPER` to the next. |
| `setAllBins` | boolean | No | false | Indicates whether or not to generate all time bins over the interval covered by the source table. If `false`, only intervals for which source data exists will be generated. |
| `setMaintainStateColumns` | String... | No | N/A | Indicates a subset of one or more output columns that should "maintain state", or maintain the same value from one time interval to the next even if no new data is observed. Typically used with `setAllBins(true)`. |
| `setByStrategy` | String or Table.ByStrategy | No | DEFAULT | The Deephaven "byStrategy" to use when performing the downsample aggregate operations. |
| `setAjStrategy` | String or Table.JoinStrategy | No | DEFAULT | The Deephaven join strategy to use when joining source data to time bins (applicable only when `setAllBins(true)`). |
| `setNaturalJoinStrategy` | String or Table.JoinStrategy | No | DEFAULT | The Deephaven join strategy to use when filtering time bins. |
| `setCalendar` | String | No | N/A | A Deephaven calendar to use for filtering output time bins. |
| `setNumThreads` | int | No | 1 | Maximum number of threads to use when downsampling. Downsampling can be parallelized to some extent if key columns are present. |
| `setLogger` | Logger | No | Default logger | Deephaven logger object for logging progress. |
| `addAggregates` | ComboAggregateFactory.ComboBy... | No | N/A | Add the given aggregates to the downsampling process for generating output data columns. Typically it is easier to apply the methods for specific operations (`addLastColumns`, etc.). |
| `addAggregate` | AggType, String | No | N/A | Add the given aggregate to the downsampling process for generating an output data column. Typically it is easier to apply the methods for specific operations (`addLastColumns`, etc.). |
| `addLastColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `lastBy` operation from the source data. |
| `addFirstColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `firstBy` operation from the source data. |
| `addMinColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `minBy` operation from the source data. |
| `addMaxColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `maxBy` operation from the source data. |
| `addSumColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `sumBy` operation from the source data. |
| `addStdColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `stdBy` operation from the source data. |
| `addVarColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by a `varBy` operation from the source data. |
| `addAvgColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by an `avgBy` operation from the source data. |
| `addArrayColumns` | String... | No | N/A | Add the given columns to the downsample output, to be generated by an `arrayBy` operation from the source data. |
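Tying several of these options together, the following sketch (again with hypothetical namespace, table, and column names) generates a row for every one-minute bin and carries the last observed quote forward across bins with no ticks:

```groovy
// Sketch only: namespace, tables, and columns are hypothetical.
quotes = db.t("MarketUs", "Quote").where("Date=`2018-05-01`")

nRows = DownsampleImport.builder(db, "MarketUs", "QuoteBars1Min", "TimeBin=Timestamp", "00:01:00", "Sym")
    .setSourceTable(quotes)
    .setDestinationPartitions("localhost/2018-05-01")
    .setAllBins(true)                       // emit every bin, even with no ticks
    .setNumThreads(4)                       // parallelize the time-join per key
    .addLastColumns("Bid", "Ask")           // last observed values per bin
    .setMaintainStateColumns("Bid", "Ask")  // carry "good until changed" values forward
    .build()
    .run()
```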