Managing Data from Different Regions/Time Zones
Deephaven's flexibility and scalability provide users with an array of choices when ingesting and querying time series data from different regions. There are three primary concerns to address when handling data from different time zones:
- Partitioning of data by date.
- Writing queries that access the appropriate partitions for each region.
- Determining acceptable maintenance windows.
The precise definition of a "region" is determined by each customer. In general, a region refers to a group of time zones whose data is handled by the same set of processes. In most cases, data from different time zones across the globe are produced by different sets of processes that restart daily and do not run across midnight in the time zone for which they log data. For example, an "Asia" region might consist of data from both Hong Kong and Tokyo, even though they are not in the exact same time zone.
To demonstrate the available options, consider a hypothetical system that must ingest trade data from three regions:
- North America — from 7:00am to 6:00pm EDT (11:00am UTC to 10:00pm UTC)
- Europe — from 7:00am to 6:00pm GMT (7:00am UTC to 6:00pm UTC)
- Asia — from 7:00am to 6:00pm HKT (11:00pm UTC on the previous date to 10:00am UTC)
Note
This document's code examples use the Legacy API, but the concepts also apply to Core+.
Date Partitions with Multiple Regions
Deephaven tables comprise multiple column partitions, where each partition typically contains data from a different date. When setting up tables that will contain data from different regions (i.e., different time zones), choices must be made about whether to isolate or aggregate data across regions, as well as which date to use when logging data for each region.
Performance Concerns
There is no single best choice for how to partition data: the optimal architecture depends on the size of the data, acceptable maintenance windows, and the existing systems that produce the data.
Whenever possible, Deephaven recommends that the partition value for a given dataset correspond to the local date of the data. This allows data to be filtered to a date range based solely on the partition name rather than an attribute of the data itself (such as a timestamp). Filtering by partition is much faster than filtering row by row: metadata indexing allows constant-time filtering when exactly matching on partition names (such as filtering to specific dates), and the time required for more complex partition filters grows with the cardinality of the set of partitions rather than the number of rows in the data itself.
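For example, the following sketch (the "MarketData" namespace and "Trades" table are hypothetical) contrasts a partition-based filter with a row-based timestamp filter:
// Partition-based filter: Date is the partitioning column, so this is
// resolved from partition metadata without scanning any rows
tradesByDate = db.t("MarketData", "Trades").where("Date=`2019-03-15`")
// Row-based filter: the predicate must be evaluated for every row,
// so its cost grows with the number of rows in the data
tradesByTime = db.t("MarketData", "Trades")
    .where("Timestamp >= '2019-03-15T00:00:00 UTC' && Timestamp < '2019-03-16T00:00:00 UTC'")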
The following section details several options for how datasets from different regions can be stored in Deephaven tables.
Option 1: One Table per Region
The most straightforward way to set up multiple regions is to log data from different regions to different tables. Because each region is handled separately, each table can be partitioned using dates from the time zone that best pertains to its data. Queries are responsible for accessing data from each region and for pulling the appropriate dates each time they are restarted. Historical data merges can run for each region as soon as the loggers for that region have stopped producing new data.
For the hypothetical system described above, three different tables would be generated:
- one for trades from North America (named "TradesNA", for example),
- one for trades from Europe ("TradesEurope"), and
- one for trades from Asia ("TradesAsia").
This approach benefits queries that are only interested in data from one region. While it introduces additional overhead when querying across regions (the tables must be combined with the merge() function), this overhead is negligible in the majority of cases.
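For example, in the following hedged sketch (the dates are illustrative), a query that starts shortly after the Asia feed has rolled to its next local date pulls a newer date for Asia than for Europe and North America:
// At 23:30 UTC on 2019-03-14, it is already 2019-03-15 in Hong Kong,
// while Europe and North America are still on 2019-03-14
tradesAsia = db.i("MarketData", "TradesAsia").where("Date=`2019-03-15`")
tradesEurope = db.i("MarketData", "TradesEurope").where("Date=`2019-03-14`")
tradesNA = db.i("MarketData", "TradesNA").where("Date=`2019-03-14`")
tradesAll = merge(tradesAsia, tradesEurope, tradesNA)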
Option 2: One Table, Partitioned by Local Time Zone
It is also possible to ingest data from different regions into a single table. Data from different regions can be logged using the local date in each region. The historical data merge cannot run until the data feeds have finished in all regions.
For example, under this partitioning scheme, if a system in Asia generates data on 2019-03-15 from 7:00am HKT to 6:00pm HKT (11:00pm UTC on 2019-03-14 to 10:00am UTC on 2019-03-15), all of the data from Asia would be logged to the 2019-03-15 partition.
Note that depending on the hours during which data is generated, choosing partitions this way may result in each date partition containing data for more than 24 hours. (If the Asia data feed started two hours earlier, at 5:00am HKT/9:00pm UTC, then the Asia feed for 2019-03-15 would start before the North America feed for 2019-03-14 had finished.)
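As a minimal sketch of how such a partition value might be chosen (this assumes the Legacy DBTimeUtils helpers convertDateTime and formatDate, and the TZ_HK time zone constant), the date can be derived from each event's timestamp in the region's local time zone:
// 11:30pm UTC on 2019-03-14 is 7:30am HKT on 2019-03-15
eventTime = convertDateTime("2019-03-14T23:30:00 UTC")
// Choose the partition using the local (Hong Kong) date
partitionDate = formatDate(eventTime, TZ_HK)    // "2019-03-15"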
Option 3: One Table, Partitioned with a Common Time Zone (e.g., UTC)
Another option is to ingest data from different regions into a single table, where partitioning dates are chosen using a single, consistent time zone. For example, when ingesting the hypothetical trade data described above for the 2019-03-15 trading day, data would be ingested to two partitions:
- Data from 7:00am HKT until 8:00am HKT (23:00 UTC on 2019-03-14 to 00:00 UTC on 2019-03-15) would be ingested to the partition for 2019-03-14.
- The rest of the data, from 8:00am HKT/00:00 UTC until the North America region finishes at 6:00pm EDT/22:00 UTC, would be ingested to the partition for 2019-03-15.
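Mirroring the Option 2 sketch above (again assuming the Legacy DBTimeUtils helpers), the same event's partition value would instead be derived in UTC, placing it on the previous date:
// The same event as in the Option 2 sketch
eventTime = convertDateTime("2019-03-14T23:30:00 UTC")
// Choose the partition using the UTC date instead of the local date
partitionDate = formatDate(eventTime, TZ_UTC)    // "2019-03-14"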
When partitioning a table in this way, the historical data merge can be run shortly after midnight UTC because no additional data will be logged for the previous date. (It should not run at midnight exactly, because it will take a moment to ingest any data logged immediately before midnight.)
Since data for each local date is stored under multiple column partitions, querying this data for a specific date would require filtering on both date (to select the two column partitions under which data for one date is stored) and time (to include only data from one day, rather than all the data under both partitions).
Note: Because this scheme divorces partition names from the local date of a given data point, it is recommended to avoid it when possible, particularly with very large datasets such as market quotes or trades.
Querying Data Across Regions
There are two concerns when writing queries accessing data for multiple regions:
- accessing the correct date partition for each region, and
- ensuring consistency when persistent queries are restarted.
When data from different regions is stored in different tables, the tables for each region should be queried separately, then merged into one table:
tradeDate = "2019-03-15"
tradesAsia = db.i("MarketData", "TradesAsia").where("Date=tradeDate")
tradesEurope = db.i("MarketData", "TradesEurope").where("Date=tradeDate")
tradesNA = db.i("MarketData", "TradesNA").where("Date=tradeDate")
tradesAll = merge(tradesAsia, tradesEurope, tradesNA)
If a single table is used for all regions, using the local date for partitioning, then queries will only need to access a single table:
tradeDate = "2019-03-15"
tradesAll = db.i("MarketData", "Trades").where("Date=tradeDate")
If data is partitioned using a consistent time zone, such as UTC, then a filter on both date and time may be required, since the Asia feed spans two UTC dates:
dateFilter = "Date in `2019-03-14`, `2019-03-15`"
// Build DBDateTime bounds so inRange can compare against the Timestamp column
startTime = convertDateTime("2019-03-14T23:00:00 UTC")
endTime = convertDateTime("2019-03-15T22:00:00 UTC")
timeFilter = "inRange(Timestamp, startTime, endTime)"
tradesAll = db.i("MarketData", "Trades").where(dateFilter, timeFilter)
Ensuring Consistency Across Query Restarts
Persistent queries that are scheduled to run daily (such as those supporting trading strategies or dashboard components) are restarted each day to pick up data for the next date. They should be written so that they retrieve the same data if restarted at any point during their scheduled hours.
For example, a persistent query using the hypothetical trade data described above might be scheduled to run from 23:00 UTC until 22:00 UTC the next day (i.e., from when the Asia data feed starts until the North America data feed stops). Accordingly, such a query should pull the same data if it is restarted at any time between its scheduled start and end times.
The following code demonstrates how a query can use the built-in London Stock Exchange calendar (which uses Coordinated Universal Time) to force a query to roll over to the next date only when initialized at or after 23:00 UTC:
// Get the current time
queryInitTime = currentTime()
// Check whether it is past 11pm UTC
// If so, roll over to the next date
// If not, use the current date
isLocalDateAheadOfCurrentDate = hourOfDay(queryInitTime, TZ_UTC) >= 23
if (isLocalDateAheadOfCurrentDate) {
    queryDate = calendar("LSE").nextDay()
} else {
    queryDate = calendar("LSE").currentDay()
}
// Get the correct date for "yesterday"
queryDateYest = calendar("LSE").previousDay(queryDate)
// Retrieve data from the appropriate dates: db.t reads historical
// (merged) data for yesterday, while db.i reads today's intraday data
tradesYest = db.t("MarketData", "Trades").where("Date=queryDateYest")
tradesToday = db.i("MarketData", "Trades").where("Date=queryDate")
Maintenance Windows
In addition to restarting queries daily to roll over the date, Deephaven recommends restarting the following processes at least once weekly (e.g., on a Saturday):
- Data Import Server -
sudo monit restart db_dis
- Local Table Data Server -
sudo monit restart db_ltds
The restart will only take a minute, but queries that depend on intraday data may fail during this window. See Operating Deephaven 24X6 for additional information on maintenance windows.