Deephaven Data Tailer

The Deephaven Data Tailer is a powerful and flexible service for automating the ingestion of real-time data into your Deephaven environment. Acting as a bridge between file-based data sources and the Data Import Server (DIS), the Tailer continuously monitors directories for new or updated files—such as logs or CSVs—produced by applications, services, or other Deephaven processes. As soon as new data is detected, the Tailer efficiently reads and forwards it for immediate intraday storage and analysis, ensuring your tables are always up to date.

The Tailer is highly configurable: you can run multiple Tailers in parallel, each with its own rules for which files to process and where to route the data. This enables scalable, robust data pipelines that adapt to a variety of operational needs, from simple single-table setups to complex, multi-source environments. Typical deployments run Tailers on the same nodes where data is produced, minimizing latency and maximizing throughput.

Key features include:

  • Automated file discovery and ingestion for both Deephaven binary log files and CSV files
  • Support for partitioned data, enabling high-frequency and time-series workflows
  • Flexible configuration for routing, filtering, and managing multiple data streams
  • Seamless integration with Deephaven's internal partitioning and routing infrastructure


Each Tailer instance is uniquely identified—often by a name or numeric ID set at startup—and can be independently managed or restarted. For most users, the default Tailer configuration will be sufficient, but advanced scenarios are supported through custom configuration files and properties.

Tailer configuration consists of three parts: process-level settings, table/log definitions, and destination (routing) configuration. For a full breakdown of these components, see the Tailer configuration section below.

Note

In most cases, you will not need to run additional Tailers or change the Tailer configuration. The data routing configuration defines which DISs handle a given table, so no Tailer configuration changes are needed.

The default configuration for Deephaven includes a Tailer that operates on every Deephaven node. This Tailer identifies and monitors all files that follow the standard Deephaven filename format, which are located in the standard directories (including /var/log/deephaven/binlogs). You can add additional file locations by using the log.tailer.defaultDirectories or log.tailer.additionalDirectories properties.
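For example, to have the default Tailer also watch a custom application's log directory, a property such as the following could be added (the path shown is hypothetical):

log.tailer.additionalDirectories=/var/log/myapp/binlogs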

Intraday Partitioning

Deephaven partitions all intraday data into separate directories, effectively creating multiple sets of data files distributed across these directories. When querying intraday data or merging it into the historical database, all the intraday partitioning directories are combined into a single table. In most cases, users are not directly aware of this partitioning; however, configuring the Tailer correctly requires a thorough understanding of this structure.

Deephaven is highly optimized for appending data to tables. To facilitate this, it is essential that only one data source writes to a specific set of underlying intraday data files at any given time. This is one of the primary reasons for implementing intraday partitioning. Since most real-time data sources utilize Deephaven loggers to generate data, and this data is then processed through the Tailer, it is the Tailer that determines the appropriate internal partition directories for appending the intraday data.

The Tailer establishes two levels of partitioning when sending data to the Data Import Server (DIS). When the Tailer is configured to set up both levels correctly, it ensures that the data is appropriately separated. The DIS will create distinct directories at each level of partitioning, and the lowest-level subdirectories will each contain a set of data files for the table.

  • Internal partitions are the first level of partitioning and typically correspond to a data source. The configuration of the Tailer usually determines the value of the internal partition based on the name of a binary log file, possibly incorporating additional information from its configuration for further differentiation. The schema does not define internal partitions. In some Deephaven tools, such as importers, the internal partition is also called the destination partition.
  • Column partitions divide the data based on the partitioning column specified in the table's schema, a String column that frequently contains a date value. The Tailer determines the column partition value based on a log file name. For date-based column partitioning, the value is usually in yyyy-MM-dd format (e.g., 2017-04-21).

Binary log filename format

By default, the Tailer automatically monitors binary log directories for new files, looking for new partitions and processing them appropriately. To support these file searches, binary log filenames must be in a standard format, with several parts separated by periods. Filenames in other formats are supported, but require additional configuration (explained later).

The default filename format is <Namespace>.<Table Name>.<System or User>.<Internal Partition>.<Column Partition>.bin.<Timestamp>, which has the following components separated by periods:

  • The table's namespace.
  • The table's name.
  • The table's namespace set (System or User).
  • The internal partition value.
  • The column partition value (usually a date).
  • .bin. is the expected separator for binary log files.
  • The date and time the file was created. This enables file sorting.

The following example from a Deephaven internal table illustrates a typical filename:

DbInternal.AuditEventLog.System.vmhost1.2018-07-23.bin.2018-07-23.135119.072-0600
  • DbInternal - the table's namespace.
  • AuditEventLog - the table's name.
  • System - the table type; System indicates a table in a system namespace, while User would indicate a user namespace.
  • vmhost1 - the internal partition used to distinguish a data source, usually a host name.
  • 2018-07-23 - the column partition value. This is frequently a date in yyyy-MM-dd format (or another specified format that includes the same fields), but any partitioning method may be used as long as the filename can be used to distinguish the partition.
  • .bin. - the file identifier that distinguishes this as a Deephaven binary log file. The standard identifier is .bin..
  • 2018-07-23.135119.072-0600 - a timestamp to process the files in order. The standard format used in this example is a timestamp down to the millisecond in yyyy-MM-dd.HHmmss.SSS format, followed by timezone offset. The timezone offset is important to correctly sort files during daylight-savings transitions.

Bandwidth Throttling

The Tailer configuration provides the option to throttle the throughput of data sent to specific destinations, or the throughput sent for a specific log entry (per destination). This should only be used if network bandwidth is limited or issues have been encountered due to unexpected spikes in the amount of data being processed by the Tailer. Because it restricts how quickly the Tailer will process binary log files, use of this feature can cause Deephaven to fall behind in getting real-time data into the intraday database. Throttling is optional and will not be applied unless specified.

Two types of throttling are supported:

  • Destination-level throttling is applied to each destination (Data Import Server), and is specified in the data routing configuration for a given data import server.
  • Log-entry-level throttling is applied to each destination for each log entry in the XML file and across all partitions for that log entry. It is specified with the maxKBps attribute in a log entry.

Throttles are always specified in kilobytes per second (KBps). Because the binary log files are sent directly to the Data Import Servers without translation, the size of the binary log files can give an indication of the required bandwidth for a given table. For example, if a particular table's binary log files are sized at 1GB each hour, then an approximate bandwidth requirement could be calculated as follows:

  • Since throttles are specified in KBps, first translate GB to KB: 1GB = 1,024MB = 1,048,576 KB.
  • Next, divide the KB per hour by the number of seconds in an hour: 1,048,576 KB per hour / 3,600 seconds per hour = 291.3KB per second.
  • In this case, setting a 300 KBps throttle would ensure that each hour's data could be logged, but latency will occur when data delivery to the log file spikes above the specified 300 KB/s throttle rate.
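As a sketch, a per-log throttle at that rate could be applied with the maxKBps attribute on the table's Log entry in the Tailer's XML configuration (the namespace shown is hypothetical):

<Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
     namespace="MarketData"
     maxKBps="300" />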

Throttling uses a token bucket algorithm to provide an upper bandwidth limit and smoothing of peaks over time. The token bucket algorithm requires, at a minimum, enough tokens for the largest messages Deephaven can send; this is calculated as the number of seconds (see the log.tailer.bucketCapacitySeconds property) times the KBps for the throttle. Currently, Deephaven sends messages of at most 2,048KB (2MB), so the number of seconds in the bucket times the bandwidth per second must equal or exceed 2,048KB. The software will detect if this is not true and throw an exception that prevents the Tailer from starting.

The bucket capacity also impacts how much bandwidth usage can spike, as the entire bucket's capacity can be used at one time.

  • If log.tailer.bucketCapacitySeconds is 10, and a throttle allows 10 KBps, then the maximum bucket capacity is 100KB, which is not sufficient for Deephaven; the bucket would never allow the maximum 2,048KB message through. Deephaven will detect this and fail.
  • If log.tailer.bucketCapacitySeconds is 60, and a throttle allows 1,000 KBps, then the bucket's capacity is 60,000KB. If a large amount of data were quickly written to the binary log file, the Tailer would immediately send the first 60,000KB of data to the DIS. After that, the usage would level out at 1,000 KBps. The bucket would gradually refill to its maximum capacity once the data arrival rate dropped below 1,000 KBps, until the next burst.
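Putting the two rules together for the hypothetical 300 KBps throttle above, the minimum and default bucket capacities work out as follows:

2,048 KB / 300 KBps = 6.83 seconds   (minimum usable log.tailer.bucketCapacitySeconds, so at least 7)
30 seconds * 300 KBps = 9,000 KB     (bucket capacity at the default setting of 30)

The default of 30 seconds comfortably exceeds the 2,048KB floor, but also allows bursts of up to 9,000KB before the throttle levels usage out.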

Binary Log File Managers

The Tailer must be able to perform various operations related to binary logs.

  • It must derive namespace, table name, internal partition, column partition, and table type values from binary log filenames and the specified configuration log entry.
  • It must determine the full path and file prefix for each file pattern that matches what it is configured to search for.
  • It must sort binary log files to determine the order in which they will be sent.

These operations are handled by Java classes called Binary Log File Managers. Two Binary Log File Managers are provided.

  • StandardBinaryLogFileManager - this provides functionality for binary log files in the standard format described above (<namespace>.<table name>.<table type>.<internal partition>.<column partition>.bin.<date-time>)
  • DateBinaryLogFileManager - this provides legacy functionality for binary log files using date-partitioning in various formats.

The default Binary Log File Manager specified by the log.tailer.defaultFileManager property is used unless the fileManager XML attribute is included.

For the Tailer to function correctly, file names for each log must be sortable. Both Binary Log File Manager types will take into account time zones for file sorting (including daylight savings time transition), assuming the filenames end in the timezone offset (+/-HHmm, for example "-0500" to represent Eastern Standard Time). The standard Deephaven Java logging infrastructure ensures correctly-named files with full timestamps and time zones. Note that the DateBinaryLogFileManager does not consider time zones in determining the date as a column partition value, only in determining the order of generated files; the column partition's date value comes from the filename and is not adjusted.

Tailer configuration

Deephaven property files, Tailer-specific XML files, and data routing configuration control the Tailer's behavior.

  1. Property definitions – Tailer behavior that is not specific to tables and logs is controlled by property definitions in Deephaven property files.
  2. Table and Log definitions – detailed definitions of the binary logs containing intraday data and the tables to which those logs apply are contained in XML configuration files. Each Log entry in these XML files corresponds to a single Binary Log File Manager instance.
  3. Destination (routing) configuration – specifies where the Tailer should send data. This is configured in the data routing configuration, which determines how tables are routed to Data Import Servers (DISs) and other destinations.

Property file configuration parameters

This section details all the Tailer-related properties. Properties may be specified in property files, or passed in as Java command line arguments (e.g., -Dlog.tailer.processes=db_internal).

Configuration file specification

These properties determine the processing of XML configuration files, which are read during Tailer startup to build the list of Logs handled by the Tailer.

  • log.tailer.configs - A comma-delimited list of XML configuration files defining the tables being sent. The classpath is searched to locate the named files. Example: log.tailer.configs=tailerConfigDbInternal.xml (default location: /usr/illumon/latest/etc/tailerConfigDbInternal.xml)
  • log.tailer.processes - A comma-delimited list of processes for which this Tailer will send data. Each process should correspond to a <Process> tag in the XML configuration files. If the list is empty or not provided, the Tailer will run but not handle any binary logs. If it includes an entry with a single asterisk (*), the XML entries for all processes are used. Example: log.tailer.processes=db_internal,customer_logger_1

Tailer properties

These properties specify the name and runtime attributes of the Tailer:

  • intraday.tailerID - Specifies the Tailer ID (name) for this Tailer. Usually set by startup scripts on the Java command line rather than in a property file. Example: intraday.tailerID=customer1
  • log.tailer.enabled.<tailer ID> - If true, the Tailer will start normally; if false, the Tailer will run without tailing any files. Example: log.tailer.enabled.customer1=true
  • log.tailer.bucketCapacitySeconds - The capacity in seconds for each bucket used to restrict bandwidth to the Data Import Servers. Applied independently to every throttle. Defaults to 30 seconds if not provided. Example: log.tailer.bucketCapacitySeconds=120
  • log.tailer.retry.count - (Optional) How many times each destination thread will attempt to reconnect after a failure. Default is Integer.MAX_VALUE. A value of 0 means only one connection attempt will be made.
  • log.tailer.retry.pause - The pause, in milliseconds, between reconnection attempts after a failure. Default is 1000 ms (1 second).
  • log.tailer.poll.pause - The pause, in milliseconds, between polls for new data in existing log files, or for new log files when no new data was found. Lower values reduce latency but increase processing overhead. Default is 100 ms (0.1 second).
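As an illustrative sketch, a property stanza combining these settings for a lower-latency Tailer might look like the following (the Tailer ID and values are hypothetical, not recommendations):

intraday.tailerID=customer1
log.tailer.enabled.customer1=true
log.tailer.retry.count=10
log.tailer.retry.pause=2000
log.tailer.poll.pause=50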

File watch properties

The Tailer uses underlying infrastructure to watch for new files, which are then presented to its Binary Log File Manager instances for possible tailing.

Caution

The logic used to watch for these files can be changed. However, the log.tailer.watchServiceType property should only be changed under advice from Deephaven Data Labs.

  • log.tailer.watchServiceType - Specifies the watch service implementation to use. Options: JavaWatchService (efficient, but not suitable for NFS) or PollWatchService (works everywhere, but less efficient).

Memory properties

The following properties control the memory consumption of the Tailer and the Data Import Server.

  • DataContent.producersUseDirectBuffers - If true, the Tailer allocates direct memory for its data buffers. If changed, adjust the JVM arguments in hostconfig. Default: true; example JVM arguments: -j -Xmx2g -j -XX:MaxDirectMemorySize=256m
  • DataContent.consumersUseDirectBuffers - If true, the Data Import Server uses direct memory for its data buffers. Default: false
  • BinaryStoreMaxEntrySize - The maximum size in bytes for a single data row in a binary log file. Affects buffer sizes. Default: 1024 * 1024
  • DataContent.producerBufferSize.user - The buffer size in bytes for each User table data stream. Default: 256 * 1024
  • DataContent.producerBufferSize - The buffer size in bytes for each System table data stream. Default: 256 * 1024
  • DataContent.consumerBufferSize - The buffer size in bytes for the Data Import Server. Must be large enough for a producer buffer plus a full binary row. Default: 2 * max(DataContent.producerBufferSize, BinaryStoreMaxEntrySize + 4)
  • DataContent.userPoolCapacity - The maximum number of user table locations processed concurrently. Default: 128
  • DataContent.systemPoolCapacity - The maximum number of system table locations processed concurrently. Default: 128
  • DataContent.disableUserPool - If true, user table locations are processed without a constrained pool. Default: false
  • DataContent.disableSystemPool - If true, system table locations are processed without a constrained pool. Default: false

Note

The Tailer allocates two pools of buffers, one for user tables and one for system tables. Each item in that pool requires two buffers for concurrency, so the memory required will be double the buffer size times the pool capacity. Total memory required for the Tailer is approximately 2 * (DataContent.producerBufferSize * DataContent.systemPoolCapacity + DataContent.producerBufferSize.user * DataContent.userPoolCapacity).
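As a worked example using the defaults above (256 KB producer buffers for both pools, and pool capacities of 128 each), the approximate Tailer buffer memory is:

2 * (256 KB * 128 + 256 KB * 128) = 2 * (32 MB + 32 MB) = 128 MB

Size the Tailer's heap (or direct memory, if DataContent.producersUseDirectBuffers is true) with at least this much headroom.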

Miscellaneous properties

The following additional properties control other aspects of the Tailer's behavior.

  • log.tailer.defaultFileManager - Specifies which Binary Log File Manager is used for a log unless the log's XML configuration entry specifies otherwise. Example: log.tailer.defaultFileManager=com.illumon.iris.logfilemanager.StandardBinaryLogFileManager
  • log.tailer.defaultIdleTime - If the Tailer sends no data for a namespace/table/internal partition/column partition combination for a certain amount of time (the "idle time"), it terminates the associated threads and stops monitoring that file for changes. Format: HH:mm or HH:mm:ss. Usually set to 01:10.
  • log.tailer.defaultDirectories - Optional comma-delimited list of directories where this Tailer should look for log files. If not specified, defaults are taken from iris-defaults.prop. Example: log.tailer.defaultDirectories=/Users/app/code/bin1/logs,/Users/app2/data/logs
  • log.tailer.logDetailedFileInfo - Whether the Tailer logs details on every file every time it looks for data. Default is false (only logs when new files are found). Example: log.tailer.logDetailedFileInfo=false
  • log.tailer.additionalDirectories - Optional comma-delimited list of directories appended to those in log.tailer.defaultDirectories. Use this to specify additional directories. Example: log.tailer.additionalDirectories=/Users/app/code/bin1/logs,/Users/app2/data/logs
  • log.tailer.logBytesSent - Whether the Tailer logs information every time it sends data to a destination. Default is true. Example: log.tailer.logBytesSent=true
  • log.tailer.startupLookbackTime - The amount of time (format HH:mm or HH:mm:ss) the Tailer should look back for files when starting. Used to ensure all data is sent after restarts. Usually set to 01:10.
  • log.tailer.fileCleanup.enabled - Whether binary log files are deleted after all data has been sent to all destinations. Default is true for Kubernetes, otherwise false. Example: log.tailer.fileCleanup.enabled=true
  • log.tailer.fileCleanup.deleteAfterMinutes - The number of minutes after a tailed log file was last modified before it is deleted, if log.tailer.fileCleanup.enabled=true. Defaults to 240. Example: log.tailer.fileCleanup.deleteAfterMinutes=30
  • log.tailer.completedFileLog - The full path of a CSV file listing binary files that have been fully processed and may be deleted. The CSV file has two columns: LogTimestamp (ISO 8601) and Filename (absolute path). Example: log.tailer.completedFileLog=/var/log/deephaven/tailer/completed-logs.txt

Date Binary Log File Manager parameters

These properties apply only to logs handled by the DateBinaryLogFileManager.

  • log.tailer.timezone - Optional property specifying the time zone for log processing. If not specified, the server's time zone is used. Used in calculating path patterns for date-partitioned logs and tables. Example: log.tailer.timezone=TZ_NY
  • log.tailer.internalPartition.prefix - The prefix for the Tailer's internal partition, used by the DateBinaryLogFileManager to determine the start of the internal partition name. If not provided, the Tailer uses the server name or defaults to localhost. Example: log.tailer.internalPartition.prefix="source1"

Tailer XML configuration files

Details of the logs to be tailed are supplied in XML configuration files.

Configuration file structure

Each Tailer XML configuration file must start with the <Processes> root element. Under this element, one or more <Process> elements should be defined. All <Process> elements in all configuration files with the same name will be combined.

<Process> elements define processes that run on a server, each with a list of <Log> entries which define the Binary Log File Manager details for the Tailer. Each <Process> element has a name that identifies a group of logs. The log.tailer.processes property determines which <Process> names a Tailer will process, according to its XML configuration files. For example, the following entry specifies two Binary Log File Managers for the db_internal process; a Tailer that includes db_internal in its log.tailer.processes property would use these Binary Log File Managers. Each parameter is explained below.

<Processes>
    <Process name="db_internal">
        <Log filePatternRegex="^DbInternal\.(?:[^\.]+\.){2}bin\..*"
             tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"
             namespace="DbInternal"
             internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
             path=".bin.$(yyyy-MM-dd)"
             fileManager="com.illumon.iris.logfilemanager.DateBinaryLogFileManager" />
      <Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager" />
    </Process>
</Processes>

If the <Process> name is a single asterisk (*), every Tailer will use that Process element's Log entries regardless of its log.tailer.processes value.

Each <Log> element generates one Binary Log File Manager instance to handle the associated binary log files. Binary log files are presented to each Binary Log File Manager in the order they appear in the XML configuration files. Once a file manager claims the file, it is not presented to any other file managers.

Binary Log File Manager types

Each <Log> element specifies information about the binary log files that a Tailer processes. The attributes within each <Log> entry vary depending on the type of Binary Log File Manager chosen. The Log Element Attributes section provides full details on each attribute.

StandardBinaryLogFileManager and DateBinaryLogFileManager both allow the following optional attributes to override default Tailer properties; each is explained in detail later, and a combined example follows this list.

  • idleTime changes the default idle time for when threads are terminated when no data is sent.
  • fileSeparator changes the default file separator from .bin..
  • logDirectory or logDirectories specifies directories in which the Binary Log File Manager will look for files.
  • excludeFromTailerIDs restricts a Log entry from Tailers running with the given IDs.
  • maxKBps specifies a Log-level throttle.
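For illustration, a Log entry combining several of these optional attributes might look like the following sketch (the directories, Tailer ID, and throttle value are hypothetical):

<Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
     idleTime="02:00"
     logDirectories="/var/log/myapp/binlogs,/var/log/myapp2/binlogs"
     excludeFromTailerIDs="tailer2"
     maxKBps="500" />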

Standard Binary Log File Manager

The StandardBinaryLogFileManager looks for files named with the default <namespace>.<table name>.<table type>.<internal partition>.<column partition>.bin.<date-time> format. It will be the most commonly used file manager, as it automatically handles files generated by the common Deephaven logging infrastructure.

  • The namespace, table name, table type, internal partition, and column partition values are all determined by parsing the filename.
  • The optional namespace or namespaceRegex attributes can restrict the namespaces for this Binary Log File Manager.
  • The optional tableName or tableNameRegex attributes can restrict the table names for this Binary Log File Manager.
  • The optional tableType attribute restricts the table type for this Binary Log File Manager.
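For example, this sketch restricts a standard Log entry to a single namespace and table, reusing the example values from the attribute table below:

<Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
     namespace="CustomerNamespace"
     tableName="Table1"
     tableType="System" />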

Date Binary Log File Manager

The DateBinaryLogFileManager requires further attributes to find and process files. It provides many options for parsing filenames, offering compatibility with many different environments.

  • A path attribute is required, and must include a date-format conversion. It may also restrict the start of the filename.
  • If path does not restrict the filename, a filePatternRegex attribute should be used to restrict the files found by the Binary Log File Manager.
  • A namespace or namespaceRegex attribute is required to determine the namespace for the binary log files.
  • A tableName or tableNameRegex attribute is required to determine the table name for the binary log files.
  • The optional internalPartitionRegex attribute can be specified to indicate how to determine the internal partition value from the filename. If the internalPartitionRegex attribute is not specified, the internal partition will be built from the log.tailer.internalPartition.prefix property or server name, and the internalPartitionSuffix attribute.
  • The column partition value is always determined from the filename, immediately following the file separator.
  • The table type is taken from the optional tableType attribute. If this is not specified, then the table type is System.

Custom Binary Log File Managers

The existing Binary Log File Managers will cover most use cases; in particular, the StandardBinaryLogFileManager will automatically find and tail files that use the standard Deephaven filename conventions. If you need additional functionality, you can create a custom BinaryLogFileManager and specify it in the Tailer's XML configuration file. For example, if files from before today are needed during Tailer startup, a custom BinaryLogFileManager would be required.

See the Javadoc for com.illumon.iris.logfilemanager.BinaryLogFileManager for details on how to do this. It may be easier to extend the StandardBinaryLogFileManager class as a starting point.

Log Element Attributes

Unless otherwise specified, all parameters apply to both regular-expression entries and fully-qualified entries. These are all attributes of a <Log> element.

  • excludeFromTailerIDs - Excludes this log from the specified Tailer IDs. Example: excludeFromTailerIDs="CustomerTailer1"
  • fileManager - The fully-qualified Java class implementing the Binary Log File Manager for this Log. If not specified, the log.tailer.defaultFileManager property determines the class to use. Example: fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
  • filePatternRegex - A regular expression that restricts file matches for a date Binary Log File Manager. Example: filePatternRegex="^[A-Za-z0-9_]*\.stats\.bin\.*"
  • fileSeparator - Specifies the separator to use if files do not include the standard .bin. separator. Example: fileSeparator="."
  • idleTime - The time (in HH:mm or HH:mm:ss) after which threads created by this manager are terminated if no data has been sent. Example: idleTime="02:00"
  • internalPartitionRegex - For date Binary Log File Managers, a regex applied to the filename to determine the internal partition. Example: internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
  • internalPartitionSuffix - A suffix added to the internal partition prefix. Example: internalPartitionSuffix="DataSource1"
  • logDirectory - Specifies a single directory for log files. Example: logDirectory="/usr/app/defaultLogDir"
  • logDirectories - A comma-delimited list of directories to search for files. Example: logDirectories="/usr/app1/bin,/usr/app2/bin"
  • maxKBps - The maximum rate (in KB/s) at which data can be sent to each Data Import Server. Example: maxKBps="1000"
  • namespace - Restricts/defines the namespace for this log entry. Example: namespace="CustomerNamespace"
  • namespaceRegex - A regex to determine the namespace from the filename. Example: namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"
  • path - For date Binary Log File Managers, the filename (not including directory) for the log. Example: path="CustomerNamespace.Table1.bin.$(yyyy-MM-dd)" or path=".bin.$(yyyy-MM-dd)"
  • tableName - Restricts/defines the table name for this log entry. Example: tableName="Table1"
  • tableNameRegex - A regex to determine the table name from the filename. Example: tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"
  • tableType - Restricts/defines the table type for this log entry. Example: tableType="User"

Regular expression example

It may be helpful to examine a standard set of regular expressions used in common Tailer configurations. A common filename pattern is <namespace>.<tablename>.<host>.bin.<date/time>, where the host is used as the internal partition value. The following filename illustrates this format:

EventNamespace.RandomEventLog.examplehost.bin.2017-04-01.162641.013-0600

In this example, the table name can be extracted from the filename with the following regular expression:

tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"

This regular expression works as follows:

  • ^ ensures the regex starts at the beginning of the filename.
  • (?: starts a non-capturing group, which is used to skip past the various parts of the filename separated by periods.
  • [^\.]+ as part of the non-capturing group; matches one or more characters that are not periods.
  • \. as part of the non-capturing group; matches a period.
  • ){1} ends the non-capturing group and causes it to be applied 1 time. In matches for other parts of the filename, the 1 will be replaced with how many periods should be skipped; this causes the regex to skip the part of the filename up to and including the first period.
  • ([^\.]+) defines the capturing group, capturing everything up to the next period, which in this case captures the table name.
  • .* matches the rest of the filename, but doesn't capture it.

For this example, the three attributes that would capture the namespace, table name, and internal partition are very similar, only differing by the number that tells the regex how many periods to skip:

  • namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"

  • tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"

  • internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
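These patterns can be checked with standard Java regular expressions; the following minimal sketch (class and method names are illustrative) extracts each component from the example filename above:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilenameRegexExample {
    public static void main(String[] args) {
        final String filename = "EventNamespace.RandomEventLog.examplehost.bin.2017-04-01.162641.013-0600";
        // The {n} count controls how many dot-delimited tokens are skipped
        System.out.println(extract(filename, 0)); // EventNamespace (namespace)
        System.out.println(extract(filename, 1)); // RandomEventLog (table name)
        System.out.println(extract(filename, 2)); // examplehost (internal partition)
    }

    private static String extract(final String filename, final int skip) {
        // Same pattern shape as the namespaceRegex/tableNameRegex/internalPartitionRegex attributes
        final Matcher m = Pattern.compile("^(?:[^\\.]+\\.){" + skip + "}([^\\.]+).*").matcher(filename);
        return m.matches() ? m.group(1) : null;
    }
}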

The above example could result in the following regular expression Log entry, which would search for all files that match the <namespace>.<tablename>.<host>.bin.<date/time> pattern in the default log directories:

<Log filePatternRegex="^[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.bin\..*"
     namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"
     tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"
     internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
     runTime="25:00:00"
     path=".bin.$(yyyy-MM-dd)" />

Direct-to-DIS CSV Tailing

Deephaven supports tailing CSV files directly to the Data Import Server. To configure CSV tailing:

  1. Create and properly configure a schema with an ImportSource element. This schema may be handwritten or generated via Schema Discovery.
  2. Create and configure a Tailer configuration file in an appropriate customer-controlled location, such as /etc/sysconfig/illumon.d/resources.
    1. Mark the process as CSV format by adding an attribute fileFormat="CSV".
    2. Add a new tag <Metadata> to provide the DIS with important information needed to properly process the CSV files.
  3. Configure Tailer properties to find the new configuration file and recognize the new process name by editing iris-environment.prop, in a stanza appropriate to your Tailer (e.g., [service.name=tailer1|tailer1_query|tailer1_merge], marked with the comment # For all tailer processes):
    1. Set log.tailer.configs to include your configuration file. E.g., log.tailer.configs=tailerConfigDbInternal.xml,newTailerConfig.xml
    2. Set log.tailer.processes to include your process name (or use db_internal). E.g., log.tailer.processes=db_internal,csv_data.

In the Tailer configuration file, the <Metadata> tag should contain a collection of <Item name="name" value="value"/> entries. The valid Items are:

  • importSourceName - (Required) The name of a valid ImportSource from the table's schema. The DIS will use this description to parse the CSV.
  • charset - (Optional) The charset encoding of the files. If omitted, this defaults to UTF-8.
  • format - (Optional) The format of the CSV file. This may be any one of the supported CSV formats except BPIPE (see Importing CSV files).
  • hasHeader - (Optional) Indicates whether the file has a header line. This defaults to true.
  • delimiter - (Optional) The delimiter between CSV values. If omitted, this uses the delimiter of the specified format.
  • trim - (Optional) Specify that CSV fields should be trimmed before they are parsed.
  • rowsToSkip - (Optional) The number of rows (excluding the header) to skip from each file tailed.
  • constantValue - (Optional) The value to use for constant fields.

Once these have been configured and the Tailer started, any CSV files produced in the specified paths will be tailed directly to the DIS.

An example follows:

<Processes>
    <Process name="csv_data">
        <Log name="MyTableName"
            namespace="MyNamespace"
            fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
            logDirectory="/path/to/your/data"
            idleTime="04:00"
            fileSeparator=".csv."
            fileFormat="CSV">
           <Metadata>
             <Item name="importSourceName" value="IrisCSV"/>
             <Item name="trim" value="false"/>
             <Item name="hasHeader" value="true"/>
             <Item name="constantValue" value="SomeValue"/>
           </Metadata>
        </Log>
    </Process>
</Processes>

This simpler example will make the default Tailer pick up CSV files in /var/log/deephaven/binlogs:

<Processes>

    <Process name="db_internal">
        <Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
            fileSeparator=".csv."
            fileFormat="CSV">
           <Metadata>
             <Item name="importSourceName" value="IrisCSV"/>
           </Metadata>
        </Log>
    </Process>

</Processes>

Transactional processing

In many cases, it is desirable to process sets of rows as a transaction. There are two different methods for supporting this paradigm.

Transaction column

The first and best method is to designate a specific column in the CSV as a "Transaction" column. This allows the source writing the CSV to control when transactions are started and completed. To enable this mode of processing, you must add the rowFlagsColumn metadata item to the Tailer configuration file.

<Item name="rowFlagsColumn" value="rowFlags"/>

This designates a column "rowFlags" as a special column. Transactions are started by writing a row with StartTransaction in the "rowFlags" column. Subsequent rows that are part of the transaction should have "None" written to the "rowFlags" column, and to end the transaction, write a row with "EndTransaction" in the "rowFlags" column.

You may write "SingleRow" for rows that should be logged independently, not as part of a transaction.
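A minimal sketch of what such a CSV might look like, assuming a hypothetical two-column table plus the rowFlags column:

rowFlags,Sym,Price
StartTransaction,AAPL,100.25
None,MSFT,200.50
EndTransaction,GOOG,300.75
SingleRow,IBM,150.00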

Caution

While you are in a transaction, writing "SingleRow" or "StartTransaction" will abort the transaction in progress!

Automatic transactions

In some cases, the format of the CSV may not be changeable. In these cases, you may use automatic transactions to accomplish a similar goal.

Caution

Automatic transactions are not recommended unless they are absolutely necessary.

Automatic transactions ensure that every row logged is part of a transaction. When a new file is encountered or no rows have been received within the timeout window, the current transaction is completed, and subsequent rows begin a new transaction.

To enable automatic transactions, add the following setting to the Tailer configuration metadata:

<Item name="transactionMode" value="NewFileDeadline"/>
<Item name="transactionDeadlineMs" value="30000"/>

The first setting enables automatic transactions, while the second specifies the inactivity timeout in milliseconds for which the system will automatically complete a transaction.

Running a remote Tailer on Linux

The Deephaven log Tailer (LogtailerMain class) should ideally run on the same system where log files are generated. It is strongly recommended that the storage used by the logger and Tailer processes is fast, local storage, to avoid any latency or concurrency issues that might be introduced by network filesystems.

Setup process

To set up an installation under which to run the LogtailerMain class:

  1. Install Java SE runtime components, including the Java JDK (devel package). The Tailer can sometimes be run with a JRE, but if advanced non-date partition features or complex data routing filtering are used, a JDK will be required. This should match the version of Java installed on the Deephaven cluster servers. The presence and version of a JDK can be verified with javac -version.
  2. Create a directory for a Tailer launch command, configuration files, and workspace and logging directories. This directory must be accessible for read and write access to the account under which the Tailer will run.
  3. Obtain the JAR and configuration files from a Deephaven installation (either copied from a server, or installed and/or copied from a client installation using the Deephaven Launcher), or use the Deephaven Updater to synchronize these from the Client Update Service. The Deephaven Updater is the recommended approach, as it allows automation of runtime updates for headless systems.
cd /home/mytaileraccount
wget "https://mydhserver.mydomain.com:8443/DeephavenLauncher_9.02.tar"
tar xf DeephavenLauncher_9.02.tar
/home/mytaileraccount/DeephavenLauncher/DeephavenUpdater.sh Deephaven https://mydhserver.mydomain.com:8443/iris/

Note

The current version of the DeephavenLauncher tar will vary from release to release, and the actual URL will vary with your installation. See Find the Client Update Service Url to determine the correct values.

Deephaven in the above example is the friendly name for this instance. Any name can be used, but a name without spaces or special characters will be easier to use in scripts.

The Deephaven Updater should be re-run periodically to ensure the Tailer has up-to-date binaries and configuration from the Deephaven server. We recommend running it before each Tailer startup as part of the Tailer start script.
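For example, the Tailer start script (tailer-start.sh in the systemd example below) could begin by invoking the Updater before launching Java; this is a sketch using the paths from this section:

#!/bin/bash
# Refresh jars and configuration from the Client Update Service
/home/mytaileraccount/DeephavenLauncher/DeephavenUpdater.sh Deephaven https://mydhserver.mydomain.com:8443/iris/
# ...then run the java launch command created in the next step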

  4. Create the following directories, and ensure the account under which the Tailer will run has access to them:
  • /var/log/deephaven/tailer - the process logging path. It is sufficient, and recommended, to create /var/log/deephaven and grant the Tailer account read/write access to this path. An alternate logging path can be configured with a custom value passed to the -DlogDir JVM argument when starting the Tailer.
  • /var/log/deephaven/binlogs - the root path for binary log files, and the recommended location for local loggers to write their log files.
  • /var/deephaven/run - the account will need read/write access here to create and delete PID files (can be overridden with a custom JVM argument -DpidFileDirectory).
  • /var/log/deephaven/binlogs/pel and /var/log/deephaven/binlogs/perflogs - these directories are monitored by default by all Tailers; other Deephaven processes may use them on the remote Tailer host.
  5. Create a launch command to provide the correct classpath, properties file, and other installation-specific values needed to run the Tailer process.
#!/bin/bash
java -cp "/home/mytaileraccount/iris/.programfiles/Deephaven/resources":"/home/mytaileraccount/iris/.programfiles/Deephaven/java_lib/*":"/home/mytaileraccount/iris" \
-server -Xmx4096m -DConfiguration.rootFile=iris-common.prop \
-Dworkspace="/home/mytaileraccount/Tailer/workspaces" \
-Dlog.tailer.configs=tailer.xml \
-Ddevroot="/home/mytaileraccount/iris/.programfiles/Deephaven/java_lib/" \
-Dprocess.name=tailer \
-Dservice.name=iris_console \
-Dlog.tailer.processes=monitoring \
-Ddh.config.client.bootstrap=/home/mytaileraccount/iris/.programfiles/Deephaven/dh-config/clients \
-Dintraday.tailerID=1 com.illumon.iris.logtailer.LogtailerMain

Note

The classpath entries include the Deephaven configuration files (resources), the jars (java_lib/*), and, in this example, the directory that contains the Tailer config XML file (iris).

  6. Configure startup for the Tailer process, for example with a systemd unit file such as /etc/systemd/system/tailer1.service:
[Unit]
Description=Deephaven Tailer 1
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service
StartLimitIntervalSec=100
StartLimitBurst=5

[Service]
User=mytaileraccount
Group=mytailergroup
ExecStart=/home/mytaileraccount/iris/tailer-start.sh
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
Then enable and start the service:

sudo systemctl enable tailer1.service
sudo systemctl start tailer1.service

Management and logging

The remote Tailer can be managed using systemd commands:

sudo systemctl status tailer1.service
sudo systemctl stop tailer1.service

Logging for the Tailer process will be under /var/log/deephaven/tailer, unless an alternate path is passed to the -DlogDir JVM argument.

Running a remote Tailer on Windows

The Deephaven log Tailer (LogtailerMain class) should ideally be run on the same system where log files are being generated. Since this class has no dependency on *nix filesystem or scripting functionality, it can be run under Java as a Windows process or service when a Windows system is generating binary log data.

Requirements

To set up an installation under which to run the LogtailerMain class:

  1. Install Java SE runtime components or use the Launcher installation that includes the Java JDK. The Tailer can sometimes be run with a JRE, but if advanced non-date partition features or complex data routing filtering are used, a JDK will be required.
  2. Create a directory for a Tailer launch command, configuration files, and workspace and logging directories. This directory must be accessible for read and write access to the account under which the Tailer will run.
  3. Create the following directories, and ensure the account under which the Tailer will run has access to them:
  • \var\log\deephaven\tailer - the process logging path. It is sufficient, and recommended, to create \var\log\deephaven and grant the Tailer account read/write access to this path. An alternate logging path can be configured with a custom value passed to the -DlogDir JVM argument when starting the Tailer.
  • \var\log\deephaven\binlogs - the root path for binary log files, and the recommended location for local loggers to write their log files.
  • \var\deephaven\run - the account will need read/write access here to create and delete PID files (can be overridden with a custom JVM argument -DpidFileDirectory).
  • \var\log\deephaven\binlogs\pel and \var\log\deephaven\binlogs\perflogs - these directories are monitored by default by all Tailers; other Deephaven processes may use them on the remote Tailer host.
  4. Obtain the JAR and configuration files from a Deephaven installation (either copied from a server, or installed and/or copied from a client installation using the Deephaven Launcher), or use the Deephaven Updater to synchronize these from the Client Update Service. We recommend using the Deephaven Updater as it can automate runtime updates for headless systems.

  5. Create a launch command to provide the correct classpath, properties file, and other installation-specific values needed to run the Tailer process.

  6. Configure startup for the Tailer process.

Note

Since Windows does not support inotify, Tailers on Windows will need to be configured to use the PollWatchService.
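For example, this single property line selects the polling implementation (see File watch properties above):

log.tailer.watchServiceType=PollWatchService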

Example Directory


This directory was created under Program Files, where administrative rights will be needed by the account under which the Tailer will run. In this example, the Tailer is being run as a service under LocalSystem, and the Deephaven resources and jars were manually synchronized to this path. Specific accounts with more limited rights than LocalSystem would also be valid. The key requirements are that the account must be able to write to the logs and workspace directories, must be able to read from all files to be tailed and all configuration and JAR files, and must have sufficient network access to contact the DIS.

Note that this directory contains a TailerConfig XML file, which specifies the Tailer's configuration (i.e., which binary log files it will tail) and a property file, which specifies other Tailer configuration options, including where the Tailer will send the data. For full details on Tailer configuration, see Tailer Configuration. This directory also contains a logs and a workspace directory, which can be created empty, and a java_lib folder that was copied, with its JAR files, from the directory structure set up by the Deephaven Launcher client installer.

As new releases of Deephaven are applied to an environment, the JAR files in java_lib must be updated so that any new Tailer or Tailer support functionality is also available to Windows-based Tailer processes. The Deephaven Updater makes it simple to automate this update.

We recommend running the Updater before each Tailer startup as part of the Tailer start script.

Launch Command

The launch command in this example is a Windows cmd file. It contains (effectively) one line (^ is the line continuation character for Windows .bat and .cmd files):

java -cp^
 "[instance_root]\[instance_name]\resources";"[instance_root]\[instance_name]\java_lib\*";"[location_of_tailerConfig.xml_file]"^
 -server -Xmx4096m^
 -DConfiguration.rootFile=iris-common.prop^
 -Dworkspace="[instance_root]\[instance_name]\workspaces"^
 -Ddevroot="[instance_root]\[instance_name]\java_lib"^
 -Dprocess.name=tailer^
 -Dservice.name=iris_console^
 -Dlog.tailer.processes=monitoring^
 -Ddh.config.client.bootstrap="[instance_root]\[instance_name]\dh-config\clients"^
 -Dintraday.tailerID=1 com.illumon.iris.logtailer.LogtailerMain

The components of this command follow:

  • java - Must be in the path and invoke Java matching the version used by the Deephaven server installation.
  • -cp "[instance_root]\[instance_name]\resources";"[instance_root]\[instance_name]\java_lib\*";"[location_of_tailerConfig.xml_file]" - The class path to be used by the process.
    • The first part is the path to configuration files used for the Deephaven environment, such as properties files and the data routing YAML file.
    • The second part is the path to the JAR files from the Deephaven server or client install.
    • The third part points to the directory that contains the configuration XML file that the process will need.
  • -server - Run Java for a server process.
  • -Xmx4096m - How much memory to give the process. This example uses 4GB, but smaller installations (tailing fewer files concurrently) will not need this much memory. 1GB will be sufficient for most environments, but if many logs are tailed or data throughput is high, this may need to be increased.
  • -DConfiguration.rootFile=iris-common.prop - The initial properties file to read. This file can include other files, which must also be in the class path.
    • iris-common.prop is the default unified configuration file for Deephaven processes.
    • It is also possible to specify or override many Tailer settings by including other -D entries instead of writing them into the properties file.
  • -Dworkspace="[instance_root]\[instance_name]\workspaces" - The working directory for the process.
  • -Ddevroot="[instance_root]\[instance_name]\java_lib" - The path to the Deephaven binaries (JARs); the same as the java_lib entry in the class path in this example.
  • -Dprocess.name=tailer - The process name for this process.
  • -Dservice.name=iris_console - Use stanzas from properties files suitable for a remote client process.
  • -Dlog.tailer.processes=[comma-separated list] - Process names to be found in the Tailer config XML files.
  • -Ddh.config.client.bootstrap="[instance_root]\[instance_name]\dh-config\clients" - specifies where to find configuration server host and port details.
  • -Dintraday.tailerID=1 - ID for the Tailer process that must match Tailer instance specific properties from the properties file.
  • com.illumon.iris.logtailer.LogtailerMain - The class to run.

Other properties that could be included, but are often set in the properties file instead:

  • -Dlog.tailer.enabled.1=true
  • -Dlog.tailer.configs=tailerConfig.xml
  • -DpidFileDirectory=c:/temp/run/illumon

Also, if logging to the console is desired (e.g., when first configuring the process, or troubleshooting), -DLoggerFactory.teeOutput=true will enable log messages to be teed to both the process log file and the console.

Warning

This should not be set to true in a production environment. It will severely impact memory usage and performance.

If the pidFileDirectory is not overridden, the process will expect to find C:\var\run\illumon in which to write a pid file while it is running.

The process will also expect C:\var\log\deephaven\binlogs\pel and C:\var\log\deephaven\binlogs\perflogs to exist at startup. These directories should be created as part of the Tailer setup process.
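A sketch of the directory setup from a command prompt with sufficient rights (paths match the defaults described above):

mkdir C:\var\log\deephaven\tailer
mkdir C:\var\log\deephaven\binlogs\pel
mkdir C:\var\log\deephaven\binlogs\perflogs
mkdir C:\var\run\illumon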

Automate Tailer execution

There are several options to automate the execution of the LogTailerMain process. Two possibilities are the Windows Task Scheduler or adding the process as a Windows Service.

The Task Scheduler is a Windows feature that can be used to automatically start, stop, and restart tasks. Previous versions of the Tailer process needed to be restarted each day, but current versions can be left running continuously.

An easy way to configure the launch command to run as a Windows Service is to use NSSM (The Non-Sucking Service Manager). This free tool initially uses a command line to create a new service:

nssm install "service_name"

The tool will then launch a UI to allow the user to browse to the file to be executed as a service (e.g., runTailer.cmd in this example), and to specify other options like a description for the service, startup type, and the account under which the service should run. Once completed, this will allow the Tailer to be run and managed like any other Windows service.
