Deephaven Data Tailer

The tailer is a Deephaven process that looks for files generated by application and Deephaven processes, and then sends the data contained in these files to the Data Import Server (DIS) for intraday storage. The process of reading these files is referred to as "tailing", as it is analogous to the Linux tail command. One or more tailers can run at the same time, with each tailer's configuration specifying the files it tails and where its data is sent. All data is loaded into the appropriate internal partition based on the tailer's configuration and the log file names, and the column partition information is always embedded in the log file names. The tailer currently supports Deephaven binary log files and CSV files.

Each tailer is given a name, which is usually specified through a Java parameter when the tailer is started. For example, on a standard Linux Deephaven installation the command sudo /usr/illumon/latest/bin/iris start tailer 1 starts a tailer called "1". This name is used in various configuration parameters and does not need to be numeric.

It is generally recommended that a tailer be run on the nodes on which data is produced as this provides the best performance for finding and processing files. A tailer can process many different tables simultaneously, as well as send data to different destinations. One tailer is generally enough, but it is possible to run more than one tailer on any node to enable independent restarts for different configurations.

Tailer configuration consists of three parts. Process-level configuration, such as the tailer's name, is handled through properties retrieved from Deephaven property files. Configuration related to specific tables and logs for a tailer is retrieved from tailer-specific XML files. Destination configuration (where the tailer should send data) is specified in the data routing configuration.

Tailer configuration is detailed in the Property File Configuration Parameters and Tailer XML Configuration Parameters sections.

There is also a Quick Start Guide to setting up a simple new tailer that looks for one table.

Intraday Partitioning

Deephaven partitions all intraday data into separate directories, in effect creating multiple sets of data files spread across these directories. When querying intraday data, or merging it into the historical database, all the intraday partitioning directories are combined into a single table. In most cases users are not directly aware of this partitioning, but configuring the tailer correctly requires an in-depth understanding of it.

Deephaven is highly optimized for appending data to tables. To accomplish this, it is required that only one data source causes data to be written to a given set of underlying intraday data files at any time. This is one of the primary reasons for intraday partitioning. Since most real-time data sources will use Deephaven loggers to generate data, and this data will be sent through the tailer, it is the tailer that determines the appropriate internal partition directories into which this intraday data will be appended.

There are two levels of partitioning that the tailer will determine when it sends data to the DIS. When the tailer is configured to set up both of these levels properly, it ensures the data is separated appropriately. The DIS will create distinct directories at each level of partitioning, and the lowest-level subdirectories will each contain a set of data files for the table.

  • Internal partitions are the first level of partitioning. Internal partitions usually correspond to a data source. The tailer configuration usually determines the internal partition value from a binary log file name, possibly applying information from its configuration to further distinguish it. Internal partitions are not determined by the schema. The internal partition is also called the destination partition in some Deephaven tools, such as importers.
  • Column partitions divide the data based on the partitioning column specified in the table's schema, a String column that frequently contains a date value. The tailer determines the column partition value based on a log file name. For date-based column partitioning, the value is usually in yyyy-MM-dd format (e.g., 2017-04-21).
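The combined effect of the two partitioning levels can be sketched as follows. This is an illustrative sketch only: the path template is an assumption for demonstration, not a documented layout, and the actual directory structure is created and managed by the DIS.

```python
# Illustrative sketch only: the real on-disk layout is managed by the DIS,
# and the path template below is an assumption, not a documented contract.
def intraday_location(namespace, table, internal_partition, column_partition):
    """Combine the two partition levels into a distinct storage location."""
    return f"/db/Intraday/{namespace}/{table}/{internal_partition}/{column_partition}"

# Two data sources (internal partitions) logging the same date (column
# partition) land in separate directories, so only one source ever appends
# to a given set of intraday data files.
loc1 = intraday_location("DbInternal", "AuditEventLog", "vmhost1", "2018-07-23")
loc2 = intraday_location("DbInternal", "AuditEventLog", "vmhost2", "2018-07-23")
```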

Binary Log Filename Format

The tailer automatically monitors binary log directories for new files, looking for new partitions and processing them appropriately. To assist in these file searches, binary log filenames must be in a standard format, with several parts separated by periods. Support is provided for filenames in other formats, but this requires additional configuration (explained later).

<Namespace>.<Table Name>.<System or User>.<Internal Partition>.<Column Partition>.bin.<Timestamp>

  • The first part should indicate the table's namespace.
  • The second part should indicate the table's name.
  • The third part should indicate the table's type ("System" or "User").
  • The fourth part should indicate the internal partition value.
  • The fifth part should indicate the column partition value.
  • Next comes the .bin. binary log file identifier.
  • Finally, the date and time the file was created to help with file sorting.

The following example from a Deephaven internal table illustrates a typical filename:

DbInternal.AuditEventLog.System.vmhost1.2018-07-23.bin.2018-07-23.135119.072-0600

  • DbInternal - the table's namespace.
  • AuditEventLog - the table's name.
  • System - the type of table, in this case belonging to a system namespace. User indicates a user namespace.
  • vmhost1 - the internal partition used to distinguish a data source, usually a host name.
  • 2018-07-23 - the column partition value. This is frequently a date in yyyy-MM-dd format (or another specified format that includes the same fields), but any partitioning method may be used as long as the filename can be used to distinguish the partition.
  • .bin. - the file identifier that distinguishes this as a Deephaven binary log file. The standard identifier is .bin..
  • 2018-07-23.135119.072-0600 - a timestamp so the files can be processed in order. The standard format used in this example is a timestamp down to the millisecond in yyyy-MM-dd.HHmmss.SSS format, followed by timezone offset. The timezone offset is important to correctly sort files during daylight-savings transitions.
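The filename breakdown above can be expressed as a small parser. This is an illustrative sketch, not the tailer's actual implementation; it assumes the standard .bin. separator and uses the example filename from above:

```python
import re

# Standard format: <ns>.<table>.<type>.<internal>.<column>.bin.<timestamp>
# Illustrative only; the tailer's real parsing lives in its file managers.
PATTERN = re.compile(
    r"^(?P<namespace>[^.]+)\."
    r"(?P<table>[^.]+)\."
    r"(?P<type>System|User)\."
    r"(?P<internal>[^.]+)\."
    r"(?P<column>[^.]+)"
    r"\.bin\."
    r"(?P<timestamp>.+)$"
)

def parse_binlog_name(filename):
    """Split a standard-format binary log filename into its parts."""
    m = PATTERN.match(filename)
    return m.groupdict() if m else None

parts = parse_binlog_name(
    "DbInternal.AuditEventLog.System.vmhost1.2018-07-23.bin.2018-07-23.135119.072-0600"
)
```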

Bandwidth Throttling

The tailer configuration provides the option to throttle the throughput of data sent to specific destinations, or the throughput sent for a specific log entry (per destination). This should only be used if network bandwidth is limited or issues have been encountered due to unexpected spikes in the amount of data being processed by the tailer. Because it restricts how quickly the tailer will process binary log files, use of this feature can cause Deephaven to fall behind in getting real-time data into the intraday database. Throttling is optional and will not be applied unless specified.

Two types of throttling are supported.

  • Destination-level throttling is applied to each destination (data import server), and is specified in the data routing configuration for a given data import server.
  • Log-entry-level throttling is applied to each destination for each log entry in the XML file, and is applied across all partitions for that log entry. It is specified with the maxKBps attribute in a log entry.

Throttles are always specified in kilobytes per second (KBps). Because the binary log files are sent directly to the data import servers without translation, the size of the binary log files can give an indication of the required bandwidth for a given table. For example, if a particular table's binary log files are sized at 1GB each hour, then an approximate bandwidth requirement could be calculated as follows:

  • Since throttles are specified in KBps, first translate GB to KB: 1GB = 1,024MB = 1,048,576 KB.
  • Next, divide the KB per hour by the number of seconds in an hour: 1,048,576 KB per hour / 3,600 seconds per hour = 291.3KB per second.
  • In this case, setting a 300KBps throttle would ensure that each hour's data could be logged, but latency will occur when data delivery to the log file spikes above the specified 300 KB/s throttle rate.
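The arithmetic above can be checked directly:

```python
# Reproduce the example calculation: 1 GB of binary log data per hour.
gb_per_hour = 1
kb_per_hour = gb_per_hour * 1024 * 1024   # 1 GB = 1,048,576 KB
kbps_required = kb_per_hour / 3600         # seconds per hour
# kbps_required is roughly 291.3 KBps, so a 300 KBps throttle keeps up
# with the average rate while capping bursts above it.
```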

Throttling uses a token bucket algorithm to provide an upper bandwidth limit and smoothing of peaks over time. The token bucket algorithm requires, at a minimum, enough tokens for the largest message Deephaven can send; the bucket size is calculated as the number of seconds (see the log.tailer.bucketCapacitySeconds property) times the KBps for the throttle. At the current time, Deephaven sends messages of at most 2,048KB (2MB), so the number of seconds in the bucket times the bandwidth per second must equal or exceed 2,048KB. The software will detect if this is not true and throw an exception that prevents the tailer from starting.

The bucket capacity also impacts how much bandwidth usage can spike, as the entire bucket's capacity can be used at one time.

  • If log.tailer.bucketCapacitySeconds is 10, and a throttle allows 10KBps, then the maximum bucket capacity is 100KB, which is not sufficient for Deephaven; the bucket would never allow the maximum 2,048KB message through. Deephaven will detect this and fail.
  • If log.tailer.bucketCapacitySeconds is 60, and a throttle allows 1,000KBps, then the bucket's capacity is 60,000KB. If a large amount of data was quickly written to the binary log file, the tailer would immediately send the first 60,000KB of data to the DIS. After that, the usage would level out at 1,000KBps. The bucket would gradually refill to its maximum capacity once the data arrival rate dropped below 1,000KBps, until the next burst.
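The behavior described above can be modeled with a minimal token bucket. This is an illustrative sketch, not Deephaven's implementation; capacity is bucketCapacitySeconds times the throttle rate, and a send is admitted only if enough tokens remain:

```python
class TokenBucket:
    """Minimal token-bucket model: capacity = bucketCapacitySeconds * rate_kbps."""

    def __init__(self, bucket_capacity_seconds, rate_kbps):
        self.capacity_kb = bucket_capacity_seconds * rate_kbps
        self.rate_kbps = rate_kbps
        self.tokens_kb = self.capacity_kb  # bucket starts full

    def refill(self, elapsed_seconds):
        """Add tokens at the throttle rate, up to the bucket's capacity."""
        self.tokens_kb = min(self.capacity_kb,
                             self.tokens_kb + elapsed_seconds * self.rate_kbps)

    def try_send(self, message_kb):
        """Admit the message only if enough tokens are available."""
        if message_kb > self.tokens_kb:
            return False
        self.tokens_kb -= message_kb
        return True

# 10 s * 10 KBps = 100 KB capacity: a 2,048 KB message can never pass,
# which is the misconfiguration the tailer rejects at startup.
too_small = TokenBucket(10, 10)

# 60 s * 1,000 KBps = 60,000 KB capacity: a burst can drain the whole
# bucket at once, after which throughput levels out at the refill rate.
ok = TokenBucket(60, 1000)
```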

Binary Log File Managers

The tailer must be able to perform various operations related to binary logs.

  • It must derive namespace, table name, internal partition, column partition, and table type values from binary log filenames and the specified configuration log entry.
  • It must determine the full path and file prefix for each file pattern that matches what it is configured to search for.
  • It must sort binary log files to determine the order in which they will be sent.

These operations are handled through the use of Java classes called binary log file managers. Two binary log file managers are provided.

  • StandardBinaryLogFileManager - this provides functionality for binary log files in the standard format described above (<namespace>.<table name>.<table type>.<internal partition>.<column partition>.bin.<date-time>)
  • DateBinaryLogFileManager - this provides legacy functionality for binary log files using date-partitioning in various formats.

A default binary log file manager will be used for all log entries unless a specific log entry overrides it with the fileManager XML attribute. The log.tailer.defaultFileManager property specifies the default binary log file manager. The default value of this property specifies the DateBinaryLogFileManager.

For the tailer to function correctly, file names for each log must be sortable. Both binary log file manager types take time zones into account for file sorting (including daylight savings time transitions), assuming the filenames end in the timezone offset (+/-HHmm, for example "-0500" to represent Eastern Standard Time). The standard Deephaven Java logging infrastructure ensures correctly-named files with full timestamps and time zones. Note that time zones are not taken into account by the DateBinaryLogFileManager in determining the date as a column partition value, only in determining the order of generated files; the column partition's date value comes from the filename and is not adjusted.
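As a sketch of why the offset matters, filename timestamps can be sorted by the absolute instant they represent rather than by their local-time text. This is illustrative only, using Python's strptime with the %z offset directive; the two timestamps are hypothetical, chosen around a fall-back DST transition where the local time 01:30 occurs twice:

```python
from datetime import datetime

def file_sort_key(timestamp):
    """Sort key for a trailing yyyy-MM-dd.HHmmss.SSS<offset> timestamp."""
    return datetime.strptime(timestamp, "%Y-%m-%d.%H%M%S.%f%z")

# During fall-back, 01:30 local time occurs twice; only the offset
# distinguishes the earlier (-0400) instant from the later (-0500) one.
stamps = ["2018-11-04.013000.000-0500", "2018-11-04.013000.000-0400"]
ordered = sorted(stamps, key=file_sort_key)
```

Sorting the raw strings lexically would put the -0400 file last, processing the files out of order; sorting by instant puts it first.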

Tailer Configuration

Deephaven property files and tailer-specific XML files are used to control the tailer's behavior.

  1. Property definitions - tailer behavior that is not specific to tables and logs is controlled by property definitions in standard Deephaven property files. This includes destination details and binary log file directories to be monitored.
  2. Table and log definitions - detailed definitions of the binary logs containing intraday data, and the tables to which those logs apply, is contained in XML configuration files. Each log entry in these XML files corresponds to a single binary log file manager instance.

Property File Configuration Parameters

This section details all the tailer-related properties that are read from the property file specified by the Configuration.rootFile Java parameter (usually specified in the host configuration with the CONFIGFILE environment variable, such as CONFIGFILE=iris-common.prop).

Properties may be specified in tailer property files, or passed in as Java parameters with the -D prefix (e.g., -DParameterName=ParameterValue).

Configuration File Specification

These parameters determine the processing of XML configuration files, which are read during tailer startup to build the list of logs that are handled by the tailer.

  • log.tailer.configs - a comma-delimited list of XML configuration files, which define the tables being sent. For example, the tailer that handles the internal binary logs has the following definition:
    • log.tailer.configs=tailerConfigDbInternal.xml
  • log.tailer.processes - a comma-delimited list of processes for which this tailer will be sending data. Each process should correspond to a "Process" tag in the XML configuration files. If the list is empty or not provided, the tailer will run but not handle any binary logs. If it includes an entry with a single asterisk (*), the XML entries for all processes are used. Otherwise, the tailer only reads the XML configuration entries for the processes listed in this property (and any entries with a Process name of *).
    • log.tailer.processes=iris_controller,customer_logger_1

Tailer Parameters

These parameters specify the name and runtime attributes of the tailer.

  • intraday.tailerID - specifies the tailer ID (name) for this tailer. This is usually set by the startup scripts in the Java parameters rather than in the property file. For example, if the iris scripts are used, sudo /usr/illumon/latest/bin/iris start tailer 1 will cause intraday.tailerID to be set to 1 when the tailer is started; alternatively, the Java command line argument -Dintraday.tailerID=1 will have the same effect. The parameter can also be specified in the property file if desired.
    • intraday.tailerID=customer1
  • log.tailer.enabled.<tailer ID> - If this is true, then the tailer will start normally. If this is false, then the tailer will run without tailing any files.
    • log.tailer.enabled.customer1=true
  • log.tailer.bucketCapacitySeconds - the capacity in seconds for each bucket used to restrict the bandwidth to the Data Import Servers. This is applied independently to every throttle, both the destination-level throttles, and the log-level throttles specified in the XML. It must be large enough to handle the largest possible messages, as explained in the bandwidth throttling section. If the parameter is not provided, it defaults to 30 seconds.
    • log.tailer.bucketCapacitySeconds=120
  • log.tailer.retry.count - this optional parameter specifies how many times each destination thread will attempt to reconnect after a failure. The default is Integer.MAX_VALUE, effectively imposing no limit. Once this limit is reached for a given table/destination/internal partition/column partition set, the tailer will no longer attempt to reach the destination unless new files cause it to restart the connection. A value of 0 is valid and means that only one connection attempt will be made.
  • log.tailer.retry.pause - the pause, in milliseconds, between each reconnection attempt after a failure. The default is 1000 milliseconds (1 second).
  • log.tailer.poll.pause - the pause, in milliseconds, between each poll attempt to look for new data in existing log files or find new log files when no new data was found. A lower value will reduce tailer latency but increase processing overhead. The default is 100 milliseconds (.1 second).
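Putting the parameters above together, a hypothetical property file fragment for a tailer named customer1 might look like the following. All values are drawn from the examples above and are illustrative only:

```properties
intraday.tailerID=customer1
log.tailer.enabled.customer1=true
log.tailer.configs=tailerConfigCustomer.xml
log.tailer.processes=customer_logger_1
log.tailer.bucketCapacitySeconds=120
log.tailer.retry.pause=1000
log.tailer.poll.pause=100
```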

File Watch Parameters

The tailer watches for new files to present to its binary log file manager instances for possible tailing, using underlying file-watch infrastructure. The logic used to watch for these files can be changed, but this parameter should only be changed on the advice of Deephaven Data Labs.

  • log.tailer.watchServiceType - a parameter that specifies the underlying watch service implementation. Options are:
    • JavaWatchService - an implementation that uses the built-in Java watch service. On most Linux implementations this is the most efficient implementation and can result in quicker file notifications, but it does not work correctly on shared file systems such as NFS when files are created remotely.
    • PollWatchService - a poll-based implementation that works on all known file systems but may be less efficient and consume more resources than the JavaWatchService option.
    • log.tailer.watchServiceType=JavaWatchService

Memory Parameters

The following parameters control the memory consumption of the tailer and the Data Import Server.

  • DataContent.producersUseDirectBuffers - Default is true. If true, the tailer will allocate direct memory for its data buffers. If this is changed, the tailer's JVM arguments in its hostconfig will need to be adjusted. The default settings are -Xmx256m -XX:MaxDirectMemorySize=1500m.
  • DataContent.consumersUseDirectBuffers - Default is false. If true, the Data Import Server will use direct memory for its data buffers.
  • BinaryStoreMaxEntrySize - Default is 1024 * 1024. Sets the maximum size in bytes for a single data row in a binary log file. This governs default buffer sizes below, and will affect appropriate values if the buffer sizes are changed.
  • DataContent.producerBufferSize - Default is 2 * BinaryStoreMaxEntrySize + 2 * Integer.BYTES. The size in bytes of buffers the tailer will allocate for each data stream. This is the size of the chunks the tailer will send to the DIS.
  • DataContent.consumerBufferSize - Default is 2 * producerBufferSize. The size in bytes of buffers the Data Import Server will allocate. This must be large enough for a producer buffer plus a full binary row.
  • DataContent.userPoolCapacity - Default is 128. The maximum number of user table locations that will be processed concurrently. If more locations are created at the same time, the processing will be serialized.
  • DataContent.producerBufferSize.user - Default is 256 * 1024. The size in bytes of buffers the tailer will allocate for each User table data stream.
  • DataContent.disableUserPool - Default is false. If true, user table locations are processed using the same resources as system tables, in which case user actions can consume unbounded tailer resources.

Miscellaneous Parameters

The following additional parameters control other aspects of the tailer's behavior.

  • log.tailer.defaultFileManager - specifies which binary log file manager is used for a log unless that log's XML configuration entry specifies otherwise. The default binary log file manager handles date-partitioned tables; if a different default binary log file manager is desired, then the parameter should be updated accordingly.
    • log.tailer.defaultFileManager=com.illumon.iris.logfilemanager.DateBinaryLogFileManager
  • log.tailer.defaultIdleTime - if the tailer sends no data for a namespace/table/internal partition/column partition combination for a certain amount of time (the "idle time"), it will terminate the associated threads and stop monitoring that file for changes (these threads will be restarted if new files appear). This parameter specifies the default idle time (in the format HH:mm or HH:mm:ss) after which this occurs. This is normally set for 01:10, which is a little more than the Deephaven logging infrastructure's binary log file rollover time.
  • log.tailer.defaultIdlePauseTime - when this optional property is set, if the tailer sends no data for a namespace/table/internal partition/column partition combination for a certain amount of time ("pause time", shorter than the "idle time"), it will close the connection to the DIS and release threads and related resources, but will continue to monitor the file for changes until the idle time is reached. If a change is detected, the connection is reestablished, and the data is sent. This property has the same format as log.tailer.defaultIdleTime.
  • log.tailer.defaultDirectories - an optional comma-delimited list of directories in which this tailer should look for log files. If not specified, the tailer uses the default value from iris-defaults.prop, which is set to <logroot>/binlogs,<logroot>/binlogs/pel,<logroot>/binlogs/perflogs (typically /var/log/deephaven/binlogs, /var/log/deephaven/binlogs/pel, and /var/log/deephaven/binlogs/perflogs). If iris-defaults.prop is not used or the property has been explicitly set to null, the tailer monitors its process log directory; this defaults to <workspace>/../logs, but can be overridden by setting logDir in the JVM arguments (e.g., -DlogDir=/tmp/Tailer1).
    • log.tailer.defaultDirectories=/Users/app/code/bin1/logs,/Users/app2/data/logs
  • log.tailer.logDetailedFileInfo - a boolean parameter that specifies whether the tailer logs details on every file every time it looks for data. The default value is false, which means the tailer only logs file details when it finds new files.
    • log.tailer.logDetailedFileInfo=false
  • log.tailer.additionalDirectories - an optional comma-delimited list of directories appended to the directories specified in log.tailer.defaultDirectories. This may be used if the default directory/directories are desired along with other non-default directories.
    • log.tailer.additionalDirectories=/Users/app/code/bin1/logs,/Users/app2/data/logs
  • log.tailer.logBytesSent - a boolean parameter that specifies whether the tailer logs information every time it sends data to a destination. The default value is true, which means the tailer logs details every time it sends data to a DIS.
    • log.tailer.logBytesSent=true

Date Binary Log File Manager Parameters

  • log.tailer.timezone - this optional parameter specifies the time zone used for log processing. If it is not specified, the server's time zone is used. It is used in the calculation of path patterns for date-partitioned logs and tables, but not in other aspects of the tailer's operation. The time zone names come from the DBTimeZone class.
    • log.tailer.timezone=TZ_NY
  • log.tailer.internalPartition.prefix - the prefix for the tailer's internal partition, which will be used by any DateBinaryLogFileManager instances to determine the start of the internal partition name, if the name is not being determined from the logs' filenames (the partition suffix from each log's configuration entry will be appended to this prefix to generate the complete internal partition name). If this is not provided, the tailer will attempt to determine the name of the server on which it is running, and use this as a prefix instead. If the tailer cannot determine a name, localhost is used.
    • log.tailer.internalPartition.prefix="source1"

Tailer XML Configuration Parameters

Details of the logs to be tailed are supplied in XML configuration files. The list of XML files is specified by the log.tailer.configs parameter.

Configuration File Structure

Each tailer XML configuration file must start with the <Processes> root element. Under this element, one or more <Process> elements should be defined.

<Process> elements define processes that run on a server, each with a list of Log entries which define the binary log file manager details for the tailer. Each <Process> element must have a name that specifies the application process that generates the specified logs; the log tailer uses the log.tailer.processes property value to determine which <Process> entries to use from its XML configuration files. For example, the following entry specifies two binary log file managers for the db_internal process; a tailer that includes db_internal in its log.tailer.processes parameter would use these binary log file managers. Each parameter is explained below.

<Processes>
    <Process name="db_internal">
        <Log filePatternRegex="^DbInternal\.(?:[^\.]+\.){2}bin\..*"
             tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"
             namespace="DbInternal"
             internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
             path=".bin.$(yyyy-MM-dd)" />
        <Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager" />
    </Process>
</Processes>

If the <Process> name is a single asterisk (*), every tailer will use that Process element's Log entries regardless of its log.tailer.processes value.

Each Log element generates one binary log file manager instance to handle the processing of the associated binary log files. Binary log files are presented to each binary log file manager in the order they appear in the XML configuration files. Once a file manager claims a file, it is not presented to any other file managers.

To see an example of how to set up a new tailer, see the Tailer Quick Start Guide.

Binary Log File Manager Types

Each Log element specifies information about the binary log files which a tailer processes. The attributes within each Log entry vary depending on the type of binary log file manager chosen. Full details on each attribute are provided in the Log Element Attributes section.

Both binary log file managers allow the following optional attributes to override default tailer parameters; each is explained in detail later.

  • idleTime changes the default idle time for when threads are terminated when no data is sent.
  • fileSeparator changes the default file separator from .bin..
  • logDirectory or logDirectories specifies directories in which the binary log file manager will look for files.
  • excludeFromTailerIDs restricts a Log entry from tailers running with the given IDs.
  • maxKBps specifies a Log-level throttle.

Standard Binary Log File Manager

The StandardBinaryLogFileManager looks for files named with the default <namespace>.<table name>.<table type>.<internal partition>.<column partition>.bin.<date-time> format. It will be the most commonly used file manager, as it automatically handles files generated by the common Deephaven logging infrastructure.

  • The namespace, table name, table type, internal partition, and column partition values are all determined by parsing the filename.
  • The optional namespace or namespaceRegex attributes can restrict the namespaces for this binary log file manager.
  • The optional tableName or tableNameRegex attributes can restrict the table names for this binary log file manager.
  • The optional tableType attribute restricts the table type for this binary log file manager.
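For example, a hypothetical Log entry that restricts a standard binary log file manager to a single namespace might look like the following (the namespace value is illustrative; the fileManager class name matches the earlier example):

```xml
<Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
     namespace="CustomerNamespace" />
```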

Date Binary Log File Manager

The DateBinaryLogFileManager requires further attributes to find and process files. It provides many options to help parse filenames, providing compatibility for many different environments.

  • A path attribute is required, and must include a date-format conversion. It may also restrict the start of the filename.
  • If path does not restrict the filename, a filePatternRegex attribute should be used to restrict the files found by the binary log file manager.
  • A namespace or namespaceRegex attribute is required, to determine the namespace for the binary log files.
  • A tableName or tableNameRegex attribute is required, to determine the table name for the binary log files.
  • The optional internalPartitionRegex attribute can be specified to indicate how to determine the internal partition value from the filename. If the internalPartitionRegex attribute is not specified, the internal partition will be built from the log.tailer.internalPartition.prefix property or server name, and the internalPartitionSuffix attribute.
  • The column partition value is always determined from the filename, immediately following the file separator.
  • The table type is taken from the optional tableType attribute. If this is not specified, then the table type is System.
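Combining the attributes above, a hypothetical Log entry for a date binary log file manager might look like the following. This assumes the DateBinaryLogFileManager is the configured default (so no fileManager attribute is needed); the namespace, table, and suffix values are illustrative:

```xml
<Log namespace="CustomerNamespace"
     tableName="Table1"
     path="CustomerNamespace.Table1.bin.$(yyyy-MM-dd)"
     internalPartitionSuffix="DataSource1" />
```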

Custom Binary Log File Managers

The existing binary log file managers will cover most use cases; in particular, the StandardBinaryLogFileManager will automatically find and tail files which use the standard Deephaven filename conventions. If there is a need for additional functionality it can be added by creating a custom BinaryLogFileManager and specifying it in the tailer's XML configuration file. For example, if there's a need to find files from before today during tailer startup, this would require a custom BinaryLogFileManager.

See the javadoc for com.illumon.iris.logfilemanager.BinaryLogFileManager for details on how to do this. It may be easier to extend the StandardBinaryLogFileManager class as a starting point.

Log Element Attributes

Unless otherwise specified, all parameters apply to both regular-expression entries and fully-qualified entries. These are all attributes of a <Log> element.

  • excludeFromTailerIDs - an optional comma-delimited list, to cause this entry to be ignored for a tailer if its tailer ID matches one of the specified tailer IDs from this parameter. It may be used to force a tailer to ignore specific Log elements from a process that it would normally tail.
    • excludeFromTailerIDs="CustomerTailer1"
  • fileManager - a named binary log file manager for this Log. If defined, it can refer to either a named file manager from the tailer's XML configuration files, or a fully-qualified Java class.
    • fileManager="BasicFiles"
    • fileManager="com.customer.iris.CustomLogFileManager"
  • filePatternRegex - a regular expression used to restrict file matches for a date binary log file manager. If the regular expression finds a match, then the file will be handled by the binary log file manager. The match is done through a group matcher, looking for the first matching expression.
    • filePatternRegex="^[A-Za-z0-9_]*\.stats\.bin\.*"
  • fileSeparator - if the files to be handled by this binary log file manager do not include the standard .bin. separator, this specifies the separator to be used.
    • fileSeparator="."
  • idleTime - the time (in the format HH:mm or HH:mm:ss) after which threads created by this binary log file manager will be terminated if no data has been sent.
    • idleTime="02:00"
  • internalPartitionRegex - for date binary log file managers, a regular expression that will be applied to the filename to determine the internal partition. The match is done with a capture group, looking for the first matching expression. If this property is specified, log.tailer.internalPartition.prefix and internalPartitionSuffix are ignored.
    • internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
  • internalPartitionSuffix - for date binary log file managers, the suffix to be added to the internal partition prefix from the log.tailer.internalPartition.prefix parameter to determine the intraday partition for this log.
    • internalPartitionSuffix="DataSource1"
  • logDirectory - an optional parameter that specifies a single directory in which the log files will be found. If neither logDirectory nor logDirectories are supplied, the tailer's default directories will be used. Both logDirectory and logDirectories cannot be supplied in the same Log element.
    • logDirectory="/usr/app/defaultLogDir"
  • logDirectories - an optional comma-delimited list of directories that will be searched for files. If neither logDirectory nor logDirectories are supplied, the tailer's default directories will be used. Both logDirectory and logDirectories cannot be supplied in the same Log element.
    • logDirectories="/usr/app1/bin,/usr/app2/bin"
  • maxKBps - the maximum rate, in kilobytes per second, at which data can be sent to each Data Import Server for this Log element. This is applied per destination, but shared across all logs being tailed for the Log entry. It is applied independently of any destination-specific entries from the property files. See the bandwidth throttling section for further details.
    • maxKBps="1000"
  • namespace - this parameter's meaning depends on the binary log file manager type. For a standard binary log file manager, it restricts the namespace that this file manager processes. If a file arrives for a namespace other than the defined value, it is not handled by this binary log file manager. For a date binary log file manager, it defines the namespace for this log entry. If it is not provided, then namespaceRegex must be supplied to determine the namespace from the filename.
    • namespace="CustomerNamespace"
  • namespaceRegex - this parameter's meaning depends on the binary log file manager type. For a standard binary log file manager, it restricts the namespaces that this file manager processes. If the namespace doesn't match the regular expression, then files for that namespace are not handled by this file manager. For a date binary log file manager, it defines a regular expression that will be applied to the filename to determine the namespace. The match is done with a capture group, looking for the first matching expression.
    • namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"
  • path - for date binary log file managers, the filename pattern for the specified log, not including the directory. This may define a filename prefix and must include a date specification in the format $(<date formatting specification>). A typical entry will be in one of the following formats:
    • <namespace>.<tablename>.bin.$(yyyy-MM-dd) - this restricts the filename to start with a given namespace and table name and specifies that the column partition value part of the filename is in the yyyy-MM-dd format. This will usually be used in a Log element for a specific namespace and table. For example, path="CustomerNamespace.Table1.bin.$(yyyy-MM-dd)".
    • path=".bin.$(yyyy-MM-dd)" - this indicates that there are no restrictions on the filename, and that the column partition value part of the filename is in yyyy-MM-dd format. This will usually be used in combination with regular-expression attributes to determine namespaces and table names, and will usually additionally include a filePatternRegex attribute.
    • For date-partitioned logs, a date format sequence in the format $(<date formatting expression>) is required as it is used to translate between file names and date partition values in the standard Deephaven yyyy-MM-dd date-partitioning format. For example, if the filenames use the format yyyyMMdd, then the format sequence should be $(yyyyMMdd) and the tailer will automatically convert between this format and the Deephaven date partitioning standard. The date format sequence must immediately follow the file separator.
  • tableName - this parameter's meaning depends on the binary log file manager type. For a standard binary log file manager, it restricts the table name that this file manager processes. If a file arrives for a table name other than the defined value, it is not handled by this binary log file manager. For a date binary log file manager, it defines the name of the table for this log entry. If it is not provided, then tableNameRegex must be supplied to determine the table name from the filename.
    • tableName="Table1"
  • tableNameRegex - this parameter's meaning depends on the binary log file manager type. For a standard binary log file manager, it restricts the table names that this file manager processes. If the table name doesn't match the regular expression, then files for that table name are not handled by this file manager. For a date binary log file manager, it defines a regular expression that will be applied to the filename to determine the table name. The match is done with a capture group, looking for the first matching expression.
    • tableNameRegex="^(?:[^\.]+\.){0}([^\.]+).*"
  • tableType - this parameter's meaning depends on the binary log file manager type. For a standard binary log file manager, it restricts the table type that this file manager processes. If a file arrives for a table type other than the defined value, it is not handled by this binary log file manager. For a date binary log file manager, it defines the table type for this log entry. If it is not provided, then a table type of System is used.
    • tableType="User"
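
As a sketch of the date translation described under the path attribute: when filenames use a format such as $(yyyyMMdd), the tailer converts between that form and the standard yyyy-MM-dd column partition value. The equivalent conversion is illustrated below in Python (the date string is hypothetical):

```python
from datetime import datetime

# A hypothetical date string as it would appear in a filename using $(yyyyMMdd).
file_date = "20170401"

# Translate to Deephaven's standard yyyy-MM-dd date-partitioning format.
partition_value = datetime.strptime(file_date, "%Y%m%d").strftime("%Y-%m-%d")
print(partition_value)  # 2017-04-01
```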

Regular Expression Example

It may be helpful to examine a standard set of regular expressions used in common tailer configurations. A common filename pattern is <namespace>.<tablename>.<host>.bin.<date/time>, where the host is used as the internal partition value. The following filename illustrates this format:

EventNamespace.RandomEventLog.examplehost.bin.2017-04-01.162641.013-0600

In this example, the table name can be extracted from the filename with the following regular expression:

tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"

This regular expression works as follows:

  • ^ ensures the regex starts at the beginning of the filename.
  • (?: starts a non-capturing group, which is used to skip past the various parts of the filename separated by periods.
  • [^\.]+ as part of the non-capturing group; matches any character that is not a period, requiring one or more matching characters.
  • \. as part of the non-capturing group; matches a period.
  • ){1} ends the non-capturing group and causes it to be applied 1 time. In matches for other parts of the filename, the 1 will be replaced with how many periods should be skipped; this causes the regex to skip the part of the filename up to and including the first period.
  • ([^\.]+) defines the capturing group, capturing everything up to the next period, which in this case captures the table name.
  • .* matches the rest of the filename, but doesn't capture it.

For this example, the three attributes that would capture the namespace, table name, and internal partition are very similar, only differing by the number that tells the regex how many periods to skip:

  • namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"

  • tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"

  • internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
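
These three patterns can be checked against the example filename with a quick sketch (shown here in Python for illustration; the patterns should behave the same under the tailer's Java regex matching):

```python
import re

# The example filename from above; "examplehost" is the internal partition value.
filename = "EventNamespace.RandomEventLog.examplehost.bin.2017-04-01.162641.013-0600"

namespace_regex          = r"^(?:[^\.]+\.){0}([^\.]+).*"
table_name_regex         = r"^(?:[^\.]+\.){1}([^\.]+).*"
internal_partition_regex = r"^(?:[^\.]+\.){2}([^\.]+).*"

# Each pattern skips a fixed number of period-delimited fields,
# then captures the next field.
print(re.match(namespace_regex, filename).group(1))           # EventNamespace
print(re.match(table_name_regex, filename).group(1))          # RandomEventLog
print(re.match(internal_partition_regex, filename).group(1))  # examplehost
```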

The above example could result in the following regular expression Log entry, which would search for all files that match the <namespace>.<tablename>.<host>.bin.<date/time> pattern in the default log directories:

<Log filePatternRegex="^[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.bin\..*"
     namespaceRegex="^(?:[^\.]+\.){0}([^\.]+).*"
     tableNameRegex="^(?:[^\.]+\.){1}([^\.]+).*"
     internalPartitionRegex="^(?:[^\.]+\.){2}([^\.]+).*"
     runTime="25:00:00"
     path=".bin.$(yyyy-MM-dd)" />

Direct-to-DIS CSV Tailing

Deephaven supports tailing CSV files directly to the Data Import Server. To configure CSV tailing:

  1. Create and properly configure a schema with an ImportSource element. This schema may be handwritten or generated via Schema Discovery.
  2. Set up appropriate Process and Log elements (see Tailer XML Configuration Parameters).
  3. Mark the log as CSV format by adding the attribute fileFormat="CSV" to the Log element.
  4. Add a new tag <Metadata> to provide the DIS with important information it needs to properly process the CSV files.

The <Metadata> tag should contain a collection of <Item name="name" value="value"/> entries. The valid Items are:

  • importSourceName - (Required) The name of a valid ImportSource from the table’s schema. The DIS will use this description to parse the CSV.
  • charset - (Optional) The charset encoding of the files. If omitted, this defaults to UTF-8.
  • format - (Optional) The format of the CSV file. This may be any one of the supported CSV formats except BPIPE (see Importing CSV files).
  • hasHeader - (Optional) Indicate if the file has a header line or not. This defaults to true.
  • delimiter - (Optional) The delimiter between CSV values. If omitted, this uses the delimiter of the specified format.
  • trim - (Optional) Specify that CSV fields should be trimmed before they are parsed.
  • rowsToSkip - (Optional) The number of rows (excluding the header) to skip from each file tailed.
  • constantValue - (Optional) The value to use for constant fields.

Once these have been configured and the tailer started, any CSV files produced in the specified paths will be tailed directly to the DIS.

An example follows:

<Processes>
    <Process name="csv_data">
        <Log name="MyTableName"
            namespace="MyNamespace"
            fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
            logDirectory="/path/to/your/data"
            idleTime="04:00"
            fileSeparator=".csv."
            fileFormat="CSV">
           <Metadata>
             <Item name="importSourceName" value="IrisCSV"/>
             <Item name="trim" value="false"/>
             <Item name="hasHeader" value="true"/>
             <Item name="constantValue" value="SomeValue"/>
           </Metadata>
        </Log>
    </Process>
</Processes>

Transactional Processing

In many cases it is desirable to process sets of rows as a transaction. There are two different methods for supporting this paradigm.

Transaction column

The first, and best, method is to designate a specific column in the CSV as a “Transaction” column. This gives the source that is writing the CSV the ability to control when transactions are started and completed. To enable this mode of processing you must add the rowFlagsColumn metadata item to the tailer configuration file.

<Item name="rowFlagsColumn" value="rowFlags"/>

This designates the column "rowFlags" as a special column. Transactions are started by writing a row with "StartTransaction" in the "rowFlags" column. Subsequent rows that are part of the transaction should have "None" in the "rowFlags" column; to end the transaction, write a row with "EndTransaction" in the "rowFlags" column.

You may write “SingleRow” for rows that should be logged on their own, not part of a transaction.
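
For illustration, a transactional CSV using a "rowFlags" column might look like the following (the column names and values other than rowFlags are hypothetical):

```csv
rowFlags,Symbol,Price
StartTransaction,AAPL,150.25
None,MSFT,310.10
EndTransaction,GOOG,2750.00
SingleRow,TSLA,900.50
```

Here the first three data rows form a single transaction, and the final row is logged on its own.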

Caution

Note that, while a transaction is in progress, writing "SingleRow" or "StartTransaction" will abort it!

Automatic Transactions

There may be cases where the format of the CSV may not be changeable. In these cases you may use automatic transactions to accomplish a similar goal.

Caution

Note that automatic transactions are not recommended unless they are absolutely necessary.

Automatic transactions ensure that every row that is logged is part of a transaction. When a new file is encountered or no rows have been received within the timeout window the current transaction is completed and subsequent rows will begin a new transaction.

To enable automatic transactions, add the following setting to the tailer configuration metadata:

<Item name="transactionMode" value="NewFileDeadline"/>
<Item name="transactionDeadlineMs" value="30000"/>

The first setting enables automatic transactions; the second specifies the inactivity timeout, in milliseconds, after which the system automatically completes the current transaction.

Running a remote tailer on Linux

The Deephaven log tailer (the LogtailerMain class) should ideally run on the same system where log files are generated. It is strongly recommended that the logger and tailer processes use fast, local storage, to avoid any latency or concurrency issues that might be introduced by network filesystems.

Setup process

To set up an installation under which to run the LogtailerMain class:

  1. Install Java SE runtime components, including the Java JDK (devel package). The tailer can sometimes be run with a JRE, but if advanced non-date partition features or complex data routing filtering are used, a JDK will be required. This should match the version of Java installed on the Deephaven cluster servers. Presence and version of a JDK can be verified with javac -version.
  2. Create a directory for a tailer launch command, configuration files, and workspace and logging directories. This directory must be readable and writable by the account under which the tailer will run.
  3. Obtain the JAR and configuration files from a Deephaven installation (either copied from a server, or installed and/or copied from a client installation using the Deephaven Launcher), or use the Deephaven Updater to synchronize these from the Client Update Service. The Deephaven Updater is the recommended approach, as it allows automation of runtime updates for headless systems.
cd /home/mytaileraccount
wget "https://mydhserver.mydomain.com:8443/DeephavenLauncher_9.02.tar"
tar xf DeephavenLauncher_9.02.tar
/home/mytaileraccount/DeephavenLauncher/DeephavenUpdater.sh Deephaven https://mydhserver.mydomain.com:8443/iris/

Note

The current version of the DeephavenLauncher tar will vary from release to release, and the actual URL will vary with your installation. See Find the Client Update Service Url to determine the correct values.

Deephaven in the above example is the friendly name for this instance. Any name can be used, but a name without spaces or special characters will be easier to use in scripts.

The Deephaven Updater should be re-run periodically to ensure the tailer has up-to-date binaries and configuration from the Deephaven server. We recommend running it before each tailer startup as part of the tailer start script.

  4. Create the following directories, and ensure the account under which the tailer will run has access to them:
  • /var/log/deephaven/tailer - the process logging path. It is sufficient, and recommended, to create /var/log/deephaven and grant the tailer account read/write access to this path. An alternate logging path can be configured with a custom value passed to the -DlogDir JVM argument when starting the tailer.
  • /var/log/deephaven/binlogs - the root path for binary log files, and the recommended location for local loggers to write their log files.
  • /var/deephaven/run - the account will need read/write access here to create and delete PID files (can be overridden with a custom JVM argument -DpidFileDirectory).
  • /var/log/deephaven/binlogs/pel and /var/log/deephaven/binlogs/perflogs - these directories are monitored by default by all tailers; other Deephaven processes may use them on the remote tailer host.
  5. Create a launch command to provide the correct classpath, properties file, and other installation-specific values needed to run the tailer process.
#!/bin/bash
java -cp "/home/mytaileraccount/iris/.programfiles/Deephaven/resources":"/home/mytaileraccount/iris/.programfiles/Deephaven/java_lib/*":"/home/mytaileraccount/iris/.programfiles/Deephaven":"/home/mytaileraccount/iris" \
-server -Xmx4096m -DConfiguration.rootFile=iris-common.prop \
-Dworkspace="/home/mytaileraccount/Tailer/workspaces" \
-Dlog.tailer.configs=tailer.xml \
-Ddevroot="/home/mytaileraccount/iris/.programfiles/Deephaven/java_lib/" \
-Ddh.config.client.bootstrap=/home/mytaileraccount/iris/.programfiles/Deephaven/dh-config/clients \
-Dprocess.name=tailer \
-Dservice.name=iris_console \
-Dlog.tailer.processes=monitoring \
-Dintraday.tailerID=1 com.illumon.iris.logtailer.LogtailerMain

Note

The classpath paths include Deephaven configuration files (resources), jars (java_lib/*), a root path for other configuration (Deephaven), and, in this example, the directory that contains the tailer config XML file (iris).

  6. Configure startup for the tailer process, for example with a systemd unit file (e.g., /etc/systemd/system/tailer1.service):
[Unit]
Description=Deephaven Tailer 1
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service
StartLimitIntervalSec=100
StartLimitBurst=5

[Service]
User=mytaileraccount
Group=mytailergroup
ExecStart=/home/mytaileraccount/iris/tailer-start.sh
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
Then enable and start the service:

sudo systemctl enable tailer1.service
sudo systemctl start tailer1.service

Management and logging

The remote tailer can be managed using systemd commands:

sudo systemctl status tailer1.service
sudo systemctl stop tailer1.service

Logging for the tailer process will be under /var/log/deephaven/tailer, unless an alternate path is passed to the -DlogDir JVM argument.

Running a remote tailer on Windows

The Deephaven log tailer (the LogtailerMain class) should ideally be run on the same system where log files are being generated. Since this class has no dependency on *nix filesystem or scripting functionality, it can be run under Java as a Windows process or service when a Windows system is generating binary log data.

Requirements

To set up an installation under which to run the LogtailerMain class:

  1. Install Java SE runtime components or use the Illumon Launcher installation that includes the Java JDK. The tailer can sometimes be run with a JRE, but if advanced non-date partition features or complex data routing filtering are used, a JDK will be required.
  2. Create a directory for a tailer launch command, configuration files, and workspace and logging directories. This directory must be readable and writable by the account under which the tailer will run.
  3. Create the following directories, and ensure the account under which the tailer will run has access to them:
  • \var\log\deephaven\tailer - the process logging path. It is sufficient, and recommended, to create \var\log\deephaven and grant the tailer account read/write access to this path. An alternate logging path can be configured with a custom value passed to the -DlogDir JVM argument when starting the tailer.
  • \var\log\deephaven\binlogs - the root path for binary log files, and the recommended location for local loggers to write their log files.
  • \var\deephaven\run - the account will need read/write access here to create and delete PID files (can be overridden with a custom JVM argument -DpidFileDirectory).
  • \var\log\deephaven\binlogs\pel and \var\log\deephaven\binlogs\perflogs - these directories are monitored by default by all tailers; other Deephaven processes may use them on the remote tailer host.
  4. Obtain the JAR and configuration files from a Deephaven installation (either copied from a server, or installed and/or copied from a client installation using the Deephaven Launcher), or use the Deephaven Updater to synchronize these from the Client Update Service. We recommend using the Deephaven Updater as it can automate runtime updates for headless systems.

  5. Create a launch command to provide the correct classpath, properties file, and other installation-specific values needed to run the tailer process.

  6. Configure startup for the tailer process.

Note

Since Windows does not support inotify, it will be necessary to configure tailers on Windows to use the PollWatchService.

Example Directory

img

This directory was created under Program Files, where administrative rights will be needed by the account under which the tailer will run. In this example, the tailer is being run as a service under LocalSystem, and the Deephaven resources and jars were manually synchronized to this path. Specific accounts with more limited rights than LocalSystem would be valid as well. The key requirements are that the account must be able to write to the logs and workspace directories, must be able to read from all files to be tailed and all configuration and JAR files, and must have sufficient network access to contact the DIS.

Note that this directory contains a tailerConfig XML file, which specifies the tailer's configuration (i.e., which binary log files it will tail) and a property file, which specifies other tailer configuration options, including where the tailer will send the data. For full details on tailer configuration see Tailer Configuration. This directory also contains a logs and a workspace directory, which can be created empty, and a java_lib folder that was copied, with its JAR files, from the directory structure set up by the Deephaven Launcher client installer.

As new releases of Deephaven are applied to an environment, the JAR files in java_lib must be updated so that any new tailer or tailer support functionality is also available to Windows-based tailer processes. The Deephaven Updater makes it simple to automate this update.

We recommend running the Updater before each tailer startup as part of the tailer start script.

Launch Command

The launch command in this example is a Windows cmd file. It contains (effectively) one line (^ is the line continuation character for Windows .bat and .cmd files):

java -cp^
 "[instance_root]\[instance_name]\resources";"[instance_root]\[instance_name]\java_lib\*";"[location_of_tailerConfig.xml_file]"^
 -server -Xmx4096m^
 -DConfiguration.rootFile=iris-common.prop^
 -Dworkspace="[instance_root]\[instance_name]\workspaces"^
 -Ddevroot="[instance_root]\[instance_name]\java_lib"^
 -Dprocess.name=tailer^
 -Dservice.name=iris_console^
 -Dlog.tailer.processes=monitoring^
 -Ddh.config.client.bootstrap="[instance_root]\[instance_name]\dh-config\clients"^
 -Dintraday.tailerID=1 com.illumon.iris.logtailer.LogtailerMain

The components of this command follow:

  • java - Must be in the path and invoke Java matching the version used by the Deephaven server installation.
  • -cp "[instance_root]\[instance_name]\resources";"[instance_root]\[instance_name]\java_lib\*";"[location_of_tailerConfig.xml_file]" - The class path to be used by the process.
    • The first part is the path to configuration files used for the Deephaven environment, such as properties files and the data routing YAML file.
    • The second part is the path to the JAR files from the Deephaven server or client install.
    • The third part points to the directory that contains the configuration XML file that the process will need.
  • -server - Run Java for a server process.
  • -Xmx4096m - How much memory to give the process. This example uses 4GB; smaller installations (tailing fewer files concurrently) will not need this much memory. 1GB will be sufficient for most environments, but if many logs are tailed or data throughput is fairly high, this may need to be increased.
  • -DConfiguration.rootFile=iris-common.prop - The initial properties file to read. This file can include other files, which must also be in the class path.
    • iris-common.prop is the default unified configuration file for Deephaven processes.
    • It is also possible to specify or override many tailer settings by including other -D entries instead of writing them into the properties file.
  • -Dworkspace="[instance_root]\[instance_name]\workspaces" - The working directory for the process.
  • -Ddevroot="[instance_root]\[instance_name]\java_lib" - Path to the Deephaven binaries (JARs); the same directory as the second element of the class path in this example.
  • -Dprocess.name=tailer - The process name for this process.
  • -Dservice.name=iris_console - Use stanzas from properties files suitable for a remote client process.
  • -Dlog.tailer.processes=[comma-separated list] - Process names to be found in the tailer config XML files.
  • -Ddh.config.client.bootstrap="[instance_root]\[instance_name]\dh-config\clients" - specifies where to find configuration server host and port details.
  • -Dintraday.tailerID=1 - ID for the tailer process that must match tailer instance specific properties from the properties file.
  • com.illumon.iris.logtailer.LogtailerMain - The class to run.

Other properties that could be included, but are often set in the properties file instead:

  • -Dlog.tailer.enabled.1=true
  • -Dlog.tailer.configs=tailerConfig.xml
  • -DpidFileDirectory=c:/temp/run/illumon

Also, if logging to the console is desired (e.g., when first configuring the process, or troubleshooting), -DLoggerFactory.teeOutput=true will enable log messages to be teed to both the process log file and the console.

Warning

This should not be set to true in a production environment. It will severely impact memory usage and performance.

If the pidFileDirectory is not overridden, the process will expect to find C:\var\run\illumon in which to write a pid file while it is running.

The process will also expect C:\var\log\deephaven\binlogs\pel and C:\var\log\deephaven\binlogs\perflogs to exist at startup. These directories should be created as part of the tailer setup process.

Automating Execution of the Tailer

There are several options to automate execution of the LogtailerMain process. Two possibilities are the Windows Task Scheduler or adding the process as a Windows Service.

The Task Scheduler is a Windows feature, and can be used to automatically start, stop, and restart tasks. Previous versions of the tailer process needed to be restarted each day, but current versions can be left running continuously.

An easy way to configure the launch command to run as a Windows Service is to use NSSM (The Non-Sucking Service Manager). This free tool initially uses a command line to create a new service:

nssm install "service_name"

The tool will then launch a UI to allow the user to browse to the file to be executed as a service (e.g., runTailer.cmd, in this example), and to specify other options like a description for the service, startup type, and the account under which the service should run. Once completed, this will allow the tailer to be run and managed like any other Windows service.

img

Tailer Quick Start Guide

The following is a quick start guide to setting up a simple new tailer called Customer1 that runs on a Deephaven server and looks for binary logs for one table in the standard locations.

  1. Edit the host configuration file, and add an entry for the new tailer. Your system may have multiple host configuration files, so edit the customer-editable file that won't be overwritten during the next upgrade:

sudo vi /etc/sysconfig/illumon.confs/illumon.iris.hostconfig

The new entry should be in the existing tailer) block.

  • If another tailer has not already been added, this will be empty. Add the case statement and the rest of the parameters.
  • If another tailer has already been added, the case statement will already exist. Just add the block for the new tailer.

These values will override any values in the tailer) block in the Deephaven-maintained file for this Customer1 tailer. If this is the first extra tailer, the block will typically look like the following. In this example, the RUN_AS and WORKSPACE match the values in the Deephaven-maintained file, but are shown for clarity.

    tailer)
        case "$proc_suffix" in
            Customer1)
                PROCESSES="customer_app"
                EXTRA_ARGS="-j -Dservice.name=tailerCustomer1"
                RUN_AS=irisadmin
                WORKSPACE=/db/TempFiles/$RUN_AS/$proc$proc_suffix
                ;;
        esac
        ;;
  2. Add a monit entry for the new tailer.

sudo -u irisadmin vi /etc/sysconfig/illumon.d/monit/tailerCustomer1.conf

The new file should contain the following:

check process tailerCustomer1 with pidfile /etc/deephaven/run/tailerCustomer1.pid
    start program = "/usr/illumon/latest/bin/iris start tailer Customer1"
    stop program  = "/usr/illumon/latest/bin/iris stop tailer Customer1"
  3. Create a new stanza within the iris-environment.prop file for the new tailer. This uses a stanza with the service.name specified in the host configuration file.
  • Specify the XML configuration file it will use.

  • Specify that it is enabled.

Assuming you're running on a typical Deephaven server, you'll need to export the property file from etcd before editing, and import it after editing.

[service.name=tailerCustomer1] {
    log.tailer.enabled.tailerCustomer1=true
    log.tailer.configs=tailerConfigCustomer1.xml
}
  4. Create the tailer's XML file to specify the binary log(s) to tail. This filename must match the name specified in the properties file, and be in a location on the Java classpath, typically /etc/sysconfig/illumon.d/resources.

sudo -u irisadmin vi /etc/sysconfig/illumon.d/resources/tailerConfigCustomer1.xml

A simple XML file could have the following details. Adjust the values of the tableName and namespace attributes so that they contain the appropriate values.

<Processes>
  <Process name="customer_app">
    <Log fileManager="com.illumon.iris.logfilemanager.StandardBinaryLogFileManager"
      tableName="CustomerTable"
      namespace="CustomerNamespace"/>
  </Process>
</Processes>

Warning

Unless the iris-defaults.prop value for log.tailer.defaultDirectories has been overridden, each tailer will monitor /var/log/deephaven/binlogs, /var/log/deephaven/binlogs/pel, and /var/log/deephaven/binlogs/perflogs. This is in addition to any directories added with logDirectory in the tailer's config xml, and any values set in the log.tailer.additionalDirectories property.

  5. If needed, update the system's routing YAML to specify where the table should be sent. See Data routing service via YAML for further details.

  6. Ensure that monit reloads its configuration to see the new tailer.

sudo -u irisadmin monit reload

  7. If the new tailer hasn't started with the monit reload, start it now.

sudo -u irisadmin monit start tailerCustomer1

The tailer should start and create its own logs, indicating that it's finding and tailing the appropriate binary log files. The logs will go in the standard tailer log location, typically /var/log/deephaven/tailer.