# CSV Schema Inference
CSV schema inference in Deephaven streamlines the process of importing tabular data by automatically generating table schemas from your CSV files. This automation saves time and reduces errors, especially when working with new, evolving, or unfamiliar datasets.
The tool scans a sample of your data to infer column names, data types, and other schema details. Column names are typically derived from headers, but options exist for headerless files.

For non-String columns, all values are checked to ensure the inferred type is valid for every entry. If a value doesn't fit the current type (for example, a float in an integer column), the type is promoted; if a String is encountered, the column is marked as a String and further type checking stops. If a column is empty for all rows, it is marked as a String and a warning is logged—review these cases before adding the schema to Deephaven.

Date/time columns are handled specially: the tool attempts to match date/time strings to known formats, but if multiple formats are present, the column is marked as a String. Using a representative sample of your data helps ensure accurate inference and smoother onboarding.
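As a concrete illustration, here is a minimal sketch (hypothetical file path and values) of how mixed numeric values drive type promotion:

```bash
# Hypothetical sample used only to illustrate inference behavior.
cat > /tmp/promote.csv <<'EOF'
id,price,note
1,10,ok
2,10.5,fine
EOF
# With default settings, 'id' is inferred as long, 'price' starts as an
# integer type but is promoted to double when 10.5 is seen, and 'note'
# is inferred as String.
```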
## When to use CSV schema inference
- Ingesting evolving or unfamiliar CSV data sources.
- Automating schema creation for new tables.
- Working with CSV files where manual schema definition is impractical.
## Input requirements
- CSV file should have a header row with column names (or specify options for headerless files).
- Supported delimiters: comma (default), tab, semicolon, etc.; a non-comma delimiter can be specified with `--delimiter` (see the sketch after this list).
- Example of a valid CSV file:

  ```csv
  name,age,score
  Alice,30,85.5
  Bob,25,90.0
  ```
- Ensure consistent number of columns in each row.
- File should be UTF-8 encoded.
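For files that do not use commas, the delimiter can be overridden at schema-creation time; a minimal sketch, assuming a hypothetical semicolon-delimited file:

```bash
# --delimiter (or -fd) overrides the default comma; when set, any
# --fileFormat value is ignored. Paths and names here are hypothetical.
iris_exec csv_schema_creator -- --namespace ExampleNS --tableName ExampleTable \
  --sourceFile /data/semicolons.csv --delimiter ";" --schemaPath /tmp
```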
## Example
This example creates `CSVExampleNamespace.CSVExampleTableName.schema` in the `/tmp` directory:

```bash
iris_exec csv_schema_creator -- --namespace CSVExampleNamespace --tableName CSVExampleTableName --sourceFile /data/sample.csv --schemaPath /tmp
```
- `--namespace`: The namespace for the new schema.
- `--tableName`: The name of the table to create.
- `--sourceFile`: Path to your CSV file.
- `--schemaPath`: Output directory for the schema file.
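Given the sample file from the input requirements above (`name,age,score`), the generated schema can be inspected directly. The XML shown in the comments below is a rough sketch of the expected shape, not version-exact output; with default settings, integer columns are inferred as `long` and floating-point columns as `double`:

```bash
cat /tmp/CSVExampleNamespace.CSVExampleTableName.schema
# Approximate shape (attributes and ordering vary by Deephaven version):
# <Table namespace="CSVExampleNamespace" name="CSVExampleTableName" ...>
#   <Column name="name"  dataType="String" ... />
#   <Column name="age"   dataType="long"   ... />   (default integer type)
#   <Column name="score" dataType="double" ... />
# </Table>
```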
## Command Reference
```bash
iris_exec csv_schema_creator <launch args> -- <schema creator args>
```
The following arguments are available when running the CSV schema creator:
| Argument | Description |
|---|---|
| `-ns` or `--namespace <namespace>` | (Required) The namespace to use for the new schema. |
| `-tn` or `--tableName <name>` | (Required) The table name to use for the new schema. |
| `-sp` or `--schemaPath <path>` | An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace. |
| `-sf` or `--sourceFile <file name or file path and name>` | (Required) The name of the CSV file to read. This file must have a header row with column names. |
| `-fd` or `--delimiter <delimiter character>` | (Optional) Field delimiter. Allows specification of a character other than the file format default as the field delimiter. If a delimiter is specified, `fileFormat` is ignored. This must be a single character. |
| `-ff` or `--fileFormat <format name>` | (Optional) The Apache Commons CSV parser is used to parse the file itself. Five common formats are supported, including `TRIM` (see `-tr` below). |
| `-pc` or `--partitionColumn` | (Optional) Name for the partitioning column in the generated schema. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]". |
| `-gc` or `--groupingColumn` | (Optional) Column name that should be marked as `columnType="Grouping"` in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns. |
| `-spc` or `--sourcePartitionColumn` | (Optional) Column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name). |
| `-sn` or `--sourceName <name for ImportSource block>` | (Optional) Name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisCSV" will be used. |
| `-sl` or `--skipHeaderLines <integer value>` | (Optional) Number of lines to skip from the beginning of the file before expecting the header row. If not provided, the first line is used as the header row. |
| `-fl` or `--setSkipFooterLines <integer value>` | (Optional) Number of footer lines to skip from the end of the file. |
| `-lp` or `--logProgress` | If present, additional informational logging will be provided with progress updates during the parsing process. |
| `-bf` or `--bestFit` | If present, the schema creator will attempt to use the smallest numeric types that fit the data in the CSV. For example, for integer values, `short` will be used if all values are in the range of -32768 to 32767. If larger values are seen, the type will be promoted to `int` and eventually `long`. The default behavior, without `-bf`, is to use `long` for all integer columns and `double` for all floating-point columns. |
| `-tr` or `--trim` | Similar to the `TRIM` file format, but adds leading/trailing whitespace trimming to any format. For a comma-delimited file with extra whitespace, `-ff TRIM` is sufficient, but for a file using a delimiter other than a comma, `-tr` should be used in addition to `-ff` or `-fd`. |
| `-om` or `--outputMode` | Either `SAFE` (default) or `REPLACE`. When `SAFE`, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to `REPLACE`, a pre-existing file will be overwritten. |
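These arguments compose; the following is a minimal sketch (hypothetical paths and names) that skips two preamble lines before the header, requests best-fit numeric types, trims whitespace, logs progress, and overwrites any pre-existing schema file:

```bash
iris_exec csv_schema_creator -- \
  --namespace ExampleNS \
  --tableName PaddedTable \
  --sourceFile /data/padded.csv \
  --skipHeaderLines 2 \
  --bestFit \
  --trim \
  --logProgress \
  --outputMode REPLACE \
  --schemaPath /tmp
```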
## Troubleshooting
- Malformed CSV: Ensure your file is well-formed and uses a consistent delimiter.
- Header issues: The schema creator expects a header row with column names; if other lines precede the header, skip them with `--skipHeaderLines`.
- Inconsistent columns: Make sure all rows have the same number of columns (see the quick checks after this list).
- Encoding issues: Ensure your file is UTF-8 encoded.
- Unexpected results: Review your input file for consistency in structure and field names.
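Standard Unix tools (not part of Deephaven) cover most of these checks before the schema creator is run; a quick sketch, assuming a comma-delimited file:

```bash
file -i /data/sample.csv        # charset should report utf-8 (use -I on macOS)
# Count fields per row; expect a single value. This is a naive check:
# quoted fields that contain commas will skew the count.
awk -F',' '{ print NF }' /data/sample.csv | sort -u
```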
## Related documentation
- Schemas: Deephaven schema management overview.
- JSON Schema Inference
- JDBC Schema Inference
- XML Schema Inference
- Avro & Protobuf Schema Inference