# CSV Schema Inference
CSV schema inference in Deephaven streamlines the process of importing tabular data by automatically generating table schemas from your CSV files. This automation saves time and reduces errors, especially when working with new, evolving, or unfamiliar datasets.
The tool scans a sample of your data to infer column names, data types, and other schema details. Column names are typically derived from headers, but options exist for headerless files.

For non-String columns, all values are checked to ensure the inferred type is valid for every entry. If a value doesn't fit the current type (for example, a float in an integer column), the type is promoted; if a String is encountered, the column is marked as a String and further type checking stops. If a column is empty for all rows, it is marked as a String and a warning is logged—review these cases before adding the schema to Deephaven.

Date/time columns are handled specially: the tool attempts to match date/time strings to known formats, but if multiple formats are present, the column is marked as a String. Using a representative sample of your data helps ensure accurate inference and smoother onboarding.
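As a concrete illustration, here is a minimal sketch (hypothetical file path and values) of how mixed numeric values drive type promotion:

```bash
# Hypothetical sample used only to illustrate inference behavior.
cat > /tmp/promote.csv <<'EOF'
id,price,note
1,10,ok
2,10.5,fine
EOF
# With default settings, 'id' is inferred as long, 'price' starts as an
# integer type but is promoted to double when 10.5 is seen, and 'note'
# is inferred as String.
```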
## When to use CSV schema inference
- Ingesting evolving or unfamiliar CSV data sources.
- Automating schema creation for new tables.
- Working with CSV files where manual schema definition is impractical.
## Input requirements
- CSV file should have a header row with column names (or specify options for headerless files).
- Supported delimiters: comma (default), tab, semicolon, etc.; a non-comma delimiter can be specified with `--delimiter` (see the sketch after this list).
- Example of a valid CSV file:

  ```csv
  name,age,score
  Alice,30,85.5
  Bob,25,90.0
  ```
- Ensure consistent number of columns in each row.
- File should be UTF-8 encoded.
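For files that do not use commas, the delimiter can be overridden at schema-creation time; a minimal sketch, assuming a hypothetical semicolon-delimited file:

```bash
# --delimiter (or -fd) overrides the default comma; when set, any
# --fileFormat value is ignored. Paths and names here are hypothetical.
iris_exec csv_schema_creator -- --namespace ExampleNS --tableName ExampleTable \
  --sourceFile /data/semicolons.csv --delimiter ";" --schemaPath /tmp
```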
## Example
This example creates `CSVExampleNamespace.CSVExampleTableName.schema` in the `/tmp` directory:

```bash
iris_exec csv_schema_creator -- --namespace CSVExampleNamespace --tableName CSVExampleTableName --sourceFile /data/sample.csv --schemaPath /tmp
```
- `--namespace`: The namespace for the new schema.
- `--tableName`: The name of the table to create.
- `--sourceFile`: Path to your CSV file.
- `--schemaPath`: Output directory for the schema file.
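Given the sample file from the input requirements above (`name,age,score`), the generated schema can be inspected directly. The XML shown in the comments below is a rough sketch of the expected shape, not version-exact output; with default settings, integer columns are inferred as `long` and floating-point columns as `double`:

```bash
cat /tmp/CSVExampleNamespace.CSVExampleTableName.schema
# Approximate shape (attributes and ordering vary by Deephaven version):
# <Table namespace="CSVExampleNamespace" name="CSVExampleTableName" ...>
#   <Column name="name"  dataType="String" ... />
#   <Column name="age"   dataType="long"   ... />   (default integer type)
#   <Column name="score" dataType="double" ... />
# </Table>
```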
## Command Reference
```bash
iris_exec csv_schema_creator <launch args> -- <schema creator args>
```
The following arguments are available when running the CSV schema creator:
| Argument | Description |
|---|---|
| `-ns` or `--namespace <namespace>` | (Required) The namespace to use for the new schema. |
| `-tn` or `--tableName <name>` | (Required) The table name to use for the new schema. |
| `-sp` or `--schemaPath <path>` | An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace. |
| `-sf` or `--sourceFile <file name or file path and name>` | (Required) The name of the CSV file to read. This file must have a header row with column names. |
| `-fd` or `--delimiter <delimiter character>` | (Optional) Field delimiter. Allows specification of a character other than the file format default as the field delimiter. If a delimiter is specified, `fileFormat` is ignored. This must be a single character. |
| `-ff` or `--fileFormat <format name>` | (Optional) The Apache Commons CSV parser is used to parse the file itself. Five common formats are supported, including `TRIM` (see `-tr` below). |
| `-pc` or `--partitionColumn` | (Optional) Name for the partitioning column in the generated schema. If not provided, the importer will default to "Date" for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]". |
| `-gc` or `--groupingColumn` | (Optional) Column name that should be marked as `columnType="Grouping"` in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns. |
| `-spc` or `--sourcePartitionColumn` | (Optional) Column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column "source_date", specify "source_date" with this option (this is the column name in the data source, not the Deephaven column name). |
| `-sn` or `--sourceName <name for ImportSource block>` | (Optional) Name to use for the generated ImportSource block in the schema. If not provided, the default of "IrisCSV" will be used. |
| `-sl` or `--skipHeaderLines <integer value>` | (Optional) Number of lines to skip from the beginning of the file before expecting the header row. If not provided, the first line is used as the header row. |
| `-fl` or `--setSkipFooterLines <integer value>` | (Optional) Number of footer lines to skip from the end of the file. |
| `-lp` or `--logProgress` | If present, additional informational logging will be provided with progress updates during the parsing process. |
| `-bf` or `--bestFit` | If present, the schema creator will attempt to use the smallest numeric types that fit the data in the CSV. For example, for integer values, `short` will be used if all values are in the range of -32768 to 32767. If larger values are seen, the type will be promoted to `int` and eventually `long`. The default behavior, without `-bf`, is to use `long` for all integer columns and `double` for all floating-point columns. |
| `-tr` or `--trim` | Similar to the `TRIM` file format, but adds leading/trailing whitespace trimming to any format. For a comma-delimited file with extra whitespace, `-ff TRIM` is sufficient, but for a file using a delimiter other than a comma, `-tr` should be used in addition to `-ff` or `-fd`. |
| `-om` or `--outputMode` | Either `SAFE` (default) or `REPLACE`. When `SAFE`, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to `REPLACE`, a pre-existing file will be overwritten. |
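These arguments compose; the following is a minimal sketch (hypothetical paths and names) that skips two preamble lines before the header, requests best-fit numeric types, trims whitespace, logs progress, and overwrites any pre-existing schema file:

```bash
iris_exec csv_schema_creator -- \
  --namespace ExampleNS \
  --tableName PaddedTable \
  --sourceFile /data/padded.csv \
  --skipHeaderLines 2 \
  --bestFit \
  --trim \
  --logProgress \
  --outputMode REPLACE \
  --schemaPath /tmp
```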
## Troubleshooting
- Malformed CSV: Ensure your file is well-formed and uses a consistent delimiter.
- Header issues: The schema creator expects a header row with column names; if other lines precede the header, skip them with `--skipHeaderLines`.
- Inconsistent columns: Make sure all rows have the same number of columns (see the quick checks after this list).
- Encoding issues: Ensure your file is UTF-8 encoded.
- Unexpected results: Review your input file for consistency in structure and field names.
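Standard Unix tools (not part of Deephaven) cover most of these checks before the schema creator is run; a quick sketch, assuming a comma-delimited file:

```bash
file -i /data/sample.csv        # charset should report utf-8 (use -I on macOS)
# Count fields per row; expect a single value. This is a naive check:
# quoted fields that contain commas will skew the count.
awk -F',' '{ print NF }' /data/sample.csv | sort -u
```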
## Related documentation
- Schemas: Deephaven schema management overview.
- JSON Schema Inference
- JDBC Schema Inference
- XML Schema Inference
- Avro & Protobuf Schema Inference