CSV Schema Inference

CSV schema inference in Deephaven streamlines the process of importing tabular data by automatically generating table schemas from your CSV files. This automation saves time and reduces errors, especially when working with new, evolving, or unfamiliar datasets.

The tool scans a sample of your data to infer column names, data types, and other schema details. Column names are typically derived from the header row, but options exist for headerless files. For non-String columns, every scanned value is checked to ensure the inferred type is valid for it. If a value doesn't fit the current type (for example, a float in an integer column), the type is promoted to one that can hold it; if a value can only be read as a String, the whole column is marked as a String. Because inference sees only the rows it scans, using a representative sample of your data helps ensure accurate inference and smoother onboarding.
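
The promotion rule described above can be illustrated with a small Python sketch. It is a simplified model, not Deephaven's implementation: the three-step ladder (int, float, str) and the function names classify and infer_column_type are assumptions made for illustration, and the real inference distinguishes more types than these.

```python
# Simplified model of type promotion during schema inference.
# The ladder and the function names are illustrative assumptions,
# not Deephaven's actual implementation.
LADDER = ["int", "float", "str"]  # narrowest to widest


def classify(value: str) -> str:
    """Return the narrowest type on the ladder that can hold `value`."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"


def infer_column_type(values):
    """Start at the narrowest type and promote whenever a value does not fit."""
    current = "int"
    for value in values:
        needed = classify(value)
        if LADDER.index(needed) > LADDER.index(current):
            current = needed  # promote; never demote
        if current == "str":
            break  # nothing is wider than a string
    return current


print(infer_column_type(["12", "123", "1234"]))  # int
print(infer_column_type(["12", "1.5", "1234"]))  # float
print(infer_column_type(["12", "1.5", "n/a"]))   # str
```

The third call shows why a representative sample matters: a single unparseable value is enough to turn the whole column into a String.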

When to use CSV schema inference

CSV schema inference is particularly useful in scenarios such as:

Ingesting evolving or unfamiliar CSV data sources.
Automating schema creation for new tables.
Working with CSV files where manual schema definition is impractical.

Always review a generated schema before using it. Inference works from a sample, so the inferred names and types may not meet your requirements for the full dataset.

Input requirements

To use CSV schema inference, ensure your CSV file meets the following criteria:

The file must have a header row with column names, or you must specify options for headerless files and provide column names.
The file must use a consistent delimiter (e.g., comma, tab, semicolon). The default is a comma (,).
The file must have a consistent number of columns in each row.
The file must be UTF-8 encoded.

Here's an example of a valid CSV file:

Location,First Name,ID
New York,Alice,12
New York,Bowen,123
Los Angeles,Carmelo,1234
Los Angeles,Darius,99998
Los Angeles,Elisha,99999
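
Before running inference, it can help to check a file against the requirements listed above. The following Python sketch is one way to do that. It is not part of Deephaven: the function name check_csv and the wording of its messages are illustrative, and it assumes the first row is the header.

```python
import csv
import io


def check_csv(raw: bytes, delimiter: str = ",") -> list:
    """Return a list of problems found; an empty list means none were found."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid UTF-8
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]  # assumption: the first row holds the column names
    if any(not name.strip() for name in header):
        problems.append("header row contains an empty column name")

    # Every data row must have as many fields as the header.
    # Rows are numbered from 1, with the header as row 1.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


sample = (
    b"Location,First Name,ID\n"
    b"New York,Alice,12\n"
    b"Los Angeles,Carmelo,1234\n"
)
print(check_csv(sample))                      # no problems: []
print(check_csv(sample + b"Boston,Frank\n"))  # reports the short fourth row
```

To check a file on disk, pass its contents as bytes, for example the result of reading the file opened in "rb" mode.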

GUI usage

CSV schema inference can be accessed through the Deephaven web interface. See the Schema Editor documentation for more information on how to do this.

Code Studio usage

You can use a Core+ Groovy Code Studio to create a schema from a CSV file.

The following example demonstrates how to use a CsvSchemaCreator builder to generate a schema.

It uses the namespace test and table name TestSchemaCreation.
It uses a source CSV file located at /tmp/TestSchemaCreation.csv. This CSV file must be accessible from the Deephaven server where the Code Studio is running.

If the example CSV file above is located at /tmp/TestSchemaCreation.csv on the Merge_1 node, you can run the following code in a Core+ Groovy Code Studio on the Merge_1 node:

import io.deephaven.importers.CsvSchemaCreator
import io.deephaven.importers.util.CasingStyle
import io.deephaven.importers.csv.CsvFormats

// Target namespace and table name for the generated schema
namespace = "test"
tableName = "TestSchemaCreation"
importSourceName = "ImportSource"

// Partitioning column of the table, and the CSV column that supplies its values
partitioningColumn = "Location"
sourcePartitioningColumn = "Location"

// The file must be readable from the server where this Code Studio is running
sourceFile = new File("/tmp/TestSchemaCreation.csv")

// Configure inference: default CSV format, comma delimiter, at most 10 rows sampled
schemaCreator = CsvSchemaCreator.builder()
  .namespace(namespace)
  .tableName(tableName)
  .sourceFile(sourceFile)
  .sourceName(importSourceName)
  .fileFormat(CsvFormats.DEFAULT)
  .bestFit(true)
  .delimiter(',' as char)
  .maxRows(10)
  .partitionColumn(partitioningColumn)
  .sourcePartitionColumn(sourcePartitioningColumn)
  .casingStyle(CasingStyle.UpperCamel)
  .build()

// Run inference and print the resulting schema for review
schemaString = schemaCreator.inferSchemaFromCsv()

println(schemaString)

The example prints the generated schema so that you can review it. Once you are satisfied with the schema, add it to Deephaven as you would any other schema.

More information on schemas is available in the schema documentation.

Builder options

The builder defines the behavior of the schema inference process. For example, skipHeaderLines(<number of lines>) can be used to skip non-header lines at the start of the file. For full documentation, see the API docs for the io.deephaven.importers.CsvSchemaCreator.Builder class.

Troubleshooting

Malformed CSV: Ensure your file is well-formed and uses a consistent delimiter.
Header issues: If your file has no header row, set the options for headerless files and provide the column names.
Inconsistent columns: Make sure all rows have the same number of columns.
Encoding issues: Ensure your file is UTF-8 encoded.
Unexpected results: Review your input file for consistency in structure and field names.
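
When the cause of a malformed-CSV error is unclear, a quick first step is to check which delimiter splits every line into the same number of fields. The Python sketch below does this by simple counting. The helper consistent_delimiters is hypothetical and not part of Deephaven, and the approach ignores quoted fields that contain the delimiter, so treat its result as a hint rather than a verdict.

```python
def consistent_delimiters(text: str, candidates=(",", "\t", ";", "|")) -> dict:
    """Map each candidate delimiter to its per-line count, keeping only
    candidates that occur the same non-zero number of times on every line."""
    lines = [line for line in text.splitlines() if line]  # skip blank lines
    result = {}
    for delimiter in candidates:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and 0 not in counts:
            result[delimiter] = counts.pop()
    return result


sample = "Location,First Name,ID\nNew York,Alice,12\nLos Angeles,Carmelo,1234\n"
print(consistent_delimiters(sample))          # {',': 2}
print(consistent_delimiters("a;b;c\n1;2\n"))  # {} (no consistent delimiter)
```

An empty result suggests either a delimiter outside the candidate list or rows with differing numbers of columns.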

Related documentation

Schemas: Deephaven schema management overview.
JSON Schema Inference
JDBC Schema Inference
XML Schema Inference
Avro & Protobuf Schema Inference