JSON Schema Inference

JSON is a hierarchical format representing data objects and arrays. In order to convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). The type inference then operates in the same way as it does on CSV data. Deephaven can generate schemas from JSON files, making it easy to ingest semi-structured data. This article explains how to infer schemas from JSON sources and what to consider when generating schemas from JSON data.

When importing data from JSON, Deephaven inspects the structure and data types in the sample file to infer column names and types. The schema generator will recursively analyze objects and arrays, mapping them to Deephaven-compatible types. For complex or nested data, Deephaven will attempt to flatten structures or use supported representations for arrays and objects.

When to use JSON schema inference

Ingesting semi-structured or evolving data from JSON files.
Automating schema creation for new data sources.
Working with nested or hierarchical data.

Input requirements

The input file should contain one or more top-level JSON objects (not a single array or comma-separated list).
Each object may be on its own line, or multiple objects may appear on a single line.
Example of a simple input file representing a list of objects:

{"a":1}
{"a":2}
{"a":3}

Nested objects and arrays are supported; Deephaven will flatten or enumerate them as columns.

Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.

Nested objects are recursively expanded, with the column name for each field derived from the full "path" to that field. The path separator can be specified with the --fieldPathSeparator argument. Array elements are numbered with a zero-based index.

For example, the following maps to a schema with the columns "a", "b_c", "b_d", "e0", "e1", "e2" and a row with values 1, 2, 3, 4, 5, 6:

{ "a": 1, "b": { "c": 2, "d": 3 }, "e": [4, 5, 6] }

Here is a more realistic example of a complex input file representing a list of objects with nested objects and arrays:

[
  {
    "order_id": 1001,
    "customer": {
      "customer_id": "C123",
      "name": "Alice Smith",
      "email": "alice.smith@example.com"
    },
    "order_date": "2024-05-01T14:30:00Z",
    "items": [
      {
        "product_id": "P001",
        "product_name": "Laptop",
        "quantity": 1,
        "price": 1200.0
      },
      {
        "product_id": "P002",
        "product_name": "Mouse",
        "quantity": 2,
        "price": 25.5
      }
    ],
    "total_amount": 1251.0,
    "status": "Shipped",
    "shipping_address": {
      "street": "123 Main St",
      "city": "Springfield",
      "state": "IL",
      "postal_code": "62701"
    }
  }
]

Example

This example creates JSONExampleNamespace.JSONExampleTableName.schema in the /tmp directory:

iris_exec json_schema_creator -- --namespace JSONExampleNamespace --tableName JSONExampleTableName --sourceFile /data/sample.json --schemaPath /tmp

Where:

--namespace: The namespace for the new schema.
--tableName: The name of the table to create.
--sourceFile: Path to your JSON file.
--schemaPath: Output directory for the generated schema file.

Command reference

iris_exec json_schema_creator <launch args> -- <schema creator args>

The JSON schema creator takes the following arguments:

Argument	Description
`-ns` or `--namespace <namespace>`	(Required) Namespace in which to find the target table.
`-tn` or `--tableName <name>`	(Required) Name of the target table.
`sp` or `--schemaPath <path>`	An optional path to which the schema file will be written. If not specified, this defaults to the current working directory and will create or use a subdirectory that matches the namespace.
`-sf` or `--sourceFile <exact file name>`	(Required) The name of the JSON file to read (required).
`-pc` or `--partitionColumn`	Optional name for the partitioning column if schema is being generated. If not provided, the importer will default to Date for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to "source_[original column name]".
`-gc` or `--groupingColumn`	Optional column name that should be marked as `columnType="Grouping"` in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.
`-spc` or `--sourcePartitionColumn`	Optional column to use for multi-partition imports. For example if the partitioning column is "Date" and you want to enable multi-partition imports based on the column source_date, specify "source_date" with this option (Note: This is the column name in the data source, not the Deephaven column name).
`-sn` or `--sourceName <name for ImportSource block>`	Optional name to use for the generated `ImportSource` block in the schema. If not provided, the default of "IrisJSON" will be used.
`-sl` or `--skipLines <integer value>`	Optional number of lines to skip from the beginning of the file before expecting the header row. If not provided, the first line is used as the header row.
`-lp` or `--logProgress`	If present, additional informational logging will be provided with progress updates during the parsing process.
`-bf` or `--bestFit`	If present, the class will attempt to use the smallest numeric types to fit the data. For example, for integer values, short will be used if all values are in the range of -32768 to 32767. If larger values are seen, the class will then move to int and eventually long. The default behavior, without `-bf`, is to use long for all integer columns and double for all floating point columns.
`-om` or `--outputMode`	Either `SAFE` (default) or `REPLACE`. When `SAFE`, the schema creator will exit with an error if an existing file would be overwritten by the schema being generated. When set to `REPLACE`, a pre-existing file will be overwritten.
`-fps` or `--fieldPathSeparator <separator>`	If present, specifies a character or string to use when generating column names from nested JSON data. Defaults to underscore (`_`). For example, if set to `--fieldPathSeparator .`, nested fields will produce column names like `customer.name` instead of `customer_name`.

Customizing nested field names with `--fieldPathSeparator`

By default, nested JSON fields are flattened using an underscore (_) as the separator. You can customize this with the --fieldPathSeparator argument:

iris_exec json_schema_creator -- --namespace JSONExampleNamespace --tableName JSONExampleTableName --sourceFile /data/sample.json --schemaPath /tmp --fieldPathSeparator .

This will produce columns such as customer.name, customer.email, etc., instead of customer_name, customer_email.

Troubleshooting

Malformed JSON: Ensure your file is valid JSON. Use a linter/validator if needed.
Missing columns: Make sure all objects have the expected fields or review the inferred schema.
Deeply nested data: Adjust the --fieldPathSeparator argument or preprocess your JSON.
Encoding issues: Ensure your file is UTF-8 encoded.
Unexpected results: Review your input file for consistency in structure and field names.