Importing data without schemas

Sometimes you do not want or need to go through the process of creating and uploading schemas for data that you want to work with. Deephaven allows you to read data directly using industry-standard formats and interfaces such as CSV, JDBC, and Parquet. For CSV, JSON, and XML files and for JDBC query results, Deephaven parses the data and stores the result in the query worker's memory, enabling efficient queries on the ephemeral data. Because Parquet is a columnar storage format, Deephaven can efficiently process the data in place, without needing to read the entire file into memory at once.

Once loaded into memory, these datasets can be saved to a Deephaven User Table, which can then be accessed directly using the standard Deephaven APIs.

Note

For CSV, JDBC, JSON, and XML formats, importing files containing more than 50,000 rows is not recommended.

Datasets that are read directly into memory are ephemeral and will not be persisted after closing the Query or Console into which the data was loaded.

Importing data from CSV

There are two ways to import data from CSV. The first, and simplest, method is to upload the file using the command menu in the Web UI, or the Upload Delimited File action in the Swing UI. The other method is to use Deephaven's readCsv command or one of its variants.

readCsv

readCsv directly reads a file on the query server into a table object in the console. For example:


myTable = readCsv("path_to_file")

from deephaven.TableTools import readCsv

myTable = readCsv("path_to_file")

readCsv has an optional second parameter that can specify an alternate format or a field delimiter. Formatting options include the following:

  • TRIM - This format ignores leading or trailing spaces around a value that are not inside double quotes. When a second parameter to readCsv is not specified, the TRIM format is used.
  • DEFAULT - Apache Commons CSV default format.
  • EXCEL - Microsoft Excel CSV format. (Note: Excel CSV files can usually be imported with the TRIM or DEFAULT format.)
  • MYSQL - MySQL CSV format.
  • RFC4180 - IETF RFC 4180 MIME text/csv format.
  • \t - Tab-delimited format.

For example, the following can be used to import a tab-delimited text file:

// default case - assumes comma delimited and trims spaces not contained in quotes
myTable1 = readCsv("path_to_file")  // equivalent to readCsv("path_to_file", "TRIM")

// tab-delimited file
myTable2 = readCsv("path_to_file", "\t")

// user specified delimiter - for example ";"
myTable3 = readCsv("path_to_file", ";")
from deephaven import *

# default case - assumes comma delimited and trims spaces not contained in quotes
myTable1 = ttools.readCsv("path_to_file")  # equivalent to readCsv("path_to_file", "TRIM")

# tab-delimited file
myTable2 = ttools.readCsv("path_to_file", "\t")

# user specified delimiter - for example ";"
myTable3 = ttools.readCsv("path_to_file", ";")

Possibilities for a different field delimiter include semicolons (;), colons (:), pipes (|), and spaces. Any single character can be specified.

readHeaderlessCsv

The readHeaderlessCsv method can read a CSV file that does not have column headers. It will either generate column names (Col1, Col2, etc.) or use a provided list of column names.

The syntax follows:

  • readHeaderlessCsv(filePath, header)

The filePath argument must be the full path to the file; optionally, the user may provide column names as strings in the header parameter. Column data types are inferred from the data.
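
For example, assuming a headerless, three-column CSV file on the query server; the path and column names below are illustrative, and the exact overloads are described in the Javadoc:

// column names are generated (Col1, Col2, etc.) when only the path is given
myTable1 = readHeaderlessCsv("path_to_file")

// supplying column names for a three-column file
myTable2 = readHeaderlessCsv("path_to_file", "Date", "Sym", "Price")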

Note

See the Javadoc or PyDocs for full details and other options.

Importing data from JDBC

readJdbc

readJdbc reads directly from a JDBC query result into an in-memory table. The metadata provided by the JDBC ResultSet is used to generate Deephaven columns of the most appropriate type. There are two variants of this function:

To execute a query against a given JDBC data source and read the result:

myTable = readJdbc(<JDBC driver name>, <JDBC URL>, [<user>, <password>], <query>, [<options>])
from deephaven import *

myTable = JdbcHelpers.readJdbc(<JDBC driver name>, <JDBC URL>, [<user>, <password>], <query>, [<options>])

The user and password parameters are optional (some drivers take user and password in the URL), as is the options argument.
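
For example, if the driver accepts credentials in the JDBC URL, the user and password arguments can be omitted. The driver class and URL below are illustrative:

// credentials embedded in the JDBC URL, so the user and password arguments are omitted
myTable = readJdbc("com.mysql.jdbc.Driver",
                   "jdbc:mysql://dbserverhost/dbname?user=myuser&password=mypassword",
                   "SELECT * FROM table1")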

To read a ResultSet already generated by an executed query (note that this will consume the ResultSet):

myTable = readJdbc(<Result Set>, [<options>])
from deephaven import *

myTable = JdbcHelpers.readJdbc(<Result Set>, [<options>])
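
For example, a ResultSet can be produced with the standard JDBC API and handed to readJdbc. The connection details below are illustrative:

import java.sql.DriverManager

// open a connection and run a query with plain JDBC
conn = DriverManager.getConnection("jdbc:sqlserver://dbserverhost;database=dbname", "myuser", "mypassword")
stmt = conn.createStatement()
rs = stmt.executeQuery("SELECT * FROM table1")

// readJdbc consumes the ResultSet while building the in-memory table
myTable = readJdbc(rs)

// release the JDBC resources once the table has been read
rs.close()
stmt.close()
conn.close()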

The options may be used to specify the following:

  • columnNameFormat(<src format>, <target format>) - Expected source and target column name format. This may be used to systematically convert column names (e.g., "my_col_name" to "MyColName"). Default: none.
  • maxRows(<max rows>) - Maximum number of rows to read from the result set (typically a LIMIT or TOP clause in the SQL query is a better way to achieve this). Default: no limit.
  • strict(<strict>) - Whether to apply strict mode when mapping JDBC to Deephaven (throw an exception instead of truncating when out-of-range values are encountered). Default: true.
  • sourceTimeZone(<time zone>) - The time zone to use when interpreting JDBC dates and timestamps. Default: the data source time zone if possible, otherwise the local time zone.
  • arrayDelimiter(<array delimiter>) - The delimiter to use when interpreting JDBC strings as array values. Default: "," (comma).
  • columnTargetType(<column name>, <target type>) - The Deephaven type to use for the given column. Some SQL types can be mapped to multiple Deephaven types, in which case this option can be used to specify an alternative to the default mapping. Default: the default type mapping.

For example, the following can be used to import from a SQL Server data source:

newTable = readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    "jdbc:sqlserver://dbserverhost;database=dbname",
                    "myuser",
                    "mypassword",
                    "SELECT * FROM table1")
from deephaven import *

newTable = JdbcHelpers.readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                                "jdbc:sqlserver://dbserverhost;database=dbname",
                                "myuser",
                                "mypassword",
                                "SELECT * FROM table1")

The following would import from the same source, but map the column names from a "lower underscore" format to an "upper camel" format, and interpret SQL dates and timestamps as UTC. Note the use of the CasingStyle type to specify the target column name format.

import com.illumon.iris.utils.CasingStyle

newTable = readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    "jdbc:sqlserver://dbserverhost;database=dbname",
                    "myuser",
                    "mypassword",
                    "SELECT * FROM table1",
                    readJdbcOptions()
                        .sourceTimeZone("UTC")
                        .columnNameFormat(CasingStyle.UpperCamel, "_"))
from deephaven import *

newTable = JdbcHelpers.readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                                "jdbc:sqlserver://dbserverhost;database=dbname",
                                "myuser",
                                "mypassword",
                                "SELECT * FROM table1",
                                JdbcHelpers.readJdbcOptions()
                                    .sourceTimeZone("UTC")
                                    .columnNameFormat("UpperCamel", "_")
                                    .readJdbcOptions  # get the Java builder object from the Python object
                                )
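
Other options from the table above can be chained on the same builder. For example, the following sketch (the row limit is illustrative) caps the number of rows read and disables strict type mapping:

newTable = readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    "jdbc:sqlserver://dbserverhost;database=dbname",
                    "myuser",
                    "mypassword",
                    "SELECT * FROM table1",
                    readJdbcOptions()
                        .maxRows(1000)   // read at most 1,000 rows from the result set
                        .strict(false))  // truncate out-of-range values instead of throwing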

Importing data from Parquet

Data can be imported from Parquet files using the ParquetTools library. Deephaven currently supports only Single File table formats.

Single File Tables

Single files are typically named <something>.parquet. These standalone files can be made available in Deephaven with the ParquetTools.readTable method. This method takes one or two arguments and returns a Table object. The single-argument form takes the full path to a Parquet file. The two-argument version also accepts a ParquetInstructions object, which can be used to override defaults for things like the compression codec to use, or dictionary handling for object columns.

Note

Paths to Parquet files are explicitly paths on the server on which the Deephaven worker is running, and the account under which the worker is running must have read access to the file.


import com.illumon.iris.db.v2.locations.parquet.ParquetTools

p = ParquetTools.readTable("/parquet/sample_data.parquet")

import jpy

# get the ParquetTools Java class via jpy
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")

p = PT.readTable("/parquet/sample_data.parquet")

File sets

File Sets are groups of Parquet files, each containing part of the data for a table. There are two main types:

  • single-level / metadata-partitioned sets
  • multi-level / key-partitioned sets

Single-level file sets use multiple files, typically named something like part.<n>.parquet, where n indicates the table fragment. Each of the part files contains a subset of the table. Usually, data in the part files is sorted by some overall key, and there will be a metadata file accompanying the set that contains information about which key values are in each part file. Deephaven cannot currently read the table as a file set, but it can read individual part files from the set using the readTable() method.

import com.illumon.iris.db.v2.locations.parquet.ParquetTools

p = ParquetTools.readTable("/parquet/PQTest/fragmented_data/part.0.parquet")

import jpy

# get the ParquetTools Java class via jpy
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")

p = PT.readTable("/parquet/PQTest/fragmented_data/part.0.parquet")

Multi-level sets are groups of files organized in a hierarchical set of directories, where directory names provide column names and values for one or more levels of data partitioning. Deephaven cannot currently read the table as a file set, but it can read individual part files from the set using the readTable() method.

import "com.illumon.iris.db.v2.locations.parquet.ParquetTools"

p = ParquetTools.readTable("/parquet/PQTest/hierarchical_data/name=Section1/id_num=1/a7e04e4e5090417885a724ddb609151b.parquet")

import jpy

# get the ParquetTools Java class via jpy
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")

p = PT.readTable("/parquet/PQTest/hierarchical_data/name=Section1/id_num=1/a7e04e4e5090417885a724ddb609151b.parquet")

Working with Parquet tables

Tables created using ParquetTools are Deephaven tables backed directly by the Parquet file on disk. Initially, only some metadata is read into memory, so even if the file is large, opening it does not require much memory. The resultant table works like other Deephaven tables, and typical operations such as filtering, sorting, selecting, and aggregating can be performed on tables created with readTable().
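
For example, a Parquet-backed table can be filtered, sorted, and aggregated like any other table; the column names below are hypothetical:

import com.illumon.iris.db.v2.locations.parquet.ParquetTools

p = ParquetTools.readTable("/parquet/sample_data.parquet")

// standard table operations work on the Parquet-backed table
filtered = p.where("Price > 100")
ordered = p.sort("Date")
totals = p.view("Sym", "Price").sumBy("Sym")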

Saving Datasets as User Tables

Data imported into memory can be written back into the database as a User table.
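
For example, a minimal sketch, assuming the db.addTable method used for writing user tables; the namespace and table name below are illustrative:

// read a CSV into memory, then persist it as a user table in the database
myTable = readCsv("path_to_file")
db.addTable("ExampleNamespace", "MyCsvData", myTable)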