Importing data without schemas
Sometimes you do not want or need to go through the process of creating and uploading schemas for data that you want to work with. Deephaven allows you to read data directly from industry-standard sources and formats such as CSV, JDBC, JSON, XML, Parquet, and Apache Arrow. For CSV, JDBC, JSON, and XML, Deephaven parses the source and stores the result in the query worker's memory, enabling efficient queries on the ephemeral data. Because Parquet is a columnar storage format, Deephaven can process the data efficiently in place, without needing to read the entire file into memory at once.
Once loaded into memory, these datasets can be saved to a Deephaven User Table, which can then be accessed directly using the standard Deephaven APIs.
Note
For CSV, JDBC, JSON, and XML formats, importing files containing more than 50,000 rows is not recommended.
Datasets that are read directly into memory are ephemeral and will not be persisted after closing the Query or Console into which the data was loaded.
Importing data from CSV
There are two ways to import data from CSV. The first and simplest method is to upload the file using the command menu from the Web UI, or the Upload Delimited File action in the Swing UI. The other method is to use Deephaven's readCsv command or one of its variants.
readCsv
readCsv directly reads a file on the query server to a table object in the console. For example:
myTable = readCsv("path_to_file")
from deephaven.TableTools import readCsv
myTable = readCsv("path_to_file")
readCsv has an optional second parameter that can specify an alternate format or a field delimiter. Formatting options include the following:
- TRIM - This format ignores leading or trailing spaces around a value that are not inside double quotes. When a second parameter to readCsv is not specified, the TRIM format is used.
- DEFAULT - Apache Commons CSV default format.
- EXCEL - Microsoft Excel CSV format. (Note: Excel CSV files can usually be imported with the TRIM or DEFAULT format.)
- MYSQL - MySQL CSV format.
- RFC4180 - IETF RFC 4180 MIME text/csv format.
- \t - Tab-delimited format.
For example, the following can be used to import comma-delimited, tab-delimited, or otherwise-delimited text files:
// default case - assumes comma delimited and trims spaces not contained in quotes
myTable1 = readCsv("path_to_file") // equivalent to readCsv("path_to_file", "TRIM")
// tab-delimited file
myTable2 = readCsv("path_to_file", "\t")
// user specified delimiter - for example ";"
myTable3 = readCsv("path_to_file", ";")
from deephaven import *
# default case - assumes comma delimited and trims spaces not contained in quotes
myTable1 = ttools.readCsv("path_to_file") # equivalent to readCsv("path_to_file", "TRIM")
# tab-delimited file
myTable2 = ttools.readCsv("path_to_file", "\t")
# user specified delimiter - for example ";"
myTable3 = ttools.readCsv("path_to_file", ";")
Possibilities for a different field delimiter include semicolons (;), colons (:), pipes (|), and spaces. Any single character can be specified.
readHeaderlessCsv
The readHeaderlessCsv method can read a CSV file that does not have column headers. It will either generate column names (Col1, Col2, etc.) or use a provided list of column names.
The syntax follows:
readHeaderlessCsv(filePath, header)
The filePath argument must be the full path to the file; optionally, the user may provide column names as strings in the header parameter. Column data types are inferred from the data.
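For example (a minimal sketch; the file path and the column names used here are placeholders, and the exact form in which the names are passed may vary by version):
// generated column names: Col1, Col2, ...
myTable1 = readHeaderlessCsv("path_to_file")
// caller-supplied column names; data types are still inferred from the data
myTable2 = readHeaderlessCsv("path_to_file", "Date", "Symbol", "Price")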
Importing data from JDBC
readJdbc
readJdbc reads directly from a JDBC query result into an in-memory table. The metadata provided by the JDBC ResultSet is used to generate Deephaven columns of the most appropriate type. There are two variants of this function:
To execute a query against a given JDBC data source and read the result:
myTable = readJdbc(<JDBC driver name>, <JDBC URL>, [<user>, <password>], <query>, [<options>])
from deephaven import *
myTable = JdbcHelpers.readJdbc(<JDBC driver name>, <JDBC URL>, [<user>, <password>], <query>, [<options>])
The user and password parameters are optional (some drivers take user and password in the URL), as is the options argument.
To read a ResultSet already generated by an executed query (this will consume the ResultSet):
myTable = readJdbc(<Result Set>, [<options>])
from deephaven import *
myTable = JdbcHelpers.readJdbc(<Result Set>, [<options>])
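For example, a ResultSet produced with standard JDBC calls can be passed to readJdbc directly. The following is a minimal Groovy sketch; it assumes the JDBC driver is already available on the worker's classpath, and the URL, credentials, and query are placeholders:
import java.sql.DriverManager

// run a query with plain JDBC to obtain a ResultSet
conn = DriverManager.getConnection("jdbc:sqlserver://dbserverhost;database=dbname", "myuser", "mypassword")
stmt = conn.createStatement()
rs = stmt.executeQuery("SELECT * FROM table1")

// readJdbc consumes the ResultSet and returns an in-memory Deephaven table
myTable = readJdbc(rs)

// close JDBC resources once the table has been created
rs.close()
stmt.close()
conn.close()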
The options may be used to specify the following:
Parameter Value | Description | Default |
---|---|---|
columnNameFormat(<src format>, <target format>) | Expected source and target column name format. This may be used to systematically convert column names (e.g., "my_col_name" to "MyColName"). | None |
maxRows(<maxRows>) | Maximum number of rows to read from the result set (typically a LIMIT or TOP clause in the SQL query is a better way to achieve this). | No Limit |
strict(<strict>) | Whether to apply strict mode when mapping JDBC to Deephaven (throw an exception instead of truncating when out of range values are encountered). | true |
sourceTimeZone(<time zone>) | Specify the time zone to use when interpreting JDBC dates and timestamps. | Data source time zone if possible, otherwise local. |
arrayDelimiter(<array delimiter>) | The delimiter to use when interpreting JDBC strings as array values. | "," (comma) |
columnTargetType(<column name>,<target type>) | The Deephaven type to use for the given column. Some SQL types can be mapped to multiple Deephaven types, in which case this option can be used to specify an alternative to the default. | Default |
For example, the following can be used to import from a SQL Server data source:
newTable = readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
"jdbc:sqlserver://dbserverhost;database=dbname",
"myuser",
"mypassword",
"SELECT * FROM table1")
from deephaven import *
newTable = JdbcHelpers.readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
"jdbc:sqlserver://dbserverhost;database=dbname",
"myuser",
"mypassword",
"SELECT * FROM table1")
The following would import from the same source, but map the column names from a "lower underscore" format to "upper camel" format, and interpret SQL dates and timestamps as UTC. Note the use of the CasingStyle type.
import com.illumon.iris.utils.CasingStyle
newTable = readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
"jdbc:sqlserver://dbserverhost;database=dbname",
"myuser",
"mypassword",
"SELECT * FROM table1",
readJdbcOptions()
.sourceTimeZone("UTC")
.columnNameFormat(CasingStyle.UpperCamel, "_"))
from deephaven import *
newTable = JdbcHelpers.readJdbc("com.microsoft.sqlserver.jdbc.SQLServerDriver",
"jdbc:sqlserver://dbserverhost;database=dbname",
"myuser",
"mypassword",
"SELECT * FROM table1",
JdbcHelpers.readJdbcOptions()
.sourceTimeZone("UTC")
.columnNameFormat("UpperCamel", "_")
.readJdbcOptions # get the Java builder object from the Python object
)
Importing data from Parquet
Data can be imported from Parquet files using the ParquetTools library. Deephaven currently supports only Single File table formats.
Single File Tables
Single files are typically named <something>.parquet. These standalone files can be made available in Deephaven with the ParquetTools.readTable(String) method. This method takes one or two arguments and returns a Table object. The single-argument form takes the full path to a Parquet file. The two-argument version allows specification of a ParquetInstructions object, which can be used to override defaults such as the codec to use or dictionary handling for object columns.
Note
Paths to Parquet files are explicitly paths on the server on which the Deephaven worker is running, and the account under which the worker is running must have read access to the file.
import com.illumon.iris.db.v2.locations.parquet.ParquetTools
p = ParquetTools.readTable("/parquet/sample_data.parquet")
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")
p = PT.readTable("/parquet/sample_data.parquet")
File sets
File Sets are groups of Parquet files, each containing part of the data for a table. There are two main types:
- single-level / metadata-partitioned sets
- multi-level / key-partitioned sets
Single-level file sets use multiple files, typically named something like part.<n>.parquet, where n indicates the table fragment. Each part file contains a subset of the table. Usually, data in the part files is sorted by some overall key, and there will be a metadata file accompanying the set that contains information about which key values are in each part file. Deephaven cannot currently read the table as a file set, but it can read individual part files from the set using the readTable() method.
import com.illumon.iris.db.v2.locations.parquet.ParquetTools
p = ParquetTools.readTable("/parquet/PQTest/fragmented_data/part.0.parquet")
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")
p = PT.readTable("/parquet/PQTest/fragmented_data/part.0.parquet")
Multi-level sets are groups of files organized in a hierarchical set of directories, where directory names provide column names and values for one or more levels of data partitioning. Deephaven cannot currently read the table as a file set, but it can read individual part files from the set using the readTable() method.
import "com.illumon.iris.db.v2.locations.parquet.ParquetTools"
p = ParquetTools.readTable("/parquet/PQTest/hierarchical_data/name=Section1/id_num=1/a7e04e4e5090417885a724ddb609151b.parquet")
PT = jpy.get_type("com.illumon.iris.db.v2.locations.parquet.ParquetTools")
p = PT.readTable("/parquet/PQTest/hierarchical_data/name=Section1/id_num=1/a7e04e4e5090417885a724ddb609151b.parquet")
Working with Parquet tables
Tables created using ParquetTools are Deephaven tables backed directly by the Parquet file on disk. Initially, only some metadata is read into memory, so even if the file is large, opening it does not require much memory. The resultant table works like other Deephaven tables, and typical Deephaven table operations can be applied to it, as sketched below.
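For instance, standard filtering, selection, and aggregation methods work on a Parquet-backed table just as they do on any other table. In the sketch below the column names Price, Size, and Symbol are hypothetical; substitute columns that exist in your file:
import com.illumon.iris.db.v2.locations.parquet.ParquetTools

// open a Parquet file as a Deephaven table (only metadata is read up front)
p = ParquetTools.readTable("/parquet/sample_data.parquet")

// filter rows and derive a new column with standard Table operations
filtered = p.where("Price > 100").updateView("Notional = Price * Size")

// count rows per Symbol
counts = p.countBy("RowCount", "Symbol")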
Saving Datasets as User Tables
Data imported into memory can be written back into the database as a User table.
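The following is a minimal Groovy sketch, assuming the standard db session object is available in the query and that the account has permission to write to the target namespace; the namespace and table name are placeholders, and the exact User table API may differ by installation:
// read an ephemeral in-memory table, then persist it as an unpartitioned User table
myTable = readCsv("path_to_file")
db.addTable("ExampleNamespace", "MyCsvImport", myTable)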
Importing data from Apache Arrow
Tables can be imported from Apache Arrow files using the ArrowTools package. There is a single method, Table readArrow(String path), that reads the Arrow file into an in-memory Deephaven table, which can then be used like any other Deephaven table. The table schema is inferred from the Arrow schema using the following mappings:
Arrow Type | Deephaven Type |
---|---|
Bit | Boolean |
Uint1 | short |
Uint2 | char |
Uint4 | long |
Uint8 | BigInteger |
TinyInt | byte |
SmallInt | short |
Int | int |
BigInt | long |
VarChar | String |
FixedBinary | byte[] |
VarBinary | byte[] |
Float4 | float |
Float8 | double |
Decimal | BigDecimal |
TimeMilli | LocalTime |
TimeStamp | DBDateTime |
DateMilli | LocalDateTime |
import com.illumon.iris.db.arrow.ArrowTools
myTable = ArrowTools.readArrow("/Path/to/my/file.arrow")