S3 Table Storage
Amazon S3 (Simple Storage Service) is a scalable object storage service provided by AWS. It is commonly used for storing and retrieving large amounts of data, making it a suitable backend for Deephaven's table storage. This document outlines the steps required to configure and use S3 as a table storage backend for Deephaven, leveraging the Goofys tool to mount S3 buckets as a local filesystem on an AWS EC2 instance.
Warning
S3 performance may be significantly slower than other table storage backends. See the Performance section below for more information.
Configuration
To configure S3 as a table storage backend for Deephaven, you need to set up and mount the S3 bucket using Goofys. This section provides step-by-step instructions on installing and configuring Goofys on an AWS EC2 instance, ensuring that your S3 storage is properly mounted and accessible for Deephaven to use as a backend. Follow the steps below to get started.
Goofys
Goofys mounts an S3 store and exposes it as a mounted user-space filesystem on an AWS EC2 host. Follow the instructions below to install Goofys on your EC2 Linux host.
Install FUSE:
sudo dnf install fuse
Confirm that the fuse libs are installed:
sudo dnf list installed | grep fuse
fuse-overlayfs.x86_64 0.7.2-6.el7_8 @extras
fuse3-libs.x86_64 3.6.1-4.el7 @extras
Download the Goofys binary and make it executable:
wget https://github.com/kahing/goofys/releases/latest/download/goofys
chmod +x goofys
Confirm that the Goofys binary runs:
./goofys -v
Goofys mounts as a specific user and group, and these cannot be changed once mounted, because changing file mode, owner, and group are POSIX behaviors that Goofys does not support. To find the UID and GID for the dbquery user, use the following command:
grep dbq /etc/passwd
This will output something similar to:
dbquery:x:9001:9003:Deephaven Data Labs dbquery Account:/db/TempFiles//dbquery:/bin/bash
The UID is 9001 and the GID is 9003.
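If you prefer not to read the fields by eye, the UID and GID are the third and fourth colon-separated fields of the passwd entry, so awk can pull them out directly. A small sketch using the sample entry above:

```shell
# Extract the UID (field 3) and GID (field 4) from a passwd-format entry.
# The entry below is the sample output shown above; in practice, pipe
# `grep dbq /etc/passwd` (or `getent passwd dbquery`) into awk instead.
entry='dbquery:x:9001:9003:Deephaven Data Labs dbquery Account:/db/TempFiles//dbquery:/bin/bash'
uid=$(printf '%s\n' "$entry" | awk -F: '{print $3}')
gid=$(printf '%s\n' "$entry" | awk -F: '{print $4}')
echo "uid=$uid gid=$gid"
# prints "uid=9001 gid=9003"
```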
To mount the S3 store as a normal user for testing, create a directory that will be backed by S3 and mount it as the dbquery user:
mkdir /var/tmp/s3-share-goofy
./goofys --uid 9001 --gid 9003 --debug_fuse --debug_s3 my-s3-project /var/tmp/s3-share-goofy
To create a permanent mount, see the Goofys Documentation for details on adding the mount to /etc/fstab.
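As a rough sketch, such an fstab entry follows the format shown in the Goofys README; the bucket name, mount point, and UID/GID below match the testing example above and should be adjusted for your environment:

```
goofys#my-s3-project   /var/tmp/s3-share-goofy   fuse   _netdev,allow_other,--uid=9001,--gid=9003,--file-mode=0644   0   0
```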
Deephaven configuration
To use the Goofys mount as a table storage backend, see the Filesystem data layout page for details on linking the storage into the Database root directory.
Performance
Tests were conducted using Goofys to mount an S3 store and expose it as a user-space filesystem on an AWS EC2 host, comparing it to a similar NFS-mounted store. As expected, queries run on S3/Goofys data took longer than their NFS counterparts.
The chart below summarizes the performance of queries run on data exposed via Goofys, relative to similar queries on data exposed via NFS. The REL_TO_NFS column is a multiplier that shows how Goofys performance compares to its NFS counterpart, with the NFS value always being 1. For instance, the Goofys query on Deephaven V1 format data has a relative value of 1.92, indicating it took nearly twice as long. Generally, queries on Goofys took close to twice as long as those on NFS. Increasing the data buffer size to 4M and the console heap to 8G widened the disparity, with Goofys queries running approximately 2.5 times longer than their NFS counterparts.
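The REL_TO_NFS multiplier is simply the Goofys elapsed time divided by the NFS elapsed time for the same query. A trivial illustration of the arithmetic, using made-up timings of 100s on NFS and 192s on Goofys:

```shell
# REL_TO_NFS = Goofys elapsed time / NFS elapsed time (NFS is always 1.00).
# The timings here are invented purely to illustrate the arithmetic.
nfs_seconds=100
goofys_seconds=192
awk -v g="$goofys_seconds" -v n="$nfs_seconds" \
    'BEGIN { printf "REL_TO_NFS=%.2f\n", g / n }'
# prints "REL_TO_NFS=1.92"
```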
The queries used in the test follow.
Note
These queries are from an older version of Deephaven and may not run in the current version without modification.
Query 1
import com.illumon.iris.db.tables.utils.*
import com.illumon.iris.db.tables.select.QueryScope

ojt = io.deephaven.OuterJoinTools

run_bar_creation = {String namespace, String date, int interval ->
    QueryScope.addParam("interval", interval)
    quote_bars = db.t("${namespace}", "EquityQuoteL1")
        .where("Date=`${date}`")
        .updateView("Timestamp = DBTimeUtils.lowerBin(Timestamp, interval * SECOND)")
        .dropColumns("MarketTimestamp","ServerTimestamp","InternalCode","LocalCodeMarketId","TradingStatus")
        .avgBy("Date","LocalCodeStr","Timestamp")
    trade_bars = db.t("${namespace}", "EquityTradeL1")
        .where("Date=`${date}`")
        .updateView("Timestamp = DBTimeUtils.lowerBin(Timestamp, interval * SECOND)")
        .dropColumns("MarketTimestamp","ServerTimestamp","InternalCode","LocalCodeMarketId","MarketId","TradingStatus")
        .avgBy("Date","LocalCodeStr","Timestamp")
    bars = ojt.fullOuterJoin(trade_bars, quote_bars, "Date,LocalCodeStr,Timestamp")
    bars = bars.select()
    int sz = bars.size()
    quote_bars.close()
    trade_bars.close()
    bars.close()
    return sz
}

// Use one namespace per run: "FeedOS", "FeedOS_S3", "FeedOSPQ", or "FeedOSPQ_S3"
ns = "FeedOS"
dates = ['2022-05-13', '2022-05-16', '2022-05-17', '2022-05-18', '2022-05-19']
dates.each { date ->
    long start = System.currentTimeMillis()
    int sz = run_bar_creation.call("${ns}", date, 60)
    long t = System.currentTimeMillis() - start
    println "${date}: sumBy done in ${t / 1000}s"
}
Query 2
import com.illumon.iris.db.tables.utils.*

// Use one namespace per run: "FeedOSPQ" or "FeedOSPQ_S3"
ns = "FeedOSPQ"
dates = ['2022-05-13', '2022-05-16', '2022-05-17', '2022-05-18', '2022-05-19']
dates.each { date ->
    long start = System.currentTimeMillis()
    sumTable = db.t("${ns}", "EquityQuoteL1").where("Date = `${date}`")
        .view("LocalCodeStr", "LocalCodeMarketId", "BidSize")
        .sumBy("LocalCodeStr", "LocalCodeMarketId")
    long sz = sumTable.size()
    long t = System.currentTimeMillis() - start
    println "${date}: ${sz} rows in ${t / 1000}s"
}