Data sets are bigger today than ever, often reaching billions of rows. Displaying them in a browser lets many users examine large data sets without needing to install an application. Browsers are pretty powerful, but unfortunately, we can’t put a billion rows of data in the browser's Document Object Model (DOM) without performance taking a massive hit.
Ideally, users could explore and play with data of any size, transform it at will, and return to it as needed. This is possible when using smaller data frames in familiar tools like Jupyter, but the experience breaks down in typical data grid rendering solutions once the data surpasses a few million rows. In this post, we focus on the challenges of rendering data; in another post, we'll focus on the pipeline for transporting that data from the server.
We identify three challenges developers face when attempting to render large data sets:
- Lag observed between scroll actions and DOM updates is disruptive.
- Maximum size restrictions prevent users from seeing the bottom of a table.
- Technical limitations cause the cursor to jump to the wrong location.
So, how do we overcome these issues? Is it possible to display and interact with a billion rows of data efficiently in the browser? Can we aim even higher, and display a quadrillion rows of data? Let's explore some of the problems observed with DOM-based data grids, and how we can work around them using a canvas-based solution such as the @deephaven/grid package available on npm.
Problem 1: Lag observed between scroll actions and DOM updates
First, let’s try to reduce the size of the DOM by displaying only the data that is in the current viewport. If you can only see 20 rows of data in your viewport, you only need to add 20 rows to the DOM. One way to do this is to create a large scrollable element the size of the full table, then display and position only the cells that are visible in the current viewport. This certainly improves performance, but you need to update which cells are visible while the user is scrolling. To do this, you need to add a scroll listener, which must be a passive listener to prevent scroll jank.
With passive listeners, the browser updates the scroll position of the element without waiting for our listener to finish, so we cannot reposition the elements before the browser renders the new scroll position. As a result, cells pop into view as our updates “catch up” to the new scroll position.
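The windowing step described above can be sketched as a small pure function. This is only an illustration under simple assumptions (fixed row height, rows only, no columns); the names `getVisibleRange` and `VisibleRange` are hypothetical and not taken from any library.

```typescript
// Hypothetical helper: which rows need DOM nodes for the current scroll position?
interface VisibleRange {
  firstRow: number; // index of the first row that is at least partially visible
  lastRow: number;  // index of the last row that is at least partially visible (inclusive)
}

function getVisibleRange(
  scrollTop: number,      // current scroll offset of the scrollable element, in pixels
  viewportHeight: number, // visible height of the scrollable element, in pixels
  rowHeight: number,      // assumed fixed height of every row, in pixels
  rowCount: number        // total number of rows in the table
): VisibleRange {
  const firstRow = Math.max(0, Math.floor(scrollTop / rowHeight));
  // Last row whose top edge lies above the bottom of the viewport.
  const lastVisible = Math.ceil((scrollTop + viewportHeight) / rowHeight) - 1;
  const lastRow = Math.min(rowCount - 1, lastVisible);
  return { firstRow, lastRow };
}
```

A scroll handler would call this with the element's `scrollTop` and rebuild only the rows in the returned range. Because the handler runs after the browser has already scrolled, the rows are repositioned one step behind the scroll position, which is the lag described above.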
Problem 2: Size restrictions prevent scrolling to the bottom of the table
Even if we are willing to accept a delay between the scroll action and the data being positioned correctly, things fall apart once we get into the millions of rows. Depending on the browser, the maximum size of an element is limited to around 33,554,400 pixels. So if we try to display a billion rows, we get stopped when we try to scroll to the bottom. We can’t even see four million rows.
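To put numbers on that limit, here is a quick back-of-the-envelope calculation. The 20-pixel row height used in the note below is an assumption for illustration, and the limit is the approximate figure quoted above rather than a guaranteed value for any particular browser.

```typescript
// Approximate maximum element size quoted above; the real value varies by browser.
const MAX_ELEMENT_HEIGHT_PX = 33554400;

// How many fixed-height rows fit in an element of the maximum size?
function maxScrollableRows(
  rowHeight: number,
  maxHeightPx: number = MAX_ELEMENT_HEIGHT_PX
): number {
  return Math.floor(maxHeightPx / rowHeight);
}

// How tall would the scrollable element need to be to hold every row?
function requiredHeightPx(rowCount: number, rowHeight: number): number {
  return rowCount * rowHeight;
}
```

With an assumed row height of 20 pixels, `maxScrollableRows(20)` comes to 1,677,720 rows. Four million rows would need 80,000,000 pixels and a billion rows would need 20,000,000,000 pixels, both far beyond the limit.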
Problem 3: Cursor jumps to the wrong location
Perhaps we can work around this limitation: what if we limit ourselves to a maximum size that is well within the limitations of the browser, and somehow virtualize the scrolling? To do that, we need to track whether the user is smoothly scrolling the viewport using the mouse wheel or a touchpad gesture, or dragging the scroll bar to jump the viewport to another part of the table.
Aside from the technical limitation that we cannot programmatically determine whether a scroll action was triggered by the mouse wheel or by dragging the scroll bar, this heuristic drifts: if you scroll smoothly with the wheel for long enough, the scroll bar ends up in the wrong position relative to where you actually are in the list. Then if you try to drag the bar, you jump to somewhere else in the data:
Now if we drag the scrollbar, it jumps trillions of rows; if we wheel back to the top, it gets stuck. We could spend a lot of time trying to refine this heuristic, but it will never feel natural. And we still haven’t addressed the DOM lag problem. These problems are apparent in many other grid libraries that claim to work with “big data”.
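The drift is easier to see in a small simulation. The sketch below is one hypothetical way such a heuristic might be written, not the code of any particular grid library: every name in it is made up, and the 200-pixel threshold and 10,000,000-pixel scroll range are arbitrary assumptions. Small scroll deltas are treated as wheel movement and converted to rows by row height; large deltas are treated as a scroll bar drag and mapped proportionally onto the whole table.

```typescript
// Hypothetical configuration; none of these names come from a real library.
interface ScrollConfig {
  rowCount: number;        // total rows in the table
  rowHeight: number;       // pixels per row
  maxScrollTop: number;    // capped scrollable range in pixels, well under the browser limit
  jumpThresholdPx: number; // assumed cutoff: larger deltas are treated as a scroll bar drag
}

interface ScrollState {
  topRow: number;        // row we believe is at the top of the viewport
  lastScrollTop: number; // scrollTop seen on the previous scroll event
}

// Row that the scroll bar position implies when pixels are scaled proportionally to rows.
function rowForScrollTop(scrollTop: number, config: ScrollConfig): number {
  return Math.floor((scrollTop / config.maxScrollTop) * (config.rowCount - 1));
}

function onScroll(state: ScrollState, scrollTop: number, config: ScrollConfig): ScrollState {
  const delta = scrollTop - state.lastScrollTop;
  let topRow: number;
  if (Math.abs(delta) > config.jumpThresholdPx) {
    // Large change: guess that the user dragged the scroll bar, so jump proportionally.
    topRow = rowForScrollTop(scrollTop, config);
  } else {
    // Small change: guess that the user used the wheel, so move by whole rows.
    const moved = Math.round(state.topRow + delta / config.rowHeight);
    topRow = Math.min(config.rowCount - 1, Math.max(0, moved));
  }
  return { topRow, lastScrollTop: scrollTop };
}

// Simulate 100 wheel events of 100 pixels each on a quadrillion-row table.
const scrollConfig: ScrollConfig = {
  rowCount: 1e15,
  rowHeight: 20,
  maxScrollTop: 10000000,
  jumpThresholdPx: 200,
};
let scrollState: ScrollState = { topRow: 0, lastScrollTop: 0 };
for (let scrollTop = 100; scrollTop <= 10000; scrollTop += 100) {
  scrollState = onScroll(scrollState, scrollTop, scrollConfig);
}
const rowAfterWheel = scrollState.topRow;                     // 500 rows down
const rowImpliedByBar = rowForScrollTop(10000, scrollConfig); // on the order of a trillion

// A drag of only 500 pixels is now interpreted as an absolute position.
scrollState = onScroll(scrollState, 10500, scrollConfig);
const rowAfterDrag = scrollState.topRow;
```

After the wheel events we are at row 500, but the scroll bar sits at a position that corresponds to a row on the order of a trillion. The first drag snaps to the bar's idea of where we are. Wheeling back afterwards only subtracts a few hundred rows before `scrollTop` reaches zero, which is why the grid appears stuck.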
What if we take an entirely different approach?
Canvas to the rescue!
What if we avoid the browser DOM limitations entirely, and instead render our grid using the canvas? It’s more work to keep track of everything, but then we can provide better interactivity with data rendering immediately (no DOM lag), and we just have one canvas element the size of our viewport. Let’s try it with a quadrillion rows:
Success! Data is immediately visible while scrolling, dragging the scroll bar works as expected, and we can easily view the quadrillionth row of data. By avoiding the limitations of the browser DOM, we’re able to display and interact with extremely large data sets without compromise. A quadrillion rows should be enough for anyone!
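To make the idea concrete, here is a minimal sketch of drawing only the visible rows onto a canvas. It is not the @deephaven/grid API; the names `DrawingSurface`, `GridState`, `drawRows`, and `scrollByRows` are hypothetical, and the sketch assumes a fixed row height and a single column of text. `DrawingSurface` is a small subset of the standard `CanvasRenderingContext2D` interface, so a real 2D context can be passed in directly.

```typescript
// Minimal subset of CanvasRenderingContext2D used by this sketch.
interface DrawingSurface {
  clearRect(x: number, y: number, w: number, h: number): void;
  fillText(text: string, x: number, y: number): void;
}

// Hypothetical grid state. We track the top row index instead of a pixel offset.
interface GridState {
  topRow: number;    // index of the first visible row
  rowCount: number;  // total number of rows
  rowHeight: number; // assumed fixed row height in pixels
  width: number;     // canvas width in pixels
  height: number;    // canvas height in pixels
}

// Draw the rows that fit in the canvas and return how many were drawn.
function drawRows(
  ctx: DrawingSurface,
  state: GridState,
  getText: (row: number) => string
): number {
  const { topRow, rowCount, rowHeight, width, height } = state;
  ctx.clearRect(0, 0, width, height);
  const visibleCount = Math.min(Math.ceil(height / rowHeight), rowCount - topRow);
  for (let i = 0; i < visibleCount; i += 1) {
    // y is relative to the canvas, so it stays small no matter how large topRow is.
    ctx.fillText(getText(topRow + i), 4, (i + 1) * rowHeight - 4);
  }
  return visibleCount;
}

// Move the viewport by a number of rows, clamped to the table bounds.
function scrollByRows(state: GridState, deltaRows: number): GridState {
  const fullyVisible = Math.floor(state.height / state.rowHeight);
  const maxTopRow = Math.max(0, state.rowCount - fullyVisible);
  const topRow = Math.min(maxTopRow, Math.max(0, state.topRow + deltaRows));
  return { ...state, topRow };
}
```

In a browser, `ctx` would be the non-null result of `canvas.getContext('2d')`, and wheel or scroll bar input would be translated into `scrollByRows` calls followed by a redraw. One design note on this sketch: a quadrillion (10^15) is below `Number.MAX_SAFE_INTEGER` (9,007,199,254,740,991), so row indices stay exact, whereas a pixel offset of 10^15 rows times 20 pixels would not. That is why the sketch stores a row index rather than a pixel position.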
Using canvas to render a large grid of data is not a unique idea. Most notably, fin-hypergrid is also a canvas-based solution, but it is no longer maintained.
Try the grid below
This canvas-based grid is available on npm under the package name @deephaven/grid (Apache-2.0 License, more documentation coming soon). We extend this package with filtering, sorting, grouping, and more in our Deephaven web console. You can try the Deephaven demo app online or try scrolling the example grid below. Hook it up with your data set and get exploring!
Postscript: issues with other solutions
As we looked at other solutions, here are some typical problems we encountered that we addressed in @deephaven/grid: