Import HTML files

While Deephaven does not have its own methods for reading HTML tables, it's easy to do with pandas or BeautifulSoup. This guide will demonstrate multiple ways to pull data from online HTML tables into Deephaven tables.

HTML table

We'll use this HTML table in the examples below:

Color       Shades
primary     #e5faff #d7f7ff #c8f4ff #b9f1ff #a9edff #99eaff #7fc6d9 #65a4b4 #4d8391 #35636f #1f454f
secondary   #6a81f2 #596fe9 #495ee0 #384cd6 #2439cb #0625c0 #0121aa #001e94 #001a7e #011569 #031155
negative    #f883a0 #f87694 #f76989 #f65a7d #f54a72 #f33666 #da2e59 #c1254c #a91d40 #911534 #7a0d28
positive    #9dfc7e #93fa72 #89f966 #7ef758 #72f549 #65f336 #58d92e #4cc025 #3fa71d #348f15 #28780d
warn        #ffe590 #ffe183 #fedd75 #fed966 #fdd455 #fdd041 #e8be38 #d4ac2f #c09b27 #ac8a1e #997915
info        #f08df9 #ee7ff8 #eb70f7 #e960f6 #e64df4 #e336f3 #cd34dd #b732c7 #a22fb2 #8d2c9d #792989
fg          #f3f7fa #eef2f5 #e9edf0 #e4e9ec #dfe4e7 #dadfe2 #c6cbcf #b2b8bc #9fa5aa #8c9298 #798086
bg          #434f56 #3c474d #353e44 #2f363c #282f33 #22272b #1f2327 #1b2023 #181c1f #15181b #121517

pandas.read_html

pandas.read_html is a convenient method for reading HTML tables into a list of pandas DataFrames. The method can be used to read tables from a URL or a local file.
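
For example, reading from a local file works the same way. Here is a minimal sketch, assuming an HTML file saved at /data/colors.html (a hypothetical path):

from deephaven import pandas as dhpd
import pandas as pd

# Read all tables found in a local HTML file (path is hypothetical)
local_tables = pd.read_html("/data/colors.html")

# Convert the first table on the page to a Deephaven table
colors = dhpd.to_table(local_tables[0])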

We'll start with an example that uses pandas.read_html to read an HTML table from a URL and convert it to a Deephaven table. This approach is simple, but it has limitations - for instance, it does not handle the non-numerical values in this table gracefully:

from deephaven import pandas as dhpd
from deephaven import merge

import os

os.system("pip install lxml")

import pandas as pd
import lxml

# Read every HTML table on the page into a pandas DataFrame,
# convert each DataFrame to a Deephaven table, then merge the results
deephaven_theme_colors = merge(
    [
        dhpd.to_table(df)
        for df in pd.read_html(
            "https://deephaven.io/core/docs/how-to-guides/data-import-export/html-import/"
        )
    ]
)

The sections below properly handle data types, leading to tables with actual data rather than just NaNs.

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files.
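
As a quick illustration of how BeautifulSoup parsing works, here is a minimal sketch that parses a small inline HTML string rather than a live page (the HTML below is made up for the example):

import os

os.system("pip install beautifulsoup4")
from bs4 import BeautifulSoup

# A tiny, made-up HTML table used only to demonstrate the parsing calls
html = "<table><tr><td>primary</td><td>#e5faff</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns all matches
cells = [td.text for td in soup.find("table").find_all("td")]
print(cells)  # ['primary', '#e5faff']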

In this example, we'll use BeautifulSoup to read the same HTML table as in the example above and convert it to a Deephaven table:

import os

os.system("pip install beautifulsoup4 requests lxml")
from deephaven.column import string_col
from deephaven import new_table
from bs4 import BeautifulSoup
import requests

url = "https://deephaven.io/core/docs/how-to-guides/data-import-export/html-import/"
headers = {"User-Agent": "Mozilla/5.0"}  # helps avoid being blocked

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

# Try to find the table
table = soup.find("table")  # you can add class_="..." if needed

# Extract the header cells (this table has only two: 'Color' and 'Shades')
column_headers = [th.text.strip() for th in table.find("thead").find_all("th")]

# Extract rows
rows = []
for tr in table.find("tbody").find_all("tr"):
    row = [td.text.strip() for td in tr.find_all("td")]
    if row:
        rows.append(row)

deephaven_theme_colors = new_table(
    [
        string_col("Color", [row[0] for row in rows]),
        string_col("Shades", [row[1] for row in rows]),
        string_col("Shades_1", [row[2] for row in rows]),
        string_col("Shades_2", [row[3] for row in rows]),
        string_col("Shades_3", [row[4] for row in rows]),
        string_col("Shades_4", [row[5] for row in rows]),
        string_col("Shades_5", [row[6] for row in rows]),
        string_col("Shades_6", [row[7] for row in rows]),
        string_col("Shades_7", [row[8] for row in rows]),
        string_col("Shades_8", [row[9] for row in rows]),
        string_col("Shades_9", [row[10] for row in rows]),
        string_col("Shades_10", [row[11] for row in rows]),
    ]
)

Data types

Since HTML tables store all data as plain text and have no concept of data types, some care must be taken when importing them into Deephaven to ensure that each column ends up with the correct data type. Deephaven's to_table method automatically infers types when infer_objects=True (the default), but manual specification is recommended to guarantee that the types are correct.

Whether you are using pandas or BeautifulSoup, you can specify the data type of each column either at the DataFrame stage, with pandas' astype method, or after the Deephaven table has been created, by typecasting columns with one of Deephaven's selection methods such as update.
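
As a minimal sketch of both routes, here is a toy DataFrame (made up for this example) in which every value arrives as text; one column is cast to an integer type with astype before conversion, and the same cast is done with update after conversion:

from deephaven import pandas as dhpd
import pandas as pd

# Toy scraped data: HTML always delivers plain text
raw = pd.DataFrame({"Name": ["a", "b"], "Count": ["1", "2"]})

# Route 1: cast at the DataFrame stage with astype, then convert
typed_in_pandas = dhpd.to_table(raw.astype({"Name": "string", "Count": "int64"}))

# Route 2: convert first, then typecast with Deephaven's update
typed_in_deephaven = dhpd.to_table(raw).update(["Count = Integer.parseInt(Count)"])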

Typing with selection methods

In this example, we use the same HTML table as in the examples above as a source. We read it with BeautifulSoup, build a pandas DataFrame, and convert that DataFrame to a Deephaven table with to_table. Finally, we restore the correct type for each column with Deephaven's update method:

from deephaven import pandas as dhpd
import pandas as pd
import os

os.system("pip install beautifulsoup4 requests")
os.system("pip install requests")
os.system("pip install lxml")
from bs4 import BeautifulSoup
import requests

url = "https://deephaven.io/core/docs/how-to-guides/data-import-export/html-import/"
headers = {"User-Agent": "Mozilla/5.0"}  # helps avoid being blocked

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

# Try to find the table
table = soup.find("table")  # you can add class_="..." if needed

# Get basic headers from thead (usually just 'Color' and 'Shades')
basic_headers = [th.text.strip() for th in table.find("thead").find_all("th")]

# Extract rows
rows = []
for tr in table.find("tbody").find_all("tr"):
    row = [td.text.strip() for td in tr.find_all("td")]
    if row:
        rows.append(row)

# Create expanded headers to match the actual number of columns
# First column keeps its name (usually 'Color')
headers = [basic_headers[0]]
# For remaining columns, use second header name with index suffix if needed
for i in range(1, len(rows[0])):
    if i == 1:
        headers.append(basic_headers[1])  # Usually 'Shades'
    else:
        headers.append(f"{basic_headers[1]}_{i-1}")  # 'Shades_1', 'Shades_2', etc.

df = pd.DataFrame(rows, columns=headers)

# infer_objects is set to `True` by default;
# in this case, all columns are inferred as Strings
deephaven_theme_colors = dhpd.to_table(df)

table_typed = deephaven_theme_colors.update(
    [
        "Color = String.valueOf(Color)",
        "Shades = String.valueOf(Shades)",
        "Shades_1 = String.valueOf(Shades_1)",
        "Shades_2 = String.valueOf(Shades_2)",
        "Shades_3 = String.valueOf(Shades_3)",
        "Shades_4 = String.valueOf(Shades_4)",
        "Shades_5 = String.valueOf(Shades_5)",
        "Shades_6 = String.valueOf(Shades_6)",
        "Shades_7 = String.valueOf(Shades_7)",
        "Shades_8 = String.valueOf(Shades_8)",
        "Shades_9 = String.valueOf(Shades_9)",
        "Shades_10 = String.valueOf(Shades_10)",
    ]
)

Note that the pandas DataFrame.astype method can also be used to restore typing. However, it does not handle Deephaven's datetime types effectively, so the update method is recommended in those cases.
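
For instance, if a scraped column contained timestamps as text, a cast like the following could be used. This is a minimal sketch with a hypothetical Timestamp column, assuming ISO-8601 strings that include a time zone:

from deephaven import new_table
from deephaven.column import string_col

# Hypothetical table whose timestamps arrived as text
events = new_table(
    [string_col("Timestamp", ["2024-01-01T09:30:00 ET", "2024-01-01T16:00:00 ET"])]
)

# parseInstant converts an ISO-8601 string (with a time zone) to a Deephaven Instant
events_typed = events.update(["Timestamp = parseInstant(Timestamp)"])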

Typing with astype

This example demonstrates how to add typing to the deephaven_theme_colors table created above, using astype:

from deephaven import pandas as dhpd
import pandas as pd
import os

os.system("pip install beautifulsoup4 requests")
os.system("pip install requests")
os.system("pip install lxml")
from bs4 import BeautifulSoup
import requests

url = "https://deephaven.io/core/docs/how-to-guides/data-import-export/html-import/"
headers = {"User-Agent": "Mozilla/5.0"}  # helps avoid being blocked

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

# Try to find the table
table = soup.find("table")

# Get basic headers
basic_headers = [th.text.strip() for th in table.find("thead").find_all("th")]

# Extract rows
rows = []
for tr in table.find("tbody").find_all("tr"):
    row = [td.text.strip() for td in tr.find_all("td")]
    if row:
        rows.append(row)

# Create expanded headers to match the actual number of columns
# First column keeps its name (usually 'Color')
headers = [basic_headers[0]]
# For remaining columns, use second header name with index suffix if needed
for i in range(1, len(rows[0])):
    if i == 1:
        headers.append(basic_headers[1])  # Usually 'Shades'
    else:
        headers.append(f"{basic_headers[1]}_{i-1}")  # 'Shades_1', 'Shades_2', etc.

# create the dataframe and specify types for each column
df = pd.DataFrame(rows, columns=headers).astype(
    {
        "Color": "string",
        "Shades": "string",
        "Shades_1": "string",
        "Shades_2": "string",
        "Shades_3": "string",
        "Shades_4": "string",
        "Shades_5": "string",
        "Shades_6": "string",
        "Shades_7": "string",
        "Shades_8": "string",
        "Shades_9": "string",
        "Shades_10": "string",
    }
)

# convert to table - Deephaven correctly interprets "string" as a String
deephaven_theme_colors = dhpd.to_table(df)