
Import HTML files

While Deephaven does not have its own methods for reading HTML tables, it's easy to do with pandas or BeautifulSoup. This guide will demonstrate multiple ways to pull data from online HTML tables into Deephaven tables.

pandas.read_html

pandas.read_html is a convenient method for reading HTML tables into a list of pandas DataFrames. The method can be used to read tables from a URL or a local file.

Here's an example of how to use pandas.read_html to read an HTML table from a URL and convert it to a Deephaven table:

from deephaven import pandas as dhpd
from deephaven import merge

import os

os.system("pip install lxml")

import pandas as pd

# read_html returns one DataFrame per <table> on the page;
# convert each to a Deephaven table, then merge them into one
football_scores = merge(
    [
        dhpd.to_table(df)
        for df in pd.read_html("https://www.theguardian.com/football/premierleague/table")
    ]
)

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files.

In this example, we'll use BeautifulSoup to read the same HTML table as in the example above and convert it to a Deephaven table:

import os

os.system("pip install beautifulsoup4 requests lxml")
from deephaven.column import string_col, int_col
from deephaven import new_table
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")

table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

football_scores = new_table(
    [
        int_col("Rank", [int(row[0]) for row in rows]),
        string_col("Team", [row[1] for row in rows]),
        int_col("GamesPlayed", [int(row[2]) for row in rows]),
        int_col("Wins", [int(row[3]) for row in rows]),
        int_col("Draws", [int(row[4]) for row in rows]),
        int_col("Losses", [int(row[5]) for row in rows]),
        int_col("GoalsFor", [int(row[6]) for row in rows]),
        int_col("GoalsAgainst", [int(row[7]) for row in rows]),
        int_col("GoalDiff", [int(row[8]) for row in rows]),
        int_col("Points", [int(row[9]) for row in rows]),
        string_col("Form", [row[10] for row in rows]),
    ]
)

Data types

Since HTML tables store all data as plain text and have no concept of data types, some care must be taken when importing HTML tables into Deephaven to ensure that each column ends up with the correct data type. Deephaven's to_table method will automatically infer types as long as infer_objects=True, but to guarantee that the types are correct, manual specification is recommended.
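As a minimal sketch of the problem (using a made-up two-row DataFrame rather than the scraped table), here is how scraped text columns look before and after manual typing at the DataFrame stage:

```python
import pandas as pd

# Hypothetical miniature of a scraped HTML table: every value arrives as text
df = pd.DataFrame({"Rank": ["1", "2"], "Team": ["Arsenal", "Chelsea"]})
print(df.dtypes["Rank"])  # object - the numbers are still strings

# Specify the types explicitly before handing the DataFrame to to_table
typed = df.astype({"Rank": "int32", "Team": "string"})
print(typed.dtypes["Rank"])  # int32
```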

Whether you are using pandas or BeautifulSoup, you can specify each column's data type either at the DataFrame stage with pandas' astype method, or after the Deephaven table has been created by typecasting columns with update or another of Deephaven's selection methods.

Typing with selection methods

In this example, we use the same HTML table as in the examples above as a source. We read the table with BeautifulSoup into a pandas DataFrame, convert it to a Deephaven table with to_table, and finally restore the correct type for each column with Deephaven's update method:

from deephaven import pandas as dhpd
import pandas as pd
import os

os.system("pip install beautifulsoup4 requests lxml")
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")

table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

df = pd.DataFrame(rows, columns=headers)

# infer_objects is set to `True` by default
# in this case, all columns are inferred as Strings
football_scores = dhpd.to_table(df)

table_typed = football_scores.update(
    [
        "P = Integer.valueOf(P)",
        "GP = Integer.valueOf(GP)",
        "W = Integer.valueOf(W)",
        "D = Integer.valueOf(D)",
        "L = Integer.valueOf(L)",
        "F = Integer.valueOf(F)",
        "A = Integer.valueOf(A)",
        "GD = Integer.valueOf(GD)",
        "Pts = Integer.valueOf(Pts)",
    ]
)

Note that the pandas DataFrame.astype method can also be used to restore typing. However, it does not handle Deephaven's date-time types effectively, so the update method is recommended in those cases.
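For instance, suppose the scraped table had a column of date strings (hypothetical here; the league table has none). At the DataFrame stage, pandas.to_datetime is a more reliable fix than astype for such a column:

```python
import pandas as pd

# Hypothetical date column scraped from an HTML table as plain text
df = pd.DataFrame({"MatchDate": ["2024-08-17", "2024-08-24"]})

# to_datetime parses the strings into timezone-aware datetime64 values,
# which to_table can then convert to Deephaven date-times
df["MatchDate"] = pd.to_datetime(df["MatchDate"], utc=True)
print(df["MatchDate"].dtype)  # datetime64[ns, UTC]
```

Alternatively, the same conversion can be done after the Deephaven table is created, by calling update with one of the query language's date-time parsing functions.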

Typing with astype

This example demonstrates how to add typing to the football_scores table created above, using astype:

import os

os.system("pip install beautifulsoup4 requests lxml")

from deephaven import pandas as dhpd
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")

table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

# create the dataframe and specify types for each column
df = pd.DataFrame(rows, columns=headers).astype(
    {
        "P": "int32",
        "Team": "string",
        "GP": "int32",
        "W": "int32",
        "D": "int32",
        "L": "int32",
        "F": "int32",
        "A": "int32",
        "GD": "int32",
        "Pts": "int32",
        "Form": "string",
    }
)

# convert to table - Deephaven correctly interprets "int32" as an int and "string" as a String
football_scores = dhpd.to_table(df)