Import HTML files
While Deephaven does not have its own methods for reading HTML tables, it's easy to do with pandas
or BeautifulSoup. This guide will demonstrate multiple ways to pull data from online HTML tables into Deephaven tables.
pandas.read_html
pandas.read_html is a convenient method for reading HTML tables into a list of pandas DataFrames. The method can read tables from a URL or a local file.
Here's an example of how to use pandas.read_html to read an HTML table from a URL and convert it to a Deephaven table:
from deephaven import pandas as dhpd
from deephaven import merge

import os

# lxml is the HTML parser that pandas.read_html uses under the hood
os.system("pip install lxml")

import pandas as pd

# read every table on the page, convert each DataFrame to a Deephaven table, and merge the results
football_scores = merge(
    [
        dhpd.to_table(df)
        for df in pd.read_html(
            "https://www.theguardian.com/football/premierleague/table"
        )
    ]
)
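pandas.read_html works the same way with local files. Below is a minimal sketch, assuming an HTML file at /data/standings.html (a hypothetical path) that contains a single table:

from deephaven import pandas as dhpd

import pandas as pd

# /data/standings.html is a hypothetical local file containing one HTML table
# read_html returns a list of DataFrames; here we take the first one
local_scores = dhpd.to_table(pd.read_html("/data/standings.html")[0])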
BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files.
In this example, we'll use BeautifulSoup to read the same HTML table as in the example above and convert it to a Deephaven table:
import os

os.system("pip install beautifulsoup4 requests lxml")

from deephaven.column import string_col, int_col
from deephaven import new_table
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# locate the league table by its CSS classes
table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

# the first row holds the column headers; every other row holds data
for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])
# build the Deephaven table column by column, casting the text values to the right types
football_scores = new_table(
    [
        int_col("Rank", [int(row[0]) for row in rows]),
        string_col("Team", [row[1] for row in rows]),
        int_col("GamesPlayed", [int(row[2]) for row in rows]),
        int_col("Wins", [int(row[3]) for row in rows]),
        int_col("Draws", [int(row[4]) for row in rows]),
        int_col("Losses", [int(row[5]) for row in rows]),
        int_col("GoalsFor", [int(row[6]) for row in rows]),
        int_col("GoalsAgainst", [int(row[7]) for row in rows]),
        int_col("GoalDiff", [int(row[8]) for row in rows]),
        int_col("Points", [int(row[9]) for row in rows]),
        string_col("Form", [row[10] for row in rows]),
    ]
)
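The two approaches can also be combined: use BeautifulSoup to locate a specific table in the page, then hand its markup to pandas for parsing. A minimal sketch, assuming the league table is the first table element on the page:

import os

os.system("pip install beautifulsoup4 requests lxml")

from io import StringIO

from deephaven import pandas as dhpd
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://www.theguardian.com/football/premierleague/table"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# locate the table with BeautifulSoup, then let pandas parse its rows
league_table = soup.find("table")
combined_scores = dhpd.to_table(pd.read_html(StringIO(str(league_table)))[0])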
Data types
Since HTML tables store all data as plain text and have no concept of data types, some care must be taken when importing them into Deephaven to ensure that each column ends up with the correct data type. Deephaven's to_table method will infer types automatically as long as infer_objects=True, but to guarantee correct types, manual specification is recommended.
Whether you are using pandas or BeautifulSoup, you can specify the data type of each column either at the DataFrame stage, using pandas' astype method, or after the Deephaven table has been created, by typecasting columns with one of Deephaven's selection methods such as update.
Typing with selection methods
In this example, we use the same HTML table as in the examples above as a source. We read it with BeautifulSoup, build a pandas DataFrame, and convert it to a Deephaven table with to_table. Finally, we restore the correct type for each column with Deephaven's update method:
import os

os.system("pip install beautifulsoup4 requests lxml")

from deephaven import pandas as dhpd
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

df = pd.DataFrame(rows, columns=headers)

# infer_objects is set to `True` by default;
# in this case, every column is inferred as a String
football_scores = dhpd.to_table(df)

# cast each numeric column from String to int
table_typed = football_scores.update(
    [
        "P = Integer.valueOf(P)",
        "GP = Integer.valueOf(GP)",
        "W = Integer.valueOf(W)",
        "D = Integer.valueOf(D)",
        "L = Integer.valueOf(L)",
        "F = Integer.valueOf(F)",
        "A = Integer.valueOf(A)",
        "GD = Integer.valueOf(GD)",
        "Pts = Integer.valueOf(Pts)",
    ]
)
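The scraped header names (P, GP, W, and so on) can also be replaced with more readable ones. A short sketch using Deephaven's rename_columns on the table_typed table from above:

table_renamed = table_typed.rename_columns(
    [
        "Rank = P",
        "GamesPlayed = GP",
        "Wins = W",
        "Draws = D",
        "Losses = L",
        "GoalsFor = F",
        "GoalsAgainst = A",
        "GoalDiff = GD",
        "Points = Pts",
    ]
)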
Note that pandas' DataFrame.astype method can also be used to restore typing. However, it does not handle Deephaven's date-time types effectively, so the update method is recommended in those cases.
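For example, a timestamp column imported as text can be converted with Deephaven's built-in parseInstant function. A minimal sketch with a hypothetical KickoffTime column:

from deephaven import new_table
from deephaven.column import string_col

# hypothetical table with a timestamp column imported as plain text
fixtures = new_table(
    [string_col("KickoffTime", ["2024-08-17T15:00:00 ET", "2024-08-17T17:30:00 ET"])]
)

# parseInstant converts the strings to Deephaven's Instant type
fixtures_typed = fixtures.update(["KickoffTime = parseInstant(KickoffTime)"])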
Typing with astype
This example demonstrates how to add typing to the football_scores table created above, using astype:
import os

os.system("pip install beautifulsoup4 requests lxml")

from deephaven import pandas as dhpd
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://www.theguardian.com/football/premierleague/table"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

table = soup.find(
    "table",
    class_="table table--football table--league-table table--responsive-font table--striped",
)

headers = []
rows = []

for i, row in enumerate(table.find_all("tr")):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all("th")]
    else:
        rows.append([el.text.strip() for el in row.find_all("td")])

# create the DataFrame and specify a type for each column
df = pd.DataFrame(rows, columns=headers).astype(
    {
        "P": "int32",
        "Team": "string",
        "GP": "int32",
        "W": "int32",
        "D": "int32",
        "L": "int32",
        "F": "int32",
        "A": "int32",
        "GD": "int32",
        "Pts": "int32",
        "Form": "string",
    }
)

# convert to a table - Deephaven correctly interprets "int32" as an int and "string" as a String
football_scores = dhpd.to_table(df)