How to include attributes of HTML table as a multiindex using Pandas?

Question:

I’m trying to read HTML from the following URL into a pandas dataframe:

https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/

The rendered HTML tables look like the following where there are N tables I’m interested in and 1 (the last one) that I’m not (i.e., I’m interested in the ones that don’t start with "No secondary metabolite"):
enter image description here

When I read HTML via pandas I get 3 tables. Note, the last table from pd.read_html isn’t the "No secondary metabolite" table but a concatenated table of the ones I’m interested in prefixed with "NZ_" in the header.

enter image description here

My question is if there is a way to include the headers of the rendered table as a multiindex?

For instance, I’m looking for a resulting table that looks like this:

# Read HTML Tables
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")

# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))

# Manual prepending of title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))

# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)

# Replace &nbsp characters with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace("&nbsp","_")))

# Multiindex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat

enter image description here

Asked By: O.rka

||

Answers:

Try beautifulsoup to parse the HTML and construct the final dataframe:

import requests
import pandas as pd
from bs4 import BeautifulSoup


id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select(".record-overview-details table"):
    header = table.find_previous(class_="record-overview-header").text.split()[
        0
    ]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)

final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)

Prints:

                                                       Type     From       To Most similar known cluster Most similar known cluster.1 Similarity
level_0         level_1       Region                                                                                                            
GCF_006385935.1 NZ_CP041066.1 Region&nbsp1.1        terpene  1123901  1143342                 carotenoid                      Terpene        50%
                              Region&nbsp1.2    phosphonate  1252463  1293980                        NaN                          NaN        NaN
                              Region&nbsp1.3          T3PKS  1944360  1985445                        NaN                          NaN        NaN
                              Region&nbsp1.4        terpene  2690187  2709232                        NaN                          NaN        NaN
                              Region&nbsp1.5        terpene  4260236  4281054                  surfactin              NRP:Lipopeptide        13%
                              Region&nbsp1.6    siderophore  4446861  4463436                        NaN                          NaN        NaN
                NZ_CP041065.1 Region&nbsp3.1  lanthipeptide    98352   124802                        NaN                          NaN        NaN
Answered By: Andrej Kesely