Removing the top header from a DataFrame
Question:
I would like to read a table from the following page:
Countries by GDP
I have tried the pandas read_html command and got the following result:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = pd.DataFrame(pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2])
print(data.head())
Country/Territory UN Region ... United Nations[15]
Country/Territory UN Region ... Estimate Year
0 World — ... 85328323 2020
1 United States Americas ... 20893746 2020
2 China Asia ... 14722801 [n 1]2020
3 Japan Asia ... 5057759 2020
4 Germany Europe ... 3846414 2020
The question is: how can I remove the first header row? For instance, we might iterate through all rows:
for index, row in data.iterrows():
    print(index, row)
and create an empty DataFrame holding all elements starting from index one (so drop the first index and save the rest in the empty DataFrame), but I am sure there exists a more professional way (maybe with a little help from regular expressions, or BeautifulSoup?). Thanks in advance.
Edit:
With the help of a great person, I have solved this problem, but there is one additional issue. Please look at the table below:
Country/Territory UN Region Estimate ... Year Estimate Year
0 World — 101560901 ... 2021 85328323 2020
1 United States Americas 25035164 ... 2021 20893746 2020
2 China Asia 18321197 ... [n 3]2021 14722801 [n 1]2020
3 Japan Asia 4300621 ... 2021 5057759 2020
4 Germany Europe 4031149 ... 2021 3846414 2020
You can see the unnecessary symbols [n 1] in front of some of the data, maybe because of NaN or something like that. Can we just filter out those markers and leave the rest?
Edit:
I have tried to create one common function and then apply it to the whole Year column. Here is a very simple example. If we extract only the first element:
result = data.loc[2, "Year"][0]
which is equal to: [n 1]2022
Then of course I can split it:
print(result.split("]")[1])
which now gives: 2022. Based on this information I have created the following function:
def split_column(text):
    if len(text) == 4:
        return int(text)
    else:
        return int(text.split("]")[1])
The logic is that len(text) == 4 means the string consists of only 4 digits (like 2020), so we can convert it to a number directly; otherwise we apply the split logic from the previous example. But when I run:
data['Year'] = data['Year'].apply(split_column, axis=1)
it gives me the error:
AttributeError: 'Series' object has no attribute 'split'. Did you mean: 'plot'?
Isn't that strange? Thanks in advance.
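A likely explanation for that error: at the point where apply was called, the frame still contained duplicated Year columns, so data['Year'] returns a DataFrame rather than a Series, and DataFrame.apply(..., axis=1) hands each row to the function as a Series. A minimal sketch (with made-up values) that reproduces the message:

```python
import pandas as pd

# Toy frame with two columns both named "Year", as read_html can produce
# before duplicate labels are dropped.
df = pd.DataFrame([["2020", "[n 1]2020"]], columns=["Year", "Year"])

# Selecting a duplicated label returns a DataFrame, not a Series.
print(type(df["Year"]))  # <class 'pandas.core.frame.DataFrame'>

# apply(..., axis=1) therefore passes each *row* (a Series) to the function,
# and calling .split on a Series raises AttributeError.
try:
    df["Year"].apply(lambda text: text.split("]"), axis=1)
except AttributeError as e:
    print(e)
```

Dropping the duplicated columns first, or using map on a single Year Series as in the accepted code below, avoids this.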
Solved
Finally it is solved with the following code:
def split_column(text):
    if len(text) == 4:
        return int(text)
    elif text == "—":
        return 0
    else:
        return int(text.split("]")[1])
Here is the complete code:
import pandas as pd

wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

def split_column(text):
    # Plain 4-digit year: convert directly.
    if len(text) == 4:
        return int(text)
    # Em-dash placeholder for missing data.
    elif text == "—":
        return 0
    # Footnote-prefixed year like "[n 1]2020": keep the part after "]".
    else:
        return int(text.split("]")[1])

data = (
    pd.DataFrame(pd.read_html(wiki_link)[2])
    .droplevel(0, axis=1)
    .loc[:, lambda x: ~x.columns.duplicated()]
)
data.dropna(inplace=True)
data['Year'] = data['Year'].map(split_column)
print(data.head())
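For comparison, the same cleanup can be done without a custom function. This is a sketch using str.extract, where the trailing-4-digit regex and the 0 fill for the em-dash rows are assumptions chosen to mirror split_column:

```python
import pandas as pd

# Sample Year values as they come out of read_html: plain years,
# footnote-prefixed years, and the em-dash placeholder.
year = pd.Series(["2020", "[n 1]2020", "—", "2022"])

# Grab the trailing 4-digit year; rows with no match (the "—") become NaN,
# which we map to 0 to mirror split_column's behaviour.
cleaned = year.str.extract(r"(\d{4})$")[0].fillna(0).astype(int)
print(cleaned.tolist())  # [2020, 2020, 0, 2022]
```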
Answers:
It seems the columns are a MultiIndex, so you can use droplevel:
wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
data = (
    pd.DataFrame(pd.read_html(wiki_link)[2])
    .droplevel(0, axis=1)  # <- drop the top header level here
    .loc[:, lambda x: ~x.columns.duplicated()]
    .query(r'~Year.str.contains("\[")')
)
Output:
print(data)
Country/Territory UN Region Estimate Year
0 World — 101560901 2022
1 United States Americas 25035164 2022
3 Japan Asia 4300621 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
[207 rows x 4 columns]
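The `.loc[:, lambda x: ~x.columns.duplicated()]` step is what removes the repeated labels. Here is a toy illustration of the mechanism (the column names are invented for the example):

```python
import pandas as pd

# After droplevel, read_html leaves several columns sharing the same name.
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=["Country/Territory", "Estimate", "Year", "Year"])

# columns.duplicated() flags every repeat of a label after its first occurrence,
# here only the second "Year".
mask = df.columns.duplicated()

# Keeping only the non-duplicated labels leaves one column per name.
deduped = df.loc[:, ~mask]
print(list(deduped.columns))  # ['Country/Territory', 'Estimate', 'Year']
```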
Update:
Based on @Corralien's comment, it seems like you want this:
df = pd.DataFrame(pd.read_html(wiki_link)[2])
out = (
    df.iloc[:, :2].droplevel(0, axis=1)
    .join(df.iloc[:, 2:]
          .pipe(lambda df_: df_.set_axis([f"{c2} ({c1.split('[')[0]})"
                                          for c1, c2 in df_.columns], axis=1)))
    .replace(r"\[.*?\]", "", regex=True)
)
Output:
print(out)
Country/Territory UN Region Estimate (IMF) Year (IMF)
0 World — 101560901 2022
1 United States Americas 25035164 2022
2 China Asia 18321197 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
Estimate (World Bank) Year (World Bank) Estimate (United Nations)
0 96513077 2021 85328323
1 22996100 2021 20893746
2 17734063 2021 14722801
.. ... ... ...
214 133 2021 135
215 — — 68
216 63 2021 55
Year (United Nations)
0 2020
1 2020
2 2020
.. ...
214 2020
215 2020
216 2020
[217 rows x 8 columns]
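The renaming trick in the update, building flat labels from the two header levels while stripping the footnote marker with c1.split('[')[0], can be seen on a toy MultiIndex (the source name here is made up for illustration):

```python
import pandas as pd

# Two-level header like the one read_html builds, including a footnote
# reference in the top level.
cols = pd.MultiIndex.from_tuples([("IMF[1]", "Estimate"), ("IMF[1]", "Year")])
df_ = pd.DataFrame([[25035164, 2022]], columns=cols)

# Combine the level-2 label with the cleaned level-1 label into one flat name.
flat = df_.set_axis([f"{c2} ({c1.split('[')[0]})" for c1, c2 in df_.columns],
                    axis=1)
print(list(flat.columns))  # ['Estimate (IMF)', 'Year (IMF)']
```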
You should parse the HTML table manually:
import pandas as pd
import numpy as np
import requests
import bs4
import re

# Get the HTML table
resp = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
table = soup.find('caption', text=re.compile('GDP.*by country')).parent

# Create headers
header = pd.MultiIndex.from_product([['IMF', 'World Bank', 'United Nations'],
                                     ['Estimate', 'Year']],
                                    names=['Source', 'Data'])

# Parse data
data = {}
for row in table.find_all('tr', {'class': None}):
    cols = row.find_all('td')
    key = cols[0].text.strip(), cols[1].text.strip()
    data[key] = []
    for col in cols[2:]:
        if col.has_attr('colspan'):
            data[key].append(np.nan)  # Estimate
            data[key].append(pd.NA)   # Year
            continue
        # Strip footnote markers like "[n 1]" and thousands separators.
        val = int(re.sub(r'\[.*?\]', '', col.text).strip().replace(',', ''))
        data[key].append(val)

df = pd.DataFrame.from_dict(data, orient='index', columns=header)
df.index = pd.MultiIndex.from_tuples(df.index, names=['Country', 'Territory'])
Output:
>>> df
Source IMF World Bank United Nations
Data Estimate Year Estimate Year Estimate Year
Country Territory
United States Americas 25035164.0 2022 22996100.0 2021 20893746.0 2020
China Asia 18321197.0 2022 17734063.0 2021 14722801.0 2020
Japan Asia 4300621.0 2022 4937422.0 2021 5057759.0 2020
Germany Europe 4031149.0 2022 4223116.0 2021 3846414.0 2020
India Asia 3468566.0 2022 3173398.0 2021 2664749.0 2020
... ... ... ... ... ... ...
Palau Oceania 226.0 2022 218.0 2021 264.0 2020
Kiribati Oceania 207.0 2022 207.0 2021 181.0 2020
Nauru Oceania 134.0 2022 133.0 2021 135.0 2020
Montserrat Americas NaN <NA> NaN <NA> 68.0 2020
Tuvalu Oceania 64.0 2022 63.0 2021 55.0 2020
[216 rows x 6 columns]
Now you can reshape your dataframe:
>>> pd.concat([df['IMF'], df['World Bank'], df['United Nations']], keys=df.columns.levels[0])
Data Estimate Year
Source Country Territory
IMF United States Americas 25035164.0 2022
China Asia 18321197.0 2022
Japan Asia 4300621.0 2022
Germany Europe 4031149.0 2022
India Asia 3468566.0 2022
... ... ...
World Bank Palau Oceania 264.0 2020
Kiribati Oceania 181.0 2020
Nauru Oceania 135.0 2020
Montserrat Americas 68.0 2020
Tuvalu Oceania 55.0 2020
[648 rows x 2 columns]
Aggregate Estimate by its mean over (Territory, Country, Year) across the different sources:
>>> (pd.concat([df['IMF'], df['World Bank'], df['United Nations']])
.groupby(['Territory', 'Country', 'Year'])['Estimate'].mean())
Territory Country Year
Africa Algeria 2020 147689.0
2021 167983.0
2022 187155.0
Angola 2020 62307.0
2021 72547.0
...
Oceania Tuvalu 2021 63.0
2022 64.0
Vanuatu 2020 855.0
2021 984.0
2022 984.0
Name: Estimate, Length: 608, dtype: float64