Removing the top header from a DataFrame
Question:
I would like to read a table from the following page:
Countries by GDP
I have tried the pandas read_html command and got the following result:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = pd.DataFrame(pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2])
print(data.head())
Country/Territory UN Region ... United Nations[15]
Country/Territory UN Region ... Estimate Year
0 World — ... 85328323 2020
1 United States Americas ... 20893746 2020
2 China Asia ... 14722801 [n 1]2020
3 Japan Asia ... 5057759 2020
4 Germany Europe ... 3846414 2020
The question is: how can I remove the first header row? For instance, we might iterate through all rows:
for index, row in data.iterrows():
    print(index, row)
and create an empty DataFrame holding all elements starting from index one (so drop the first index and save the rest in the empty DataFrame), but I am sure there exists a more professional way (maybe with a little help from regular expressions, or BeautifulSoup?). Thanks in advance.
Edit:
With the help of a great person, I have solved this problem, but there is one additional issue. Please look at the table below:
Country/Territory UN Region Estimate ... Year Estimate Year
0 World — 101560901 ... 2021 85328323 2020
1 United States Americas 25035164 ... 2021 20893746 2020
2 China Asia 18321197 ... [n 3]2021 14722801 [n 1]2020
3 Japan Asia 4300621 ... 2021 5057759 2020
4 Germany Europe 4031149 ... 2021 3846414 2020
You can see the unnecessary symbols [n 1] in front of some of the data, maybe because of NaN or something like that. Can we just filter out those markers and leave the rest?
Edit:
I have tried to create one common function and then apply it to the whole Year column. Here is a very simple example. If we extract only the first element:
result = data.loc[2, "Year"][0]
which is equal to: [n 1]2022
Then of course I can split it:
print(result.split("]")[1])
which now gives: 2022. Based on this information I have created the following function:
def split_column(text):
    if len(text) == 4:
        return int(text)
    else:
        return int(text.split("]")[1])
The logic is that len(text) == 4 means the string consists of only 4 digits (like 2020), so we can convert it to a number directly; otherwise we apply the split logic from the previous example. But when I run:
data['Year'] = data['Year'].apply(split_column, axis=1)
it gives me the error:
AttributeError: 'Series' object has no attribute 'split'. Did you mean: 'plot'?
Isn't that strange? Thanks in advance.
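A likely explanation for that error: at the point where apply was called, the frame still contained duplicated Year columns, so data['Year'] returns a DataFrame rather than a Series, and DataFrame.apply(..., axis=1) hands each row to the function as a Series. A minimal sketch (with made-up values) that reproduces the message:

```python
import pandas as pd

# Toy frame with two columns both named "Year", as read_html can produce
# before duplicate labels are dropped.
df = pd.DataFrame([["2020", "[n 1]2020"]], columns=["Year", "Year"])

# Selecting a duplicated label returns a DataFrame, not a Series.
print(type(df["Year"]))  # <class 'pandas.core.frame.DataFrame'>

# apply(..., axis=1) therefore passes each *row* (a Series) to the function,
# and calling .split on a Series raises AttributeError.
try:
    df["Year"].apply(lambda text: text.split("]"), axis=1)
except AttributeError as e:
    print(e)
```

Dropping the duplicated columns first, or using map on a single Year Series as in the accepted code below, avoids this.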
Solved
Finally it is solved with the following code:
def split_column(text):
    if len(text) == 4:
        return int(text)
    elif text == "—":
        return 0
    else:
        return int(text.split("]")[1])
Here is the complete code:
import pandas as pd

wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

def split_column(text):
    # Plain 4-digit year: convert directly.
    if len(text) == 4:
        return int(text)
    # Em-dash placeholder for missing data.
    elif text == "—":
        return 0
    # Footnote-prefixed year like "[n 1]2020": keep the part after "]".
    else:
        return int(text.split("]")[1])

data = (
    pd.DataFrame(pd.read_html(wiki_link)[2])
    .droplevel(0, axis=1)
    .loc[:, lambda x: ~x.columns.duplicated()]
)
data.dropna(inplace=True)
data['Year'] = data['Year'].map(split_column)
print(data.head())
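For comparison, the same cleanup can be done without a custom function. This is a sketch using str.extract, where the trailing-4-digit regex and the 0 fill for the em-dash rows are assumptions chosen to mirror split_column:

```python
import pandas as pd

# Sample Year values as they come out of read_html: plain years,
# footnote-prefixed years, and the em-dash placeholder.
year = pd.Series(["2020", "[n 1]2020", "—", "2022"])

# Grab the trailing 4-digit year; rows with no match (the "—") become NaN,
# which we map to 0 to mirror split_column's behaviour.
cleaned = year.str.extract(r"(\d{4})$")[0].fillna(0).astype(int)
print(cleaned.tolist())  # [2020, 2020, 0, 2022]
```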
Answers:
It seems the columns are a MultiIndex, so you can use droplevel:
wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
data = (
    pd.DataFrame(pd.read_html(wiki_link)[2])
    .droplevel(0, axis=1)  # <- drop the top header level here
    .loc[:, lambda x: ~x.columns.duplicated()]
    .query(r'~Year.str.contains("\[")')
)
Output:
print(data)
Country/Territory UN Region Estimate Year
0 World — 101560901 2022
1 United States Americas 25035164 2022
3 Japan Asia 4300621 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
[207 rows x 4 columns]
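The `.loc[:, lambda x: ~x.columns.duplicated()]` step is what removes the repeated labels. Here is a toy illustration of the mechanism (the column names are invented for the example):

```python
import pandas as pd

# After droplevel, read_html leaves several columns sharing the same name.
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=["Country/Territory", "Estimate", "Year", "Year"])

# columns.duplicated() flags every repeat of a label after its first occurrence,
# here only the second "Year".
mask = df.columns.duplicated()

# Keeping only the non-duplicated labels leaves one column per name.
deduped = df.loc[:, ~mask]
print(list(deduped.columns))  # ['Country/Territory', 'Estimate', 'Year']
```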
Update:
Based on @Corralien's comment, it seems like you want this:
df = pd.DataFrame(pd.read_html(wiki_link)[2])
out = (
    df.iloc[:, :2].droplevel(0, axis=1)
    .join(df.iloc[:, 2:]
          .pipe(lambda df_: df_.set_axis([f"{c2} ({c1.split('[')[0]})"
                                          for c1, c2 in df_.columns], axis=1)))
    .replace(r"\[.*?\]", "", regex=True)
)
Output:
print(out)
Country/Territory UN Region Estimate (IMF) Year (IMF)
0 World — 101560901 2022
1 United States Americas 25035164 2022
2 China Asia 18321197 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
Estimate (World Bank) Year (World Bank) Estimate (United Nations)
0 96513077 2021 85328323
1 22996100 2021 20893746
2 17734063 2021 14722801
.. ... ... ...
214 133 2021 135
215 — — 68
216 63 2021 55
Year (United Nations)
0 2020
1 2020
2 2020
.. ...
214 2020
215 2020
216 2020
[217 rows x 8 columns]
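The renaming trick in the update, building flat labels from the two header levels while stripping the footnote marker with c1.split('[')[0], can be seen on a toy MultiIndex (the source name here is made up for illustration):

```python
import pandas as pd

# Two-level header like the one read_html builds, including a footnote
# reference in the top level.
cols = pd.MultiIndex.from_tuples([("IMF[1]", "Estimate"), ("IMF[1]", "Year")])
df_ = pd.DataFrame([[25035164, 2022]], columns=cols)

# Combine the level-2 label with the cleaned level-1 label into one flat name.
flat = df_.set_axis([f"{c2} ({c1.split('[')[0]})" for c1, c2 in df_.columns],
                    axis=1)
print(list(flat.columns))  # ['Estimate (IMF)', 'Year (IMF)']
```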
You should parse the HTML table manually:
import pandas as pd
import numpy as np
import requests
import bs4
import re

# Get the HTML table
resp = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
table = soup.find('caption', text=re.compile('GDP.*by country')).parent

# Create headers
header = pd.MultiIndex.from_product([['IMF', 'World Bank', 'United Nations'],
                                     ['Estimate', 'Year']],
                                    names=['Source', 'Data'])

# Parse data
data = {}
for row in table.find_all('tr', {'class': None}):
    cols = row.find_all('td')
    key = cols[0].text.strip(), cols[1].text.strip()
    data[key] = []
    for col in cols[2:]:
        if col.has_attr('colspan'):
            data[key].append(np.nan)  # Estimate
            data[key].append(pd.NA)   # Year
            continue
        # Strip footnote markers like "[n 1]" and thousands separators.
        val = int(re.sub(r'\[.*?\]', '', col.text).strip().replace(',', ''))
        data[key].append(val)

df = pd.DataFrame.from_dict(data, orient='index', columns=header)
df.index = pd.MultiIndex.from_tuples(df.index, names=['Country', 'Territory'])
Output:
>>> df
Source IMF World Bank United Nations
Data Estimate Year Estimate Year Estimate Year
Country Territory
United States Americas 25035164.0 2022 22996100.0 2021 20893746.0 2020
China Asia 18321197.0 2022 17734063.0 2021 14722801.0 2020
Japan Asia 4300621.0 2022 4937422.0 2021 5057759.0 2020
Germany Europe 4031149.0 2022 4223116.0 2021 3846414.0 2020
India Asia 3468566.0 2022 3173398.0 2021 2664749.0 2020
... ... ... ... ... ... ...
Palau Oceania 226.0 2022 218.0 2021 264.0 2020
Kiribati Oceania 207.0 2022 207.0 2021 181.0 2020
Nauru Oceania 134.0 2022 133.0 2021 135.0 2020
Montserrat Americas NaN <NA> NaN <NA> 68.0 2020
Tuvalu Oceania 64.0 2022 63.0 2021 55.0 2020
[216 rows x 6 columns]
Now you can reshape your dataframe:
>>> pd.concat([df['IMF'], df['World Bank'], df['United Nations']], keys=df.columns.levels[0])
Data Estimate Year
Source Country Territory
IMF United States Americas 25035164.0 2022
China Asia 18321197.0 2022
Japan Asia 4300621.0 2022
Germany Europe 4031149.0 2022
India Asia 3468566.0 2022
... ... ...
World Bank Palau Oceania 264.0 2020
Kiribati Oceania 181.0 2020
Nauru Oceania 135.0 2020
Montserrat Americas 68.0 2020
Tuvalu Oceania 55.0 2020
[648 rows x 2 columns]
Aggregate Estimate by its mean over (Territory, Country, Year) across the different sources:
>>> (pd.concat([df['IMF'], df['World Bank'], df['United Nations']])
.groupby(['Territory', 'Country', 'Year'])['Estimate'].mean())
Territory Country Year
Africa Algeria 2020 147689.0
2021 167983.0
2022 187155.0
Angola 2020 62307.0
2021 72547.0
...
Oceania Tuvalu 2021 63.0
2022 64.0
Vanuatu 2020 855.0
2021 984.0
2022 984.0
Name: Estimate, Length: 608, dtype: float64