What is a more efficient or "pythonic" way to clean column headers?
Question:
I’m pulling some data from the Pro Football Reference website. All of the information pulled fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn’t quite feel "right." It seems too repetitive, as I keep reassigning the same variable over and over in the same for loop.
import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

# Cleaning column headers
col_headers = []
# The character class includes the single quote left over from str(tuple)
regex = re.compile("[()':_0-9]")
for x in df.columns:
    y = (str(x).replace('1st', 'First'))
    y = (y.replace('%', 'Pct'))
    y = regex.sub('', y)
    y = y.strip('Unnamed level, ')
    col_headers.append(y)
In the code above, I collect the cleaned column headers in a list, which I then use to reassign the column names. However, I feel like I’m not approaching the problem efficiently, and I’m wondering if anyone has any advice on how to better structure this part of my code.
Answers:
Maybe something like this; your example works fine too.
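# Compile once, outside the function
regex = re.compile("[()':_0-9]")

def clean_column_header(column):
    # str() because each header is a tuple when the columns are multi-level
    column = (str(column)
              .replace("1st", "First")
              .replace("%", "Pct"))
    # Remove punctuation and digits first, then strip the leftover characters
    return regex.sub("", column).strip("Unnamed level, ")

# Assign the cleaned names directly; this works for flat and multi-level columns
df.columns = [clean_column_header(column) for column in df.columns]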
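A variation on the helper-function idea, as a sketch only: instead of stringifying the whole tuple and scrubbing characters out of it, this joins the levels and skips the `Unnamed: N_level_M` placeholders that pandas generates. The sample `raw_columns` list, the `REPLACEMENTS` table and the `UNNAMED` pattern are assumptions made for illustration; they are not part of the code above.

```python
import re

# Hypothetical sample of the two-level headers read_html returns for this page.
raw_columns = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Tm"),
    ("Passing", "1stD"),
    ("Penalties", "Pen"),
    ("Unnamed: 25_level_0", "Sc%"),
]

# Substring replacements applied to every joined name (assumed set).
REPLACEMENTS = {"1st": "First", "%": "Pct"}

# Matches the placeholder labels pandas generates for empty header cells.
UNNAMED = re.compile(r"^Unnamed: \d+_level_\d+$")


def clean_column_header(column):
    """Join the levels of one header tuple, skipping placeholder levels."""
    parts = [str(level) for level in column if not UNNAMED.match(str(level))]
    name = ", ".join(parts)
    for old, new in REPLACEMENTS.items():
        name = name.replace(old, new)
    return name


cleaned = [clean_column_header(c) for c in raw_columns]
print(cleaned)
# ['Rk', 'Tm', 'Passing, FirstD', 'Penalties, Pen', 'ScPct']
```

On the real frame this would be applied as `df.columns = [clean_column_header(c) for c in df.columns]`.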
Each column name is a tuple, so take its second element and clean that:
Code:
cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)
Output:
['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
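One thing to watch with this approach: keeping only the second level leaves repeated names in the output above (for example, 'Yds' shows up under several groups). A quick check, using the list exactly as printed; the `repeated` name is just for this sketch:

```python
from collections import Counter

# Second-level names exactly as printed in the output above.
cols = ['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD',
        'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds',
        'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']

# Names that occur more than once would become ambiguous column labels.
repeated = {name: count for name, count in Counter(cols).items() if count > 1}
print(repeated)
# {'Yds': 4, 'FirstD': 3, 'Att': 2, 'TD': 2}
```

If those duplicates matter for later column selection, keeping the top-level group name in the label is one way around it.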
You can use the .str methods of pd.Index:
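df.columns = (df.columns
              .map(str)  # multi-level columns have no .str accessor, so turn each tuple into a string first
              .str.replace('1st', 'First', regex=False)
              .str.replace('%', 'Pct', regex=False)
              .str.replace(r"[()':_0-9]", '', regex=True)
              .str.strip('Unnamed level, '))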
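One caveat with `strip('Unnamed level, ')` as used above: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. That can shorten real names. The sketch below shows the effect on a few header strings and one possible alternative; `drop_prefix` is a hypothetical helper, not a pandas or standard-library function.

```python
chars = "Unnamed level, "

# Works as hoped: every leading character happens to be in the set.
print("Unnamed level, Rk".strip(chars))   # Rk

# Also eats trailing characters that belong to the real name.
print("Tm".strip(chars))                  # T
print("Penalties, Pen".strip(chars))      # Penalties, P


def drop_prefix(name, prefix="Unnamed level, "):
    """Remove prefix only when name actually starts with it."""
    return name[len(prefix):] if name.startswith(prefix) else name


print(drop_prefix("Unnamed level, Tm"))   # Tm
print(drop_prefix("Penalties, Pen"))      # Penalties, Pen
```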
Convert the multi-level column index to a DataFrame and clean it there.
import numpy as np

obj_col = pd.DataFrame(df.columns.tolist())
# replace the 'Unnamed:' placeholders in column level 0 with ''
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])
# special handling for level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))
# concatenate, then strip the leading ', ' left by empty level-0 names
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                       .str.strip(', '))
print(obj_col)
Output:
                      0      1            0_       1_            col_name
 0   Unnamed: 0_level_0     Rk                     Rk                  Rk
 1   Unnamed: 1_level_0     Tm                     Tm                  Tm
 2   Unnamed: 2_level_0      G                      G                   G
 3   Unnamed: 3_level_0     PF                     PF                  PF
 4   Unnamed: 4_level_0    Yds                    Yds                 Yds
 5         Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
 6         Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
 7         Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
 8   Unnamed: 8_level_0     FL                     FL                  FL
 9   Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP
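To finish this approach, the `col_name` column still has to be assigned back to the frame. A minimal sketch on a made-up one-row frame: the values and the three sample headers are assumptions, and only the header shape mirrors the scraped table.

```python
import numpy as np
import pandas as pd

# Made-up one-row frame with the same two-level header shape as the scraped table.
df = pd.DataFrame(
    [[1, "Team A", 150]],
    columns=pd.MultiIndex.from_tuples([
        ("Unnamed: 0_level_0", "Rk"),
        ("Unnamed: 1_level_0", "Tm"),
        ("Passing", "1stD"),
    ]),
)

# One row per header, one column per header level.
obj_col = pd.DataFrame(df.columns.tolist())

# Blank out the placeholder labels in level 0.
obj_col["0_"] = np.where(obj_col[0].str.contains("Unnamed:"), "", obj_col[0])
# Literal substring replacement in level 1.
obj_col["1_"] = obj_col[1].str.replace("1st", "First", regex=False)
# Join both levels; strip the ', ' left over where level 0 was blank.
obj_col["col_name"] = (obj_col["0_"] + ", " + obj_col["1_"]).str.strip(", ")

# Replace the two-level headers with the flat names.
df.columns = obj_col["col_name"].tolist()
print(df.columns.tolist())
# ['Rk', 'Tm', 'Passing, FirstD']
```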
pd.DataFrame has a rename method that you can use directly for such cases:
import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

# The character class includes the single quote left over from str(tuple)
regex = re.compile("[()':_0-9]")
# With multi-level columns, rename() applies the function to each level's label
df.rename(columns=lambda x: regex.sub('', str(x).replace('1st', 'First').replace('%', 'Pct')).strip('Unnamed level, '), inplace=True)
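A note on what `rename` does here, as far as I can tell from how pandas behaves with multi-level columns: the function is called on each level's label separately, not on the whole tuple, and the result is still a MultiIndex. The sketch below uses a made-up frame to show that; flattening is then a separate step. The sample values and the `flat` list are assumptions for illustration.

```python
import pandas as pd

# Made-up frame with two-level headers shaped like the scraped table.
df = pd.DataFrame(
    [[1, 20]],
    columns=pd.MultiIndex.from_tuples([
        ("Unnamed: 0_level_0", "Rk"),
        ("Passing", "1stD"),
    ]),
)

# label.replace would raise AttributeError if rename passed whole tuples.
renamed = df.rename(columns=lambda label: label.replace("1st", "First"))

print(type(renamed.columns).__name__)
# MultiIndex
print(renamed.columns.tolist())
# [('Unnamed: 0_level_0', 'Rk'), ('Passing', 'FirstD')]

# Flatten afterwards: drop placeholder top levels, join the rest.
flat = [
    lower if upper.startswith("Unnamed:") else f"{upper}, {lower}"
    for upper, lower in renamed.columns
]
print(flat)
# ['Rk', 'Passing, FirstD']
```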