What is a more efficient or "pythonic" way to clean column headers?

Question

I’m pulling some data from the Pro Football Reference website. All of the information pulled fine, but the column headers are a bit messy. I wrote some code to clean them up, but it doesn’t quite feel "right." It seems too repetitive, as I keep reassigning the same variable over and over in the same for loop.

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

# Cleaning column headers
col_headers = []
regex = re.compile('[()'':_0-9]')
for x in df.columns:
    y = (str(x).replace('1st', 'First'))
    y = (y.replace('%', 'Pct'))
    y = regex.sub('', y)
    y = y.strip('Unnamed level, ')
    col_headers.append(y)

In the code above, I collect the cleaned column headers in a list, which I then use to reassign the column names. However, I feel like I’m not approaching the problem efficiently, and I’m wondering if anyone has any advice on how to better structure this part of my code.

Asked By: messy748

Answer 1

Maybe something like this; your example works fine too.

import re

def clean_column_header(column):
    # Assumes each column label is a plain string
    regex = re.compile('[()'':_0-9]')

    # Parentheses let the method chain span several lines
    column = (column.replace("1st", "First")
                    .replace("%", "Pct")
                    .strip("Unnamed level, "))

    return regex.sub("", column)

df = df.rename(columns = {column:clean_column_header(column) for column in df.columns})
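
One caveat about .strip("Unnamed level, "): str.strip treats its argument as a set of characters to remove from both ends, not as a prefix, so labels that end in one of those letters get shortened too. A minimal sketch of the difference follows; drop_unnamed and its pattern are hypothetical names used for illustration, and the pattern assumes placeholders of the form "Unnamed: N_level_M".

```python
import re

# str.strip removes any of the listed characters from both ends,
# so trailing letters such as "m" or "n" are removed as well.
print("Tm".strip("Unnamed level, "))   # T
print("Pen".strip("Unnamed level, "))  # P

# Hypothetical helper: drop only a leading "Unnamed: N_level_M" placeholder
# (and the ", " separator after it, if present).
UNNAMED = re.compile(r"^Unnamed: \d+_level_\d+(, )?")

def drop_unnamed(label):
    return UNNAMED.sub("", label)

print(drop_unnamed("Unnamed: 1_level_0, Tm"))  # Tm
print(drop_unnamed("Passing, Cmp"))            # Passing, Cmp
```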

Answered By: apw-ub

Answer 2

Each column name is a tuple, so take its second element and clean that:

Code:

cols = [x[1] for x in df.columns]
cols = list(map(lambda x: x.replace('1st', 'First'), cols))
cols = list(map(lambda x: x.replace('%', 'Pct'), cols))
print(cols)

Output:

['Rk', 'Tm', 'G', 'PF', 'Yds', 'Ply', 'Y/P', 'TO', 'FL', 'FirstD', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'NY/A', 'FirstD', 'Att', 'Yds', 'TD', 'Y/A', 'FirstD', 'Pen', 'Yds', 'FirstPy', 'ScPct', 'TOPct', 'EXP']
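
The two map passes can also be folded into a single comprehension. A minimal sketch, assuming the columns are (level 0, level 1) tuples as above; REPLACEMENTS and clean_label are illustrative names, not part of pandas.

```python
# Illustrative replacement table; extend it as new patterns appear.
REPLACEMENTS = {"1st": "First", "%": "Pct"}

def clean_label(label):
    """Apply every replacement in REPLACEMENTS to one label."""
    for old, new in REPLACEMENTS.items():
        label = label.replace(old, new)
    return label

# Stand-in for a few entries of df.columns (tuples of level 0 and level 1).
columns = [("Unnamed: 9_level_0", "1stD"), ("Penalties", "1stPy"),
           ("Unnamed: 25_level_0", "Sc%")]

# Keep only the second level of each tuple and clean it in one pass.
cols = [clean_label(level1) for _, level1 in columns]
print(cols)  # ['FirstD', 'FirstPy', 'ScPct']
```

Keeping the replacements in one table means a new rule is a one-line change rather than another pass over the list.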

Answered By: Aaj Kaal

Answer 3

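You can use the .str methods of pd.Index (available on a flat Index, not on a MultiIndex). Recent pandas versions need regex=True for the pattern replacement:

df.columns = (df.columns
                .str.replace('1st', 'First')
                .str.replace('%', 'Pct')
                .str.replace(r'[()'':_0-9]', '', regex=True)
                .str.strip('Unnamed level, '))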
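
For a two-level header like the one scraped here, the levels have to be joined into single strings before .str can be used. A minimal sketch, assuming a pandas version where str.replace accepts the regex keyword; the small columns MultiIndex is a stand-in for the scraped df.columns, and the placeholder pattern is an assumption about how the unnamed levels are labelled.

```python
import pandas as pd

# Stand-in for the scraped header: (level 0, level 1) pairs.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 1_level_0", "Tm"),
    ("Passing", "1stD"),
    ("Unnamed: 25_level_0", "Sc%"),
])

# Join the two levels into one string per column to get a flat Index.
flat = pd.Index([", ".join(pair) for pair in columns])

cleaned = (flat
           .str.replace(r"^Unnamed: \d+_level_\d+, ", "", regex=True)  # drop placeholder prefix
           .str.replace("1st", "First", regex=False)
           .str.replace("%", "Pct", regex=False))

print(list(cleaned))  # ['Tm', 'Passing, FirstD', 'ScPct']
```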

Answered By: ABC

Answer 4

Convert the multi-level column index to a DataFrame and clean each level there.

import numpy as np
import pandas as pd

obj_col = pd.DataFrame(df.columns.tolist())

# blank out level-0 labels that contain 'Unnamed:'
cond = obj_col[0].str.contains('Unnamed:')
obj_col['0_'] = np.where(cond, '', obj_col[0])

# apply the special-case replacements to level 1
obj_col['1_'] = (obj_col[1].str.replace('1st', 'First')
                           .str.replace('%', 'Pct'))

# concat and strip
obj_col['col_name'] = ((obj_col['0_'] + ', ' + obj_col['1_'])
                        .str.strip(', '))

print(obj_col)

Output:

                     0      1            0_       1_            col_name
0    Unnamed: 0_level_0     Rk                     Rk                  Rk
1    Unnamed: 1_level_0     Tm                     Tm                  Tm
2    Unnamed: 2_level_0      G                      G                   G
3    Unnamed: 3_level_0     PF                     PF                  PF
4    Unnamed: 4_level_0    Yds                    Yds                 Yds
5          Tot Yds & TO    Ply  Tot Yds & TO      Ply   Tot Yds & TO, Ply
6          Tot Yds & TO    Y/P  Tot Yds & TO      Y/P   Tot Yds & TO, Y/P
7          Tot Yds & TO     TO  Tot Yds & TO       TO    Tot Yds & TO, TO
8    Unnamed: 8_level_0     FL                     FL                  FL
9    Unnamed: 9_level_0   1stD                 FirstD              FirstD
10              Passing    Cmp       Passing      Cmp        Passing, Cmp
11              Passing    Att       Passing      Att        Passing, Att
12              Passing    Yds       Passing      Yds        Passing, Yds
13              Passing     TD       Passing       TD         Passing, TD
14              Passing    Int       Passing      Int        Passing, Int
15              Passing   NY/A       Passing     NY/A       Passing, NY/A
16              Passing   1stD       Passing   FirstD     Passing, FirstD
17              Rushing    Att       Rushing      Att        Rushing, Att
18              Rushing    Yds       Rushing      Yds        Rushing, Yds
19              Rushing     TD       Rushing       TD         Rushing, TD
20              Rushing    Y/A       Rushing      Y/A        Rushing, Y/A
21              Rushing   1stD       Rushing   FirstD     Rushing, FirstD
22            Penalties    Pen     Penalties      Pen      Penalties, Pen
23            Penalties    Yds     Penalties      Yds      Penalties, Yds
24            Penalties  1stPy     Penalties  FirstPy  Penalties, FirstPy
25  Unnamed: 25_level_0    Sc%                  ScPct               ScPct
26  Unnamed: 26_level_0    TO%                  TOPct               TOPct
27  Unnamed: 27_level_0    EXP                    EXP                 EXP
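
To apply the result, the col_name values can be assigned back to the frame. A minimal sketch of that last step on a small stand-in frame (the three columns and the single data row are made up for illustration); it uses Series.where in place of np.where, which is a substitution on my part rather than the answer's exact code.

```python
import pandas as pd

# Small stand-in frame with a two-level header like the scraped table.
df = pd.DataFrame(
    [[1, 10, 0.5]],
    columns=pd.MultiIndex.from_tuples([
        ("Unnamed: 0_level_0", "Rk"),
        ("Passing", "1stD"),
        ("Unnamed: 25_level_0", "Sc%"),
    ]),
)

obj_col = pd.DataFrame(df.columns.tolist())

# Keep level 0 unless it is an "Unnamed:" placeholder, then blank it out.
level0 = obj_col[0].where(~obj_col[0].str.contains("Unnamed:"), "")
level1 = (obj_col[1].str.replace("1st", "First", regex=False)
                    .str.replace("%", "Pct", regex=False))
obj_col["col_name"] = (level0 + ", " + level1).str.strip(", ")

# Assigning a plain list replaces the MultiIndex with a flat header.
df.columns = obj_col["col_name"].tolist()
print(df.columns.tolist())  # ['Rk', 'Passing, FirstD', 'ScPct']
```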

Answered By: Ferris

Answer 5

pd.DataFrame has a rename method that you can use directly for such cases:

import pandas as pd
import re

# Pulling data
url = 'https://www.pro-football-reference.com/years/2019/opp.htm'
df = pd.read_html(url)[0]

regex = re.compile('[()'':_0-9]')
df.rename(columns = lambda x: regex.sub('',str(x).replace('1st', 'First').replace('%', 'Pct')).strip('Unnamed level, '), inplace=True)

Answered By: Hamza
