Applying a function to all but one column in Pandas

Question:

I’m looking to apply a function to all but one column in pandas, whilst maintaining that column as it is originally. I have a working version that does what I need, but it seems unusually long for something so simple. I’m wondering if there is a better approach…

df = pd.DataFrame(columns = ['firstcol','secondcol','thirdcol'],
     data=[['a1',1,'a6.1'],['b2',3,'b9.3'],['c12',4,'c2']])

My dataframe is made up of strings and integers. Each column is exclusive of a particular type, and I am looking to standardize so that all columns are numerical (integers or floats as I have some decimal values). So, in the toy data above, I need to transform the first and third columns, and leave the second column alone…

df.loc[:, df.columns != 'secondcol'] = df.loc[:, df.columns != 'secondcol'].applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))

For clarity, this line: (1) selects all columns except the one named "secondcol", (2) uses applymap with a lambda to strip every character that is not a digit or a decimal point, and (3) converts the result to a float.

This produces the desired output, but as I say isn’t overly readable. Have I stumbled across the best way to do this, or is there a simpler alternative?

Asked By: DC_Liv


Answers:

We can use df.columns.difference (set nomenclature) to select "all but one column", subset the frame, and apply the operation:

bad_column     = "undesired_column"
other_cols     = df.columns.difference([bad_column])
df[other_cols] = df[other_cols].apply(...)

Note that we need to pass an array-like to .difference; it won’t accept the bad_column string as-is, so we wrap it in a list. If you already have a list of undesired columns, you can pass it directly, e.g.,

bad_columns    = ["bad_1", "bad_7"]
other_cols     = df.columns.difference(bad_columns)
df[other_cols] = df[other_cols].apply(...)

Instead of .apply, you can do other things to that subset of course.
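Applied to the toy frame from the question, the approach above could look like this (a sketch; note the `\d` in the regex, whose backslash is easily lost in formatting):

```python
import re

import pandas as pd

df = pd.DataFrame(
    columns=['firstcol', 'secondcol', 'thirdcol'],
    data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']]
)

# Everything except 'secondcol'
other_cols = df.columns.difference(['secondcol'])

# .apply is column-wise, so map over each Series element-wise
df[other_cols] = df[other_cols].apply(
    lambda col: col.map(lambda x: float(re.sub(r'[^\d.]', '', x)))
)
```

Note that .difference returns its labels sorted, not in frame order, but since the assignment is by label that makes no difference here.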


(answering the original question, which is somewhat more specific than the title reads)

You can select the non-numeric data type columns and replace their values:

non_numerics = df.select_dtypes(exclude="number").columns

df[non_numerics] = df[non_numerics].apply(lambda x: x.str.replace(r"[^\d.]", "", regex=True).astype(float))

where I kept your regex for the replacement, but applied it column-wise with apply rather than element-wise with applymap (you can swap in your applymap line if you prefer),

to get

>>> df

   firstcol  secondcol  thirdcol
0       1.0          1       6.1
1       2.0          3       9.3
2      12.0          4       2.0
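For comparison, the element-wise variant the question used would look like this (a sketch, with the regex written as `[^\d.]`; on pandas ≥ 2.1, DataFrame.map is the non-deprecated name for applymap):

```python
import re

import pandas as pd

df = pd.DataFrame(
    columns=['firstcol', 'secondcol', 'thirdcol'],
    data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']]
)

non_numerics = df.select_dtypes(exclude="number").columns

# Element-wise: the lambda sees each cell value, not a whole column
df[non_numerics] = df[non_numerics].applymap(
    lambda x: float(re.sub(r"[^\d.]", "", x))
)
```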
Answered By: Mustafa Aydın

Readable is in the eye of the beholder. Here’s what I would do:

import re

import pandas

df = pandas.DataFrame(
    columns=['firstcol', 'secondcol', 'thirdcol'],
    data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']]
)

df2 = (
    df.select_dtypes(exclude='number')
      .applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))
      .join(df.select_dtypes(include='number'))
      .loc[:, df.columns.tolist()]  # restore the original column order
)
Answered By: Paul H