How can I 1) change a part of text in values (e.g., ', ' -> '__') and 2) give different values to missing values in Python dataframe?

Question:

I converted a JSON variable to multiple paired variables.
As a result, I have a dataset like

home_city_1       home_number_1  home_city_2          home_number_2   home_city_3   home_number_3  home_city_4      home_number_4
Coeur D Alene, ID   13.0         Hayden, ID           8.0             Renton, WA     2.0           NaN               NaN
Spokane, WA         3.0          Amber, WA            2.0             NaN            NaN           NaN               NaN
Sioux Falls, SD     9.0          Stone Mountain, GA   2.0             Watertown, SD  2.0           Dell Rapids, SD   2.0
Ludowici, GA        11.0         NaN                  NaN             NaN            NaN           NaN               NaN

This data set has 600 columns (300 * 2).

I want to convert the values with those conditions:

  1. Change ‘ ‘ or ‘,’ in the home_city_# column values to ‘_’ (under bar). For example, ‘Sioux Falls, SD’ to ‘Sioux_Falls__SD’
  2. Convert missing values to ‘m’ (missing in home_city_#) or -1 (missing in home_number_#)

I have tried

customer_home_city_json_2 = customer_home_city_json_1.replace(',', '_')

customer_home_city_json_2 = customer_home_city_json_2 .apply(lambda x: x.replace('null', "-1"))
Asked By: Peter Park

||

Answers:

Try

citys = [col  for col in df.columns if 'home_city_' in col]
numbers = [col  for col in df.columns if 'home_number_' in col]

df[citys] = df[citys].replace("s|,", "_", regex=True)
df[citys] = df[citys].fillna('m')
df[numbers] = df[numbers].fillna(-1)

To perform the correct tasks you have to get columns names for ‘home_city_#’ and ‘home_number_#’. This is done in the first two lines.

For replacing " " and "," with "_" I call replace() with regex=True to use regular expressions. s (is a shortcut) and removes all whitespaces, this could be replaced also by .

For filling the NaNs I use fillna and set the wanted value -1 or m. I suggestnot to mix types in a column. Therefor I use -1 for "numbers" and m for citys.

Example

It this is you DataFrame

         home_city_1  home_number_1         home_city_2  home_number_2
0  Coeur D Alene, ID           13.0          Hayden, ID            8.0
1        Spokane, WA            3.0           Amber, WA            2.0
2    Sioux Falls, SD            9.0  Stone Mountain, GA            2.0
3       Ludowici, GA           11.0                 NaN            NaN

the output will be

         home_city_1  home_number_1         home_city_2  home_number_2
0  Coeur_D_Alene__ID           13.0          Hayden__ID            8.0
1        Spokane__WA            3.0           Amber__WA            2.0
2    Sioux_Falls__SD            9.0  Stone_Mountain__GA            2.0
3       Ludowici__GA           11.0                   m           -1.0
Answered By: mosc9575

Considering that df is the name of your dataframe, you can try this :

city_cols = df.filter(regex='^home_city').columns
df[city_cols] = (df[city_cols]
                 .replace('', '-')
                 .replace(',', '-', regex=True)
                 .fillna('m'))

number_cols = df.filter(regex='^home_number').columns
df[number_cols] = df[number_cols].fillna(-1)

By using pandas.DataFrame.filter and regex you can filter by columns that have the same prefix.

Answered By: L'Artiste
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.