Create new column by comparing existing column in a DataFrame

Question:

I have the following DataFrame:

   datetime            day_fetched         col_a col_b 
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200  
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400  
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500  
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700 
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300 
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200 

And I want to create a new column that will take the value 2 if there is a difference in the date between datetime and day_fetched and value 1 if there is no difference.

So my new Dataframe should look like this:

   datetime            day_fetched         col_a col_b day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 100  200    2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120  400    1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 140  500    2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160  700    1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 200  300    2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430  200    1

Then based on the column[‘day_ahead‘], I want to split the col_a and col_b, into col_a_1 and col_a_2 and col_b_1 and col_b_2.

So the final DataFrame will look like this:

   datetime            day_fetched         col_a_1 col_a_2 col_b_1 col_b_2 day_ahead
0  2023-01-02 12:00:00 2023-01-01 12:00:00 NaN     200     NaN     200     2
1  2023-01-02 12:00:00 2023-01-02 12:00:00 120     NaN     100     NaN     1
2  2023-01-03 12:00:00 2023-01-02 12:00:00 NaN     500     NaN     200     2
3  2023-01-03 12:00:00 2023-01-03 12:00:00 160     NaN     100     NaN     1
4  2023-01-04 12:00:00 2023-01-03 12:00:00 NaN     300     NaN     200     2
5  2023-01-04 12:00:00 2023-01-04 12:00:00 430     NaN     100     NaN     1
Asked By: Pythoneer

||

Answers:

One solution is to use np.where:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=
[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100,  200],
["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120,  400],
["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140,  500],
["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160,  700],
["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200,  300],
["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430,  200]],
columns=["datetime","day_fetched","col_a","col_b"])

# days ahead
df["day_ahead"] = np.where(df["datetime"] == df["day_fetched"], 1, 2)
# column of None's for next section
df["na"] = None
# overwrite dataframe with new df
df = pd.DataFrame(data=np.where(df["day_ahead"] == 1,
                                [df["datetime"], df["day_fetched"],
                                 df["col_a"], df["na"],
                                 df["col_b"], df["na"],
                                 df["day_ahead"]],
                                [df["datetime"], df["day_fetched"],
                                 df["na"], df["col_a"],
                                 df["na"], df["col_b"],
                                 df["day_ahead"]]).T,
                  columns=["datetime", "day_fetched",
                           "col_a_1", "col_a_2",
                           "col_b_1", "col_b_2",
                           "day_ahead"])

df
#               datetime          day_fetched col_a_1  ... col_b_1 col_b_2 day_ahead
# 0  2023-01-02 12:00:00  2023-01-01 12:00:00    None  ...    None     200         2
# 1  2023-01-02 12:00:00  2023-01-02 12:00:00     120  ...     400    None         1
# 2  2023-01-03 12:00:00  2023-01-02 12:00:00    None  ...    None     500         2
# 3  2023-01-03 12:00:00  2023-01-03 12:00:00     160  ...     700    None         1
# 4  2023-01-04 12:00:00  2023-01-03 12:00:00    None  ...    None     300         2
# 5  2023-01-04 12:00:00  2023-01-04 12:00:00     430  ...     200    None         1

# [6 rows x 7 columns]

When asking a question please provide data that can be easily copied, such as by using df.to_dict().

EDIT – Generalised for many columns

Here is a (more complicated) bit of code that uses a list comprehension to pivot based on the value of day_ahead for each col_ and concatenates these to produce the same result:

df = pd.concat(
    [df.pivot_table(index=[df.index, "datetime", "day_fetched"],
                    columns=["day_ahead"],
                    values=x).add_prefix(x+"_") for x in 
     df.columns[df.columns.str.startswith("col_")]] + 
        [df.set_index([df.index, "datetime", "day_fetched"])["day_ahead"]],
        axis=1).reset_index(level=[1, 2])

The second, third and fourth lines above create the pivot table and adds the column name and "_" as a prefix, and this is a list comprehension for each column in df that starts with "col_" (fifth line). The sixth and seventh lines add the day_ahead column at the end of the DataFrame. The eighth line resets the index so that datetime and day_fetched are columns.

Answered By: Rawson