How to speed up for loops in dataframe

Question:

I want to convert a data frame to the format I want by scanning each latitude and longitude in the for loop, but this process takes too long. Is there a way to make the following script faster, such as using multi threads or processing? Can you show me how?

p=0
for i in tqdm(df_wind_monthly["lat"]):
    for j in df_wind_monthly["lon"]:
        print("lat: " + str(i) + " lon: " + str(j))
        for k in range(1948,2017): 
            rslt_df_wind = df_wind_monthly.loc[(df_wind_monthly['lat'] == i) 
                                          & (df_wind_monthly['lon'] == j) 
                                              & (df_wind_monthly['year'] == k)]; rslt_df_wind = rslt_df_wind.reset_index()
            month_columns.loc[p,"lat"]=rslt_df_wind.loc[0,"lat"]
            month_columns.loc[p,"lon"]=rslt_df_wind.loc[0,"lon"]
            month_columns.loc[p,"years"]=rslt_df_wind.loc[0,"year"]
            month_columns.loc[p,"wind_January"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="January"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_February"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="February"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_March"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="March"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_April"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="April"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_May"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="May"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_June"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="June"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_July"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="July"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_August"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="August"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_September"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="September"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_October"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="October"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_November"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="November"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_December"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="December"].index.tolist()[0],"wind"]
            p+=1

These are the inputs and expected output samples:

            #df_wind_monthly (INPUT):
                         time     lat     lon      wind
              0       1948-01-16  15.125  15.125  6.509021
              1       1948-01-16  15.125  15.375  6.485108
              2       1948-01-16  15.125  15.625  6.472615
              3       1948-01-16  15.125  15.875  6.472596
              4       1948-01-16  15.125  16.125  6.486597

             #Expected dataframe columns (OUTPUT):
             month_columns=pd.DataFrame(columns=['lat',"lon","years","wind_January","wind_February","wind_March","wind_April","wind_May","wind_June","wind_July","wind_August","wind_September","wind_October","wind_November","wind_December"])
Asked By: fyec

||

Answers:

Looping is definetly NOT the way to go. If you type for ... in while using pandas DataFrame, you’re almost always doing it wrong.

What you want is to switch your data from long format (1 row = 1 observation) to wide format (1 row = 12 observations). It is a fairly common usecase, so pandas provides a method to do just that: DataFrame.pivot.

Starting from your input dataframe, you need to:

  • Add a year and month column from the time column
  • Select only the years you want (1948-2016)
  • Optionaly, if there is more than one measurement by month, compute the average, using groupby and mean
  • Switch from wide to long format with pivot.
  • Rename the columns

Input

>>> df_wind_monthly

             time     lat     lon   wind
0      1948-01-31  15.125  15.125  5.963
1      1948-01-31  15.125  15.375  6.404
2      1948-01-31  15.125  15.625  6.207
3      1948-01-31  15.125  15.875  6.972
4      1948-01-31  15.125  16.125  6.299
...           ...     ...     ...    ...
86395  2019-12-31  17.375  16.375  6.514
86396  2019-12-31  17.375  16.625  6.593
86397  2019-12-31  17.375  16.875  6.438
86398  2019-12-31  17.375  17.125  6.394
86399  2019-12-31  17.375  17.375  6.232

Processing

out_df = df_wind_monthly

# Create a DateTime index from the date column to easily extract year / month
date_index = pd.DatetimeIndex(out_df["time"])

# Create a year and month columns to select the years and perform the groupby
out_df = out_df.assign(year=date_index.year, month=date_index.month)

# Select the years you want
out_df = out_df[(out_df["year"] >= 1948) & (out_df["year"] < 2017)]

# If there are multiple measurement for a given year / month, compute the average
# Skip this if you know there is only one measurement per month
out_df = out_df.groupby(["lat", "lon", "year", "month"]).mean().reset_index()

# Switch from long to wide format
out_df = out_df.pivot(index=["lat", "lon", "year"], columns="month", values="wind")

# Rename the columns
out_df = out_df.rename(
    columns={
        1: "wind_January",
        2: "wind_February",
        3: "wind_March",
        4: "wind_April",
        5: "wind_May",
        6: "wind_June",
        7: "wind_July",
        8: "wind_August",
        9: "wind_September",
        10: "wind_October",
        11: "wind_November",
        12: "wind_December",
    }
)

# Reset Index if you prefer to have the data in columns
out_df = out_df.reset_index()

Output:

>>> out_df

month     lat     lon  year  wind_January  ...  wind_September  wind_October  wind_November  wind_December      
0      15.125  15.125  1948         5.963  ...           6.885         6.814          6.131          6.063      
1      15.125  15.125  1949         6.304  ...           6.178         6.536          6.426          6.090      
2      15.125  15.125  1950         6.207  ...           6.890         6.719          6.875          5.925      
3      15.125  15.125  1951         6.100  ...           6.153         6.905          6.034          6.470      
4      15.125  15.125  1952         5.951  ...           6.638         6.294          6.434          5.936      
...       ...     ...   ...           ...  ...             ...           ...            ...            ...      
6895   17.375  17.375  2012         6.674  ...           6.841         6.383          6.685          6.616      
6896   17.375  17.375  2013         6.674  ...           6.469         6.940          6.842          6.154      
6897   17.375  17.375  2014         6.794  ...           6.251         6.267          6.258          5.942      
6898   17.375  17.375  2015         6.760  ...           6.933         6.848          6.765          6.446      
6899   17.375  17.375  2016         6.253  ...           6.986         6.490          6.421          6.338