Updating a value in a pandas dataframe in an iterrows loop

Question:

I am doing some geocoding work that I used selenium to screen scrape the x-y coordinate I need for address of a location, I imported an xls file to pandas dataframe and want to use an explicit loop to update the rows which do not have the x-y coordinate, like below:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi
        dict_temp = geocoding(target)
        row.wgs1984_latitude = dict_temp['lat']
        row.wgs1984_longitude = dict_temp['long']

I have read Why doesn't this function "take" after I iterrows over a pandas DataFrame? and am fully aware that iterrows only gives us a view rather than a copy for editing, but what if I really to update the value row by row? Is lambda feasible?

Asked By: lokheart

||

Answers:

The rows you get back from iterrows are copies that are no longer connected to the original data frame, so edits don’t change your dataframe. Thankfully, because each item you get back from iterrows contains the current index, you can use that to access and edit the relevant row of the dataframe:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi        
        dict_temp = geocoding(target)
        rche_df.loc[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.loc[index, 'wgs1984_longitude'] = dict_temp['long']

In my experience, this approach seems slower than using an approach like apply or map, but as always, it’s up to you to decide how to make the performance/ease of coding tradeoff.

Answered By: Marius

Another way based on this question:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi        
        dict_temp = geocoding(target)
        
        rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']

This link describe difference between .loc and .at. Shortly, .at faster than .loc.

Answered By: Alireza Mazochi

1. Use itertuples() instead

Pandas DataFrames are really a collection of columns/Series objects (e.g. for x in df iterates over the column labels), so even if a loop where to be implemented, it’s better if the loop over across columns. iterrows() is anti-pattern to that "native" pandas behavior because it creates a Series for each row, which slows down code so much. A better/faster option is to use itertuples(). It creates namedtuples of each row that you can access either by index or column label. There’s almost no modification to the code in the OP to apply it.

Also (as @Alireza Mazochi mentioned), to assign a value to a single cell, at is faster than loc.

for row in rche_df.itertuples():
#                  ^^^^^^^^^^   <------ `itertuples` instead of `iterrows`
    if isinstance(row.wgs1984_latitude, float):
        target = row.address_chi
        dict_temp = geocoding(target)
        rche_df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
        #       ^^ ^^^^^^^^^  <---- `at` instead of `loc` for faster assignment
        #                           `row.Index` is the row's index, can also use `row[0]`

As you can see, using itertuples() is almost the same syntax as using iterrows(), yet it’s over 6 times faster (you can verify it with a simple timeit test).

2. to_dict() is also an option

One drawback of itertuples() is that whenever there’s a space in a column label (e.g. 'Col A' etc.), it will be mangled when converted into a namedtuple, so e.g. if 'address_chi' was 'address chi', it will not be possible to access it via row.address chi. One way to solve this problem is to convert the DataFrame into a dictionary and iterate over it.

Again, the syntax is almost the same as the one used for iterrows().

for index, row in rche_df.to_dict('index').items():
#                         ^^^^^^^^^^^^^^^^^^^^^^^^  <---- convert to a dict
    if isinstance(row['wgs1984_latitude'], float):
        target = row['address_chi']
        dict_temp = geocoding(target)
        rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']

This method is also about 6 times faster than iterrows() but slightly slower than itertuples() (also it’s more memory-intensive than itertuples() because it creates an explicit dictionary whereas itertuples() creates a generator).

3. Iterate over only the necessary column/rows

The main bottleneck in the particular code in the OP (and in general, why a loop is sometimes necessary in a pandas dataframe) is that the function geocoding() is not vectorized. So one way to make the code much faster is to call it only on the relevant column ('address_chi') and on the relevant rows (filtered using a boolean mask).

Note that creating the boolean mask was necessary only because there was an if-clause in the original code. If a conditional check was not needed, the boolean mask is not needed, so the necessary loop boils down to a single loop over a particular column (address_chi).

# boolean mask to filter only the relevant rows
# this is analogous to if-clause in the loop in the OP
msk = [isinstance(row, float) for row in rche_df['wgs1984_latitude'].tolist()]

# call geocoding on the relevant values 
# (filtered using the boolean mask built above) 
# in the address_chi column
# and create a nested list
out = []
for target in rche_df.loc[msk, 'address_chi'].tolist():
    dict_temp = geocoding(target)
    out.append([dict_temp['lat'], dict_temp['long']])

# assign the nested list to the relevant rows of the original frame
rche_df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out

This method is about 40 times faster than iterrows().


A working example and performance test

def geocoding(x):
    return {'lat': x*2, 'long': x*2}


def iterrows_(df):
    
    for index, row in df.iterrows():
        if isinstance(row.wgs1984_latitude, float):
            target = row.address_chi        
            dict_temp = geocoding(target)
            df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[index, 'wgs1984_longitude'] = dict_temp['long']
    
    return df


def itertuples_(df):
    
    for row in df.itertuples():
        if isinstance(row.wgs1984_latitude, float):
            target = row.address_chi
            dict_temp = geocoding(target)
            df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
        
    return df


def to_dict_(df):
    
    for index, row in df.to_dict('index').items():
        if isinstance(row['wgs1984_latitude'], float):
            target = row['address_chi']
            dict_temp = geocoding(target)
            df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[index, 'wgs1984_longitude'] = dict_temp['long']
            
    return df


def boolean_mask_loop(df):

    msk = [isinstance(row, float) for row in df['wgs1984_latitude'].tolist()]

    out = []
    for target in df.loc[msk, 'address_chi'].tolist():
        dict_temp = geocoding(target)
        out.append([dict_temp['lat'], dict_temp['long']])

    df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
    
    return df


df = pd.DataFrame({'address_chi': range(20000)})
df['wgs1984_latitude'] = pd.Series([x if x%2 else float('nan') for x in df['address_chi'].tolist()], dtype=object)


%timeit itertuples_(df.copy())
# 248 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit boolean_mask_loop(df.copy())
# 38.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_dict_(df.copy())
# 289 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit iterrows_(df.copy())
# 1.57 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By: cottontail
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.