Is there a Pythonic way to get the key associated with a failed dict or pandas row assignment?

Question:

I have a data pipeline where we process data as a pandas DataFrame. We need to do a bunch of operations on each row, where some operations on one column depend on the values in other columns, so we use pandas apply, similar to:

import pandas as pd

def check_row(row):
    if row['C'] == 'foo':
        row['B'] += row['A']

    row['C'] = row['C'].zfill(5)

    # etc.

    return row

df = pd.DataFrame([(1, 2, 'foo'),
                   (4, 5, 'bar'),
                   (7, 8, 'baz')],
                  columns=['A', 'B', 'C'])

df.apply(check_row, axis=1)

Sometimes the data that comes into this pipeline does not satisfy our assumptions, and exceptions are generated (e.g. a non-string value in column C). I would like to catch these exceptions and flag these rows.

Currently we wrap the entire check_row function in a try-except block and note those rows as problems (there are many such assignments). However, we lose track of which assignment actually failed. Is there a more Pythonic way to catch the specific assignment, other than wrapping each one in its own try-except? This feels ugly:

def check_row(row):
    try:
        if row['C'] == 'foo':
            row['B'] += row['A']
    except Exception as e:
        row['errors'] = f"Failed to assign to B: {repr(e)}"

    try:
        row['C'] = row['C'].zfill(5)
    except Exception as e:
        row['errors'] = f"Failed to assign to C: {repr(e)}"
    
    #etc
    
    return row

I thought about something like:

def assign(column, value):
    """inside the scope of check_row"""
    try:
        row[column] = value
    except Exception as e:
        row['errors'] = f"Failed to assign to {column}: {repr(e)}"

But of course it’s the calculation of value that’s failing, not the actual assignment, so this doesn’t quite do it. Any ideas?

Asked By: daddydan

||

Answers:

It seems like you don’t actually need the key itself, just some way to determine which part failed. And it’s not necessarily the assignment that’s failing either. So how about using the traceback?

Exception objects contain a reference to their traceback, so if you keep them around, you can refer back to it later. Here’s an example:

df = pd.DataFrame(
    [(1, 2, 'foo'),
     (4, 5, 'bar'),
     (7, 8, 'baz'),
     (3, 9, 10500)],
    columns=['A', 'B', 'C'])

errors = []

def wrap_check_row(row):
    try:
        return check_row(row)
    except Exception as e:
        errors.append((row, e))

df.apply(wrap_check_row, axis=1)

     A    B      C
0  1.0  3.0  00foo
1  4.0  5.0  00bar
2  7.0  8.0  00baz
3  NaN  NaN   None

Then afterwards, we can see which rows failed and the relevant exception:

import sys
import traceback

for row, exc in errors:
    print('Failed index:', row.name, file=sys.stderr)
    traceback.print_exception(type(exc), exc, exc.__traceback__)

Failed index: 3
Traceback (most recent call last):
  File "<ipython-input-8-bc47ffa7c200>", line 6, in wrap_check_row
    return check_row(row)
  File "<ipython-input-3-ceb5d79bdcf5>", line 5, in check_row
    row['C'] = row['C'].zfill(5)
AttributeError: 'int' object has no attribute 'zfill'
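If you would rather have the failing statement as data than as printed output, for example to flag rows the way the question describes, the same traceback can be summarized with traceback.extract_tb. This is only a minimal sketch, assuming the check_row from the question; the helper name failing_statement and the errors column layout are made up for illustration.

```python
import traceback

import pandas as pd


def check_row(row):
    if row['C'] == 'foo':
        row['B'] += row['A']
    row['C'] = row['C'].zfill(5)
    return row


def failing_statement(exc, func):
    """Return the deepest traceback frame that belongs to func (hypothetical helper).

    Falls back to the innermost frame if func does not appear in the traceback.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    own = [f for f in frames if f.name == func.__name__]
    return own[-1] if own else frames[-1]


def wrap_check_row(row):
    row = row.copy()  # work on a copy rather than the Series pandas hands in
    try:
        row = check_row(row)
        row['errors'] = None
    except Exception as e:
        frame = failing_statement(e, check_row)
        # frame.line is the source text of the failing statement; it may be
        # empty when the source is not available to the traceback module.
        row['errors'] = f"{frame.name} line {frame.lineno}: {frame.line} ({e!r})"
    return row


df = pd.DataFrame(
    [(1, 2, 'foo'), (4, 5, 'bar'), (7, 8, 'baz'), (3, 9, 10500)],
    columns=['A', 'B', 'C'])

result = df.apply(wrap_check_row, axis=1)
print(result)
```

Rows that pass get an empty errors entry, and the failing row keeps its original values plus a message naming the statement inside check_row that raised. Filtering by function name matters because some exceptions (a KeyError from a missing column, say) are raised deep inside pandas, so the innermost frame alone would not point at your own code.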

Side note: in Python 3.10 and later, you can simplify the print_exception call to traceback.print_exception(exc), because the exception object can be passed on its own as the first argument.
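If the pipeline has to run on interpreters both older and newer than 3.10, one option is to branch on the version. The sketch below uses traceback.format_exception, which returns the text instead of printing it; the helper name format_exc_compat is hypothetical.

```python
import sys
import traceback


def format_exc_compat(exc):
    """Return the formatted traceback of exc as a single string."""
    if sys.version_info >= (3, 10):
        # Since 3.10 the exception object alone is enough.
        lines = traceback.format_exception(exc)
    else:
        # Older versions need the (type, value, traceback) triple.
        lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return ''.join(lines)


try:
    (10500).zfill(5)  # same failure as the non-string value in column C
except AttributeError as e:
    text = format_exc_compat(e)

print(text, file=sys.stderr)
```

The three-argument form still works on newer versions, so the branch is optional; it just shows both spellings side by side.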

Answered By: wjandrea