Is there a Pythonic way to get the key associated with a failed dict or pandas row assignment?
Question:
I have a data pipeline where we are processing data as a pandas DataFrame. We need to do a bunch of operations on each row, where some operations on one column depend on the values in other columns, so we use pandas apply, similar to:
import pandas as pd

def check_row(row):
    if row['C'] == 'foo':
        row['B'] += row['A']
    row['C'] = row['C'].zfill(5)
    # etc
    return row

df = pd.DataFrame([(1, 2, 'foo'),
                   (4, 5, 'bar'),
                   (7, 8, 'baz')],
                  columns=['A', 'B', 'C'])
df.apply(check_row, axis=1)
Sometimes the data that comes into this pipeline does not satisfy our assumptions, and exceptions are generated (e.g. a non-string value in column C). I would like to catch these exceptions and flag these rows.
Currently we wrap the entire check_row function in a try-except block and note those rows as problems (there are many such assignments). However, we lose track of which actual assignment failed. Is there a more pythonic way to catch the specific assignment, other than wrapping each in its own try-except? This feels ugly:
def check_row(row):
    try:
        if row['C'] == 'foo':
            row['B'] += row['A']
    except Exception as e:
        row['errors'] = f"Failed to assign to B: {repr(e)}"
    try:
        row['C'] = row['C'].zfill(5)
    except Exception as e:
        row['errors'] = f"Failed to assign to C: {repr(e)}"
    # etc
    return row
I thought about something like:
def assign(column, value):
    """inside the scope of check_row"""
    try:
        row[column] = value
    except Exception as e:
        row['errors'] = f"Failed to assign to {column}: {repr(e)}"
But of course it's the calculation of value that's failing, not the actual assignment, so this doesn't quite do it. Any ideas?
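For comparison, a variant of that helper which does cover the calculation would take a callable instead of a value, so the expression is only evaluated inside the try block. This is just a sketch under assumptions: the `compute` parameter is hypothetical, and the rows are plain dicts rather than pandas Series (a Series has no `setdefault`, so the error bookkeeping would need adapting).

```python
def assign(row, column, compute):
    """Evaluate compute(row) inside the try block so that a failure in the
    calculation is attributed to the target column."""
    try:
        row[column] = compute(row)
    except Exception as e:
        # Collect messages in a list so a later failure does not
        # overwrite an earlier one.
        row.setdefault('errors', []).append(
            f"Failed to assign to {column}: {e!r}")


def check_row(row):
    if row['C'] == 'foo':
        assign(row, 'B', lambda r: r['B'] + r['A'])
    assign(row, 'C', lambda r: r['C'].zfill(5))
    return row


good = check_row({'A': 1, 'B': 2, 'C': 'foo'})
bad = check_row({'A': 3, 'B': 9, 'C': 10500})  # non-string C
print(good)
print(bad['errors'])
```

The lambdas still add some noise to every assignment, so this trades one kind of ugliness for another.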
Answers:
It seems like you don’t actually need the key itself, just some way to determine which part failed. And it’s not necessarily the assignment that’s failing either. So how about using the traceback?
Exception objects contain a reference to their traceback, so if you keep them around, you can refer back to it later. Here’s an example:
df = pd.DataFrame(
    [(1, 2, 'foo'),
     (4, 5, 'bar'),
     (7, 8, 'baz'),
     (3, 9, 10500)],
    columns=['A', 'B', 'C'])

errors = []

def wrap_check_row(row):
    try:
        return check_row(row)
    except Exception as e:
        # Keep the row and the exception (with its traceback) for later
        errors.append((row, e))

df.apply(wrap_check_row, axis=1)
     A    B      C
0  1.0  3.0  00foo
1  4.0  5.0  00bar
2  7.0  8.0  00baz
3  NaN  NaN   None
Then afterwards, we can see which rows failed and the relevant exception:
import sys
import traceback

for row, exc in errors:
    print('Failed index:', row.name, file=sys.stderr)
    traceback.print_exception(type(exc), exc, exc.__traceback__)
Failed index: 3
Traceback (most recent call last):
  File "<ipython-input-8-bc47ffa7c200>", line 6, in wrap_check_row
    return check_row(row)
  File "<ipython-input-3-ceb5d79bdcf5>", line 5, in check_row
    row['C'] = row['C'].zfill(5)
AttributeError: 'int' object has no attribute 'zfill'
Sidenote: In Python 3.10, it looks like you can simplify the print_exception call to traceback.print_exception(exc).
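If you would rather record the failing location as data instead of printing the whole traceback, the same `__traceback__` attribute can be fed to `traceback.extract_tb`. The following is a minimal sketch, not a definitive implementation: the `failing_frame` helper is a hypothetical name, and plain dicts stand in for the DataFrame rows so the snippet runs without pandas.

```python
import traceback


def check_row(row):
    if row['C'] == 'foo':
        row['B'] += row['A']
    row['C'] = row['C'].zfill(5)
    return row


def failing_frame(exc):
    """Return the innermost traceback entry, i.e. where exc was raised."""
    return traceback.extract_tb(exc.__traceback__)[-1]


rows = [
    {'A': 1, 'B': 2, 'C': 'foo'},
    {'A': 3, 'B': 9, 'C': 10500},  # non-string C raises AttributeError
]

errors = []
for index, row in enumerate(rows):
    try:
        check_row(row)
    except Exception as exc:
        frame = failing_frame(exc)
        # Store function name, line number and message as plain values
        errors.append((index, frame.name, frame.lineno, repr(exc)))

for index, func, lineno, message in errors:
    print(f"row {index}: {func} line {lineno}: {message}")
```

Note that the innermost entry is wherever the exception was actually raised. If the failure happens inside a helper or library function written in Python, you would need to walk the extracted entries backwards until you reach one whose name is check_row.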