What risks are associated with formatting a string before passing it to pandas.eval?

Question

I do a lot of queries on dataframes in pandas. pandas.DataFrame.query takes human-readable strings that represent dataframe operations and then evaluates them with pandas.eval (documented here). As discussed in this question, I’d like to write functions like the ones below that compile the query strings in a way that doesn’t generate flake8 errors. I’ve been told that formatting strings that are about to be eval’d is "an extremely slippery slope" but I’m not sure why.

What’s wrong with formatting query strings this way?

import pandas as pd

def query_column_equality(df, a, b):
    if a in df.columns:
        return df.query('{} == @q'.format(a), local_dict={'q': b})
    else:
        raise KeyError(a)

def query_column_isin(df, a, b):
    if a in df.columns:
        return df.query('{}.isin(@q)'.format(a), local_dict={'q': b})
    else:
        raise KeyError(a)

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

query_value = 20
print(query_column_equality(df, 'b', query_value))

query_iterable = [20]
print(query_column_isin(df, 'b', query_iterable))

Asked By: shortorian

||

Source

Answer 1

I’ve been told that formatting strings that are about to be eval’d is "an extremely slippery slope" but I’m not sure why.

I think that’s generally good advice. I think there are some problems which can only be solved with eval(), but 99% of the code which uses eval() can be more easily and more safely solved with some other method.

I would include this in that 99%.

For example, both of these can be re-written in a safer fashion:

return df.query('{} == @q'.format(a), local_dict={'q': b})
...
return df.query('{}.isin(@q)'.format(a), local_dict={'q': b})

can become:

return df.loc[df[a] == b]
...
return df.loc[df[a].isin(b)]

my question is whether or not the code here is risky – is the check on columns sufficient to run the query safely?

No, I don’t think so. Imagine you load an untrusted CSV using read_csv somewhere in your program. A CSV can legally contain the column import os; os.system('rm -rf *'). If you check that the column you’re querying actually exists in the dataframe, that check will succeed.

There are ways you could stop that, and prevent any vulnerabilities. But ultimately they are not as simple as using alternative methods which are not vulnerable to injection.

Answered By: Nick ODell

What risks are associated with formatting a string before passing it to pandas.eval?

Question:

Answers: