Using wildcard in Python
Question:
I have a df with 3 columns (col1, col2, and col3), and I have a user input that can enter a value for columns, just one, two or three, and of any combination, like (col1 or col1&col2, or col2&col3, etc.). Based on the user input, I need to select rows that have these values.
For example, if I have the following table:
col1 col2 col3
1 3 3
3 4 5
5 2 1
3 4 2
and so on, the user can enter value for col2 only, (4 in this case), then I have to display rows (2 and 4), or if they enter (1) for col1 then only row(1) to display, my logical formula would be like this:
x = input1
y = input2
z = imput3
a = df['col1'] == x and df['col2'] == y and df['col1'] == z
So the x, y and z can be any value based on the input, including nil (nil means all).
Any suggestion on how to write the code for such a formula?
Answers:
Use boolean combinations.
a = ((df['col1'] == x) | (x is None))
& ((df['col2'] == y) | (y is None))
& ((df['col3'] == z) | (z is None))
df1 = df[a]
None
is the wildcard here.
Note that &
and |
are not short-circuiting operators, so this could be expensive. Even when you have a wildcard for a column, it will still compare everything in that column with None
. A better way would be to construct the conditions dynamically.
condition = True
if x is not None:
condition &= df['col1'] == x
if y is not None:
condition &= df['col2'] == y
if z is not None:
condition &= df['col3'] == z
df1 = df[condition]
You could do this more dynamically by creating a dictionary of column names and conditions.
You can have a generic solution that works regardless of how many columns your dataframe has.
You first build a dictionary of column name and lookup values, then you can index the dataframe to find the row numbers containing these values and finally you slice the dataframe to display necessary rows:
import pandas as pd
df = pd.DataFrame(
{
'col1': [1, 3, 5, 3],
'col2': [3, 4, 2, 4],
'col3': [3, 5, 1, 2]
}
)
columns = [f'col{i}' for i in range(1, df.shape[1] + 1)]
query_dict = {}
for c in columns:
try:
query_dict[c] = int(input(f"Enter query for column {c} or Enter to skip: "))
except ValueError:
query_dict[c] = None
rows = []
for k, v in query_dict.items():
rows.extend((df.loc[df[k] == v]).index.values)
df.iloc[rows, :]
I ended up using the following code:
mask = True
if x:
mask &= (df['col1'] == x)
if y:
mask &= (df['col2'] == y)
if z:
mask &= (df['col3'] == z)
result = df[mask]
I have a df with 3 columns (col1, col2, and col3), and I have a user input that can enter a value for columns, just one, two or three, and of any combination, like (col1 or col1&col2, or col2&col3, etc.). Based on the user input, I need to select rows that have these values.
For example, if I have the following table:
col1 col2 col3
1 3 3
3 4 5
5 2 1
3 4 2
and so on, the user can enter value for col2 only, (4 in this case), then I have to display rows (2 and 4), or if they enter (1) for col1 then only row(1) to display, my logical formula would be like this:
x = input1
y = input2
z = imput3
a = df['col1'] == x and df['col2'] == y and df['col1'] == z
So the x, y and z can be any value based on the input, including nil (nil means all).
Any suggestion on how to write the code for such a formula?
Use boolean combinations.
a = ((df['col1'] == x) | (x is None))
& ((df['col2'] == y) | (y is None))
& ((df['col3'] == z) | (z is None))
df1 = df[a]
None
is the wildcard here.
Note that &
and |
are not short-circuiting operators, so this could be expensive. Even when you have a wildcard for a column, it will still compare everything in that column with None
. A better way would be to construct the conditions dynamically.
condition = True
if x is not None:
condition &= df['col1'] == x
if y is not None:
condition &= df['col2'] == y
if z is not None:
condition &= df['col3'] == z
df1 = df[condition]
You could do this more dynamically by creating a dictionary of column names and conditions.
You can have a generic solution that works regardless of how many columns your dataframe has.
You first build a dictionary of column name and lookup values, then you can index the dataframe to find the row numbers containing these values and finally you slice the dataframe to display necessary rows:
import pandas as pd
df = pd.DataFrame(
{
'col1': [1, 3, 5, 3],
'col2': [3, 4, 2, 4],
'col3': [3, 5, 1, 2]
}
)
columns = [f'col{i}' for i in range(1, df.shape[1] + 1)]
query_dict = {}
for c in columns:
try:
query_dict[c] = int(input(f"Enter query for column {c} or Enter to skip: "))
except ValueError:
query_dict[c] = None
rows = []
for k, v in query_dict.items():
rows.extend((df.loc[df[k] == v]).index.values)
df.iloc[rows, :]
I ended up using the following code:
mask = True
if x:
mask &= (df['col1'] == x)
if y:
mask &= (df['col2'] == y)
if z:
mask &= (df['col3'] == z)
result = df[mask]