Creating a new dataframe where a field is blank in the original dataframe
Question:
Using Python3 and Pandas. I am admittedly pretty new and I’m having a hard time searching for an answer to this question.
I have a dataframe that contains lots of information and I’m trying to get a dataframe that is just the items where one specific field in the original is blank.
I have queried my database to get a dataframe I am calling full_df which is all information on all items in the database. I want to now create a new dataframe that selects just the items where one field in full_df is blank.
This is what I’ve tried:
no_rate = full_df[(full_df['rate'] == "")]
Which is returning nothing even though I know for a fact that there are loads of items where ‘rate’ is blank. I expected the dataframe no_rate to be populated with all the items where ‘rate’ is blank.
How do I select those items for this new dataframe?
Answers:
There are a few things you need to do. First of all, is the data type of your rate column a string (object)? df.dtypes will tell you. If it is not, you can't test it against "": comparing a numeric column with "" is False for every row, so the selection comes back empty.
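As a quick illustration of that dtype check, here is a minimal sketch. The small full_df below is a made-up stand-in for the real query result (the 'item' column and the values are assumptions), chosen so that 'rate' comes back as a numeric column with missing values:

```python
import pandas as pd

# Hypothetical stand-in for the real full_df; None becomes NaN in a float column.
full_df = pd.DataFrame({"item": ["a", "b", "c"], "rate": [0.5, None, 0.7]})

# Show the dtype of every column; 'rate' is float64 here, not object.
print(full_df.dtypes)

rate_is_numeric = pd.api.types.is_numeric_dtype(full_df["rate"])

# Comparing a numeric column with "" matches nothing, which is why
# the original selection came back empty.
matches = (full_df["rate"] == "").sum()
print(rate_is_numeric, matches)
```

If the dtype turns out to be object, the comparison against "" can work, provided the blanks really are empty strings and not missing values.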
Second, and more to the point, a way to do a conditional select is by using loc.
So, if your rate column looks like this
df = pd.DataFrame({'Rate': ['good', 'good', 'bad', 'medium', '', 'bad', '', 'good']})
df
     Rate
0    good
1    good
2     bad
3  medium
4        
5     bad
6        
7    good
then you could write
df.loc[df['Rate']==""]
and get
  Rate
4     
6     
which is actually showing you the contents, but since there is nothing in there, it looks like just the row numbers. Let’s add another column to see the results more plainly.
df['Color'] = ['Red', 'Blue', 'Yellow', 'Red', 'Yellow', 'Red', 'Green', 'Blue']
df
     Rate   Color
0    good     Red
1    good    Blue
2     bad  Yellow
3  medium     Red
4          Yellow
5     bad     Red
6           Green
7    good    Blue
and
df.loc[df['Rate'] == ""]
shows
  Rate   Color
4       Yellow
6        Green
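To end up with a separate dataframe, as the question asks, the selection can be assigned to a new name. A minimal sketch using the same example data; the names blank_mask and no_rate are only illustrative, and the .copy() is an optional extra (not part of the answer above) that makes the result independent of df:

```python
import pandas as pd

df = pd.DataFrame({
    "Rate": ["good", "good", "bad", "medium", "", "bad", "", "good"],
    "Color": ["Red", "Blue", "Yellow", "Red", "Yellow", "Red", "Green", "Blue"],
})

# Boolean Series: True where Rate is an empty string.
blank_mask = df["Rate"] == ""

# Keep only the matching rows; .copy() avoids later edits to no_rate
# being treated as edits to a slice of df.
no_rate = df.loc[blank_mask].copy()
print(no_rate)
```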
So, what if your rate is actually a number?
df['Decimal_Rate'] = [.8, .8, .3, .6, np.nan, .2, np.nan, .9]
df
     Rate   Color  Decimal_Rate
0    good     Red           0.8
1    good    Blue           0.8
2     bad  Yellow           0.3
3  medium     Red           0.6
4          Yellow           NaN
5     bad     Red           0.2
6           Green           NaN
7    good    Blue           0.9
If you want to isolate the empty cells in a numeric column, you can do it like this:
df.loc[df['Decimal_Rate'].isna()]
which results in
  Rate   Color  Decimal_Rate
4       Yellow           NaN
6        Green           NaN
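Data pulled from a database often mixes both kinds of "blank": missing values (NULL, which pandas shows as None or NaN) and empty or whitespace-only strings. If it is unclear which one the rate column holds, the two checks can be combined. This is only a sketch under that assumption; the sample full_df and its 'item' column are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing real values, empty strings, whitespace, and missing values.
full_df = pd.DataFrame({
    "item": ["a", "b", "c", "d", "e", "f"],
    "rate": ["0.8", "", None, "   ", np.nan, "0.2"],
})

rate = full_df["rate"]

# True where the value is missing (None or NaN).
is_missing = rate.isna()

# True where the value, read as text, is empty after trimming whitespace.
is_blank_text = rate.astype(str).str.strip() == ""

# A row counts as blank if either condition holds.
no_rate = full_df.loc[is_missing | is_blank_text].copy()
print(no_rate)
```

Checking rate.isna().sum() and (rate == "").sum() separately first shows which kind of blank the column actually contains.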