Missing rows in dataframe
Question:
I am trying to create a data frame that is a subset of the original, based on specific values in a column, but it keeps excluding some of the data, specifically codes 59960, 59961, 59962.
I have also confirmed, using .unique(), that the column includes the values I am filtering for.
Here is my code:
new_df = original_df[(original_df["Course Offering Code"] == 19191)|
(original_df["Course Offering Code"] == 2201.20215)|
(original_df["Course Offering Code"] == 2387.2205)|
(original_df["Course Offering Code"] == 2388.20225)|
(original_df["Course Offering Code"] == 59960.20211)|
(original_df["Course Offering Code"] == 59961.20211)|
(original_df["Course Offering Code"] == 59962.20211)|
(original_df["Course Offering Code"] == 61199.20211)|
(original_df["Course Offering Code"] == 61201.20211)|
(original_df["Course Offering Code"] == 61202.20211)]
Thank you!
Answers:
Try isin with a list of codes instead of chaining == comparisons:
codes = [19191, 2201, 2387, 59960, 59961, 59962, 61199, 61201, 61202]
new_df = original_df[original_df['Course Offering Code'].isin(codes)]
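One caveat, assuming the column really holds floats such as 59960.20211 (as the comparisons in the question suggest): isin tests for exact equality, so integer codes would only match rows whose value is a whole number. Below is a minimal sketch that matches on the integer part instead. The sample data is made up for illustration, and it assumes the integer part alone identifies the course.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for original_df.
original_df = pd.DataFrame(
    {"Course Offering Code": [19191.0, 2201.20215, 59960.20211, 59961.20211, 12345.20211]}
)

codes = [19191, 2201, 59960, 59961]

# Drop the fractional part of each value, then test membership.
# Note: astype("int64") raises if the column contains NaN, so fill or drop those first.
int_part = np.floor(original_df["Course Offering Code"]).astype("int64")
new_df = original_df[int_part.isin(codes)]
```

If the fractional part also matters (it looks like it might encode a term), this approach would select too many rows, and the full value should be compared instead.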
This is due to float comparisons: testing floats for exact equality is not reliable, because a stored value can differ from the literal you type in digits that are not displayed.
You will have to either round the values or use an approximate ("close") comparison. Having said that, it looks like course offering codes are just identifiers and might not need to be float64, because technically a code can be represented by any unique value. You can instead convert the Course Offering Code column to str and select on the strings, where you won't run into these problems.
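The three options can be sketched as follows. This is an illustration under assumptions, not the asker's actual data: the sample frame is hypothetical, the second value deliberately carries extra digits beyond the five decimals shown, and the tolerance of 1e-6 and the five-decimal format are guesses based on the codes in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second value has hidden digits past the 5th decimal.
original_df = pd.DataFrame(
    {"Course Offering Code": [19191.0, 59960.2021100001, 59961.20211, 70000.20211]}
)
targets = [19191, 59960.20211, 59961.20211]
col = original_df["Course Offering Code"]

# Exact equality misses the value with hidden digits.
exact_hits = (col == 59960.20211).sum()

# Option 1: round both sides to the same number of decimals, then compare.
rounded_match = col.round(5).isin([round(t, 5) for t in targets])

# Option 2: tolerance-based comparison with np.isclose (absolute tolerance only).
close_match = np.zeros(len(col), dtype=bool)
for t in targets:
    close_match |= np.isclose(col, t, rtol=0, atol=1e-6)

# Option 3: treat the codes as text with a fixed number of decimals.
as_text = col.map(lambda v: f"{v:.5f}")
text_match = as_text.isin(["19191.00000", "59960.20211", "59961.20211"])

new_df = original_df[close_match]
```

If the data comes from a CSV file, passing dtype={"Course Offering Code": str} to pd.read_csv would keep the codes as text from the start, which avoids the float conversion entirely.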