Missing rows in dataframe

Question:

I am trying to create a data frame that is a subset of the original, based on specific values in a column, but it keeps excluding some of the data, specifically codes 59960, 59961, and 59962.

I have also confirmed, using .unique(), that the column includes the identifiers I am filtering for.

Here is my code:

new_df = original_df[(original_df["Course Offering Code"] == 19191) |
                     (original_df["Course Offering Code"] == 2201.20215) |
                     (original_df["Course Offering Code"] == 2387.2205) |
                     (original_df["Course Offering Code"] == 2388.20225) |
                     (original_df["Course Offering Code"] == 59960.20211) |
                     (original_df["Course Offering Code"] == 59961.20211) |
                     (original_df["Course Offering Code"] == 59962.20211) |
                     (original_df["Course Offering Code"] == 61199.20211) |
                     (original_df["Course Offering Code"] == 61201.20211) |
                     (original_df["Course Offering Code"] == 61202.20211)]

Thank you!

Asked By: Kevin


Answers:

Try it like this instead…

codes = [19191, 2201, 2387, 2388, 59960, 59961, 59962, 61199, 61201, 61202]
new_df = original_df[original_df['Course Offering Code'].isin(codes)]
Answered By: BeRT2me

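It is due to float comparisons, which are not exact: the value stored in a float64 column can differ slightly from the number you see printed, so an == test against the printed number can fail.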
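
To make that concrete, here is a small sketch. The value `stored` is made up for illustration (it is not taken from the asker's data): it prints as 59960.20211 at five decimals but carries a tiny error further out, which is enough to make an exact comparison fail.

```python
import pandas as pd

# Hypothetical value: equal to 59960.20211 when formatted to five decimals,
# but carrying a tiny error beyond the digits that are shown.
stored = 59960.20211 + 1e-9
s = pd.Series([stored], name="Course Offering Code")

# The formatted text matches the code we are looking for...
looks_same = f"{stored:.5f}" == "59960.20211"

# ...but the exact float comparison does not.
exact_match = bool((s == 59960.20211).any())

print(looks_same, exact_match)
```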

You will have to either round the values or use a tolerance-based ("close") comparison. Having said that, it looks like the course offering codes are just identifiers and might not need to be float64, because technically a code can be represented by any unique value. You can therefore convert the Course Offering Code column to str and select on the strings instead, which avoids these problems.
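
A rough sketch of those options, under some assumptions: the frame, its values, and the CSV text below are invented for illustration, and five decimals is assumed to be the precision the codes are written with. Note that `np.isclose` defaults to a relative tolerance of 1e-5, which at this magnitude is about 0.6 and would match neighbouring codes, so the sketch sets `rtol=0` and an explicit absolute tolerance. For the string route, converting an already-noisy float column with `astype(str)` would keep the noise, so the sketch assumes the data can be re-read with the column typed as str.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: stored values carry tiny errors beyond the printed digits.
original_df = pd.DataFrame({
    "Course Offering Code": [19191.0, 59960.20211 + 1e-9, 59961.20211 - 1e-9, 70000.5],
})
codes = [19191, 59960.20211, 59961.20211]
col = original_df["Course Offering Code"]

# Exact matching only finds the row whose stored value is exact.
exact = original_df[col.isin(codes)]

# Option 1: round both sides to the precision the codes are written with.
rounded = original_df[col.round(5).isin(np.round(codes, 5))]

# Option 2: tolerance comparison of every row against every code.
# rtol=0 so that only the absolute tolerance (1e-6) applies.
mask = np.isclose(col.to_numpy()[:, None], codes, rtol=0, atol=1e-6).any(axis=1)
close = original_df[mask]

# Option 3: treat the codes as text from the start (assumes a CSV source).
csv_text = "Course Offering Code,Title\n19191,A\n59960.20211,B\n70000.5,C\n"
df_str = pd.read_csv(io.StringIO(csv_text), dtype={"Course Offering Code": str})
as_text = df_str[df_str["Course Offering Code"].isin(["19191", "59960.20211"])]

print(len(exact), len(rounded), len(close), len(as_text))
```

If the codes really are identifiers, the string version is probably the least fragile, since no arithmetic ever touches them.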

Answered By: the_ordinary_guy