Select rows from a DataFrame based on the type of the object (i.e. str)

Question:

Say there’s a DataFrame:

>>> df = pd.DataFrame({
...                 'A':[1,2,'Three',4],
...                 'B':[1,'Two',3,4]})
>>> df
       A    B
0      1    1
1      2  Two
2  Three    3
3      4    4

I want to select the rows where the value in a particular column is of type str.

For example, I want to select the row where the value in column A is a str, so it should print something like:

   A      B
2  Three  3

The intuitive code would be something like:

df[type(df.A) == str]

Which obviously doesn’t work!

Thanks, please help!

Asked By: Devi Prasad Khatua


Answers:

You can do something similar to what you’re asking with

In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]: 
       A  B
2  Three  3

Why only similar? Because pandas stores each column with a single dtype. Even though you constructed the DataFrame from heterogeneous types, each column is given the lowest common denominator dtype, which here is object:

In [16]: df.A.dtype
Out[16]: dtype('O')

Consequently, the column's dtype can’t tell you which rows hold which type; it is object for every row. What you can do is try to convert the entries to numbers and check where the conversion failed (this is what the code above does).
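
To see why this works, it may help to look at the intermediate result. A minimal sketch, reusing the DataFrame from the question (the names numeric and mask are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4], 'B': [1, 'Two', 3, 4]})

# Entries that cannot be parsed as numbers become NaN instead of raising.
numeric = pd.to_numeric(df['A'], errors='coerce')

# True exactly where the conversion failed.
mask = numeric.isnull()

print(numeric.tolist())
print(mask.tolist())  # [False, False, True, False]
print(df[mask])
```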

Answered By: Ami Tavory

This works:

df[df['A'].apply(lambda x: isinstance(x, str))]
Answered By: DrTRD

It’s generally a bad idea to use a series to hold mixed numeric and non-numeric types. This will cause your series to have dtype object, which is nothing more than a sequence of pointers, much like a list. Indeed, many operations on such series can be processed more efficiently with a list.
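
As a small illustration of the point about object dtype, the sketch below (using the question's DataFrame; the variable name element_types is just for this example) shows that the column reports a single dtype while each entry remains an ordinary Python object with its own type:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4], 'B': [1, 'Two', 3, 4]})

# The column as a whole has a single dtype: object.
print(df['A'].dtype)

# Each entry is still a plain Python object with its own type.
element_types = [type(value).__name__ for value in df['A']]
print(element_types)  # ['int', 'int', 'str', 'int']
```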

With this disclaimer, you can use Boolean indexing via a list comprehension:

res = df[[isinstance(value, str) for value in df['A']]]

print(res)

       A  B
2  Three  3

The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:

res = df[df['A'].apply(lambda x: isinstance(x, str))]
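
A quick way to convince yourself that the two forms select the same rows is to build both masks and compare them. This is only a sketch on the question's data; the names mask_list, mask_apply and same_mask are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4], 'B': [1, 'Two', 3, 4]})

# Plain Python list of booleans from the list comprehension.
mask_list = [isinstance(value, str) for value in df['A']]

# Boolean Series produced by apply.
mask_apply = df['A'].apply(lambda x: isinstance(x, str))

same_mask = mask_apply.tolist() == mask_list
res_list = df[mask_list]
res_apply = df[mask_apply]

print(same_mask)
print(res_list.equals(res_apply))
```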

If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:

res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]
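
The condition matters. As a hedged illustration with a made-up Series (not the data from the question), a numeric-looking string is converted successfully and None is reported as null, so the two approaches can disagree:

```python
import pandas as pd

# Hypothetical data: a numeric-looking string and a missing value.
s = pd.Series([1, '2', 'Three', None], dtype=object)

# Type check: True only for actual str objects.
is_str = [isinstance(value, str) for value in s]

# Conversion check: True wherever the value could not be made numeric.
not_numeric = pd.to_numeric(s, errors='coerce').isnull().tolist()

print(is_str)       # [False, True, True, False]
print(not_numeric)  # [False, False, True, True]
```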
Answered By: jpp