Select row from a DataFrame based on the type of the object (i.e. str)
Question:
So there’s a DataFrame say:
>>> df = pd.DataFrame({
... 'A':[1,2,'Three',4],
... 'B':[1,'Two',3,4]})
>>> df
A B
0 1 1
1 2 Two
2 Three 3
3 4 4
I want to select the rows where the data in a particular column is of type str.
For example, I want to select the row where the value in column A is a str.
So it should print something like:
A B
2 Three 3
The intuitive code would be something like:
df[type(df.A) == str]
Which obviously doesn't work!
Thanks, please help!
Answers:
You can do something similar to what you’re asking with
In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]:
A B
2 Three 3
Why only similar? Because Pandas stores things in homogeneous columns (all entries in a column are of the same type). Even though you constructed the DataFrame from heterogeneous types, each column is cast to the lowest common denominator:
In [16]: df.A.dtype
Out[16]: dtype('O')
Consequently, you can’t ask which rows are of what type – they will all be of the same type. What you can do is to try to convert the entries to numbers, and check where the conversion failed (this is what the code above does).
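The same trick also gives you the converse selection. A small sketch using the question's DataFrame: `notnull()` keeps the rows where the conversion succeeded, i.e. the numeric entries.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4],
                   'B': [1, 'Two', 3, 4]})

# Rows where A could NOT be coerced to a number -- these are the strings:
strings = df[pd.to_numeric(df.A, errors='coerce').isnull()]

# The converse: rows where the coercion succeeded -- the numeric entries:
numbers = df[pd.to_numeric(df.A, errors='coerce').notnull()]
```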
This works:
df[df['A'].apply(lambda x: isinstance(x, str))]
It's generally a bad idea to use a series to hold mixed numeric and non-numeric types. This will cause your series to have dtype object, which is nothing more than a sequence of pointers, much like list. Indeed, many operations on such series can be processed more efficiently with list.
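To see what dtype object means in practice, you can inspect the per-element types. This is a quick illustrative check, not something you'd do in production code:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 'Three', 4]})

# The column dtype is object -- a generic container of Python objects:
print(df['A'].dtype)

# ...but each element still keeps its own Python type:
print(df['A'].map(type).tolist())
```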
With this disclaimer, you can use Boolean indexing via a list comprehension:
res = df[[isinstance(value, str) for value in df['A']]]
print(res)
A B
2 Three 3
The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:
res = df[df['A'].apply(lambda x: isinstance(x, str))]
If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:
res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]
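One caveat with the to_numeric approach, sketched below with a hypothetical series: a numeric-looking string such as '4' converts successfully, so it is not flagged, whereas isinstance flags every str regardless of content:

```python
import pandas as pd

s = pd.Series([1, 'Two', '4'])

# isinstance flags every string, including the numeric-looking '4':
by_type = [isinstance(v, str) for v in s]

# to_numeric converts '4' to 4.0, so only 'Two' fails the coercion:
by_coerce = pd.to_numeric(s, errors='coerce').isnull().tolist()
```

So the two methods agree only when no string in the column looks like a number.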