Pandas: How to return rows where a column has a line break/newline (\n) in its cell?
Question:
I am trying to return rows if a column contains a line break and a specific word following it. So '\nWord'.
Here is a minimal example
import pandas as pd
testdf = pd.DataFrame([['test1', ' generates the final summary. \nRESULTS We evaluate the performance of '], ['test2', 'the cat and bat \n\n\nRESULTS\n teamed up to find some food'], ['test2', 'anthropology with RESULTS pharmacology and biology']])
testdf.columns = ['A', 'B']
testdf.head()
> A B
>0 test1 generates the final summary. \nRESULTS We evaluate the performance of
>1 test2 the cat and bat \n\n\nRESULTS\n teamed up to find some food
>2 test2 anthropology with RESULTS pharmacology and biology
listStrings = {'\nRESULTS\n'}
testdf.loc[testdf.B.apply(lambda x: len(listStrings.intersection(x.split())) >= 1)]
This returns nothing.
The result I am trying to produce is the first two rows, since they contain '\nRESULTS', but NOT the last row, since it doesn't have a '\nRESULTS'.
So:
> A B
>0 test1 generates the final summary. \nRESULTS We evaluate the performance of
>1 test2 the cat and bat \n\n\nRESULTS\n teamed up to find some food
Answers:
Usually we use str.contains with regex=False:
testdf[testdf.B.str.contains('\n', regex=False)]
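To make the difference from the attempt in the question concrete, here is a minimal sketch using the question's own test frame. It assumes the cells hold real newline characters; the names tokens, has_newline_token, mask and rows_with_newline are only illustrative. Note that this literal test matches any newline, not specifically one followed by RESULTS.

```python
import pandas as pd

testdf = pd.DataFrame(
    [
        ["test1", " generates the final summary. \nRESULTS We evaluate the performance of "],
        ["test2", "the cat and bat \n\n\nRESULTS\n teamed up to find some food"],
        ["test2", "anthropology with RESULTS pharmacology and biology"],
    ],
    columns=["A", "B"],
)

# str.split() with no arguments splits on any whitespace, newlines included,
# so no token can contain "\n" and a set intersection with '\nRESULTS\n' is empty.
tokens = testdf.loc[1, "B"].split()
has_newline_token = any("\n" in token for token in tokens)

# A literal substring test on the whole cell keeps the newline intact.
mask = testdf["B"].str.contains("\n", regex=False)
rows_with_newline = testdf[mask]
print(rows_with_newline)
```

With this data the mask selects the first two rows, because only those cells contain a newline at all.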
You can try the following:
import re
df1 = testdf[testdf['B'].str.contains('\nRESULTS', flags=re.IGNORECASE)]
df1
#output
A B
0 test1 generates the final summary. \nRESULTS We eva...
1 test2 the cat and bat \n\n\nRESULTS\n teamed up to f...
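If the word after the line break varies, the same idea can be wrapped in a small helper. This is only a sketch: rows_with_word_after_newline and its parameters are hypothetical names, and it assumes the word must follow the newline immediately, with no spaces in between. re.escape keeps any regex metacharacters in the word from being interpreted.

```python
import re

import pandas as pd


def rows_with_word_after_newline(df, column, word, ignore_case=False):
    """Return the rows of df where `column` has a newline directly followed by `word`."""
    pattern = r"\n" + re.escape(word)
    flags = re.IGNORECASE if ignore_case else 0
    # na=False treats missing cells as non-matching instead of propagating NaN.
    mask = df[column].str.contains(pattern, flags=flags, regex=True, na=False)
    return df[mask]


testdf = pd.DataFrame(
    [
        ["test1", " generates the final summary. \nRESULTS We evaluate the performance of "],
        ["test2", "the cat and bat \n\n\nRESULTS\n teamed up to find some food"],
        ["test2", "anthropology with RESULTS pharmacology and biology"],
    ],
    columns=["A", "B"],
)

matched = rows_with_word_after_newline(testdf, "B", "RESULTS")
matched_lower = rows_with_word_after_newline(testdf, "B", "results", ignore_case=True)
unmatched_lower = rows_with_word_after_newline(testdf, "B", "results")
```

The third row is excluded because its RESULTS is preceded by a space rather than a newline.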
WeNYoBen's solution is better, but one with iloc and np.where would be:
>>> testdf.iloc[np.where(testdf['B'].str.contains('\n', regex=False))]
A B
0 test1 generates the final summary. \nRESULTS We eva...
1 test2 the cat and bat \n\n\nRESULTS\n teamed up to f...
>>>
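As a rough check that the positional route and plain boolean indexing select the same rows, the sketch below compares the two. It takes the first element of the tuple returned by np.where explicitly; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame(
    [
        ["test1", " generates the final summary. \nRESULTS We evaluate the performance of "],
        ["test2", "the cat and bat \n\n\nRESULTS\n teamed up to find some food"],
        ["test2", "anthropology with RESULTS pharmacology and biology"],
    ],
    columns=["A", "B"],
)

mask = testdf["B"].str.contains("\n", regex=False)

# np.where returns a tuple of arrays; element 0 holds the integer positions of True.
positions = np.where(mask)[0]

by_position = testdf.iloc[positions]  # positional selection
by_mask = testdf[mask]                # boolean selection
same_rows = by_position.equals(by_mask)
```

Boolean indexing avoids the extra conversion to positions, which is one reason the str.contains mask on its own is usually the simpler choice.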
Sometimes the text is messy and contains a mix of \t, \n and \r, and a search for a single one of them does not find every case.
I offer you a regular expression that covers all of them.
Example:
this code keeps all the rows WHERE \t, \n or \r appears in the "Name" column:
df_r = df_r[df_r["Name"].astype(str).str.contains(r"\t|\n|\r", regex=True)]
This answer was inspired by: removing newlines from messy strings in pandas dataframe cells?
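A small self-contained run of that pattern follows. The contents of df_r are made up for illustration (the original answer does not show its data), and na=False is added here as a precaution for missing values.

```python
import pandas as pd

# Hypothetical data: one clean string, three with control characters, one number.
df_r = pd.DataFrame(
    {"Name": ["plain text", "tab\there", "line\nbreak", "carriage\rreturn", 42]}
)

# astype(str) lets the .str accessor handle non-string cells such as 42.
mask = df_r["Name"].astype(str).str.contains(r"\t|\n|\r", regex=True, na=False)
df_control = df_r[mask]
print(df_control.index.tolist())
```

Only the three rows containing a tab, newline or carriage return are kept.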