Pandas delete rows where any column (but 2) does not match a regex

Question:

My dataframe looks like this:

1   07.06.2022  06:55  <NA>      <NA>           3  <NA>   <NA>      5     5    <NA>   <NA>   <NA>         6  <NA>  <NA>
2   07.06.2022  08:20  <NA>         6        <NA>  <NA>      6   <NA>  <NA>    <NA>   <NA>   <NA>         1  <NA>  <NA>
3   07.06.2022  13:00  <NA>      <NA>           2  <NA>      4      6     5       1      5   <NA>         2  <NA>  <NA>
4   08.06.2022  06:40  <NA>      <NA>           3  <NA>   <NA>      1     4    <NA>   <NA>   <NA>         2  <NA>  <NA>
..         ...    ...   ...       ...         ...   ...    ...    ...   ...     ...    ...    ...       ...   ...   ...
79  20.07.2022  08:10  <NA>         2        <NA>  <NA>      5   <NA>  <NA>    <NA>   <NA>    6.0         4     3  <NA>
80  20.07.2022  10:15  <NA>         2        <NA>  <NA>   <NA>   <NA>  <NA>    <NA>   <NA>   <NA>         5  <NA>     3
81  20.07.2022  13:00  <NA>         2           6     5      5   <NA>   1-2    <NA>      6    3.0         3     5   1-1
82  20.07.2022  14:20  <NA>         2        <NA>  <NA>   <NA>   <NA>  <NA>    <NA>   <NA>    6.0         1     5  <NA>
83  21.07.2022  07:00     1      <NA>           6  <NA>   <NA>      4     5    <NA>   <NA>   <NA>         3  <NA>  <NA>

The first two columns are fixed, but after that the number of columns varies.
Sometimes one of these additional columns contains invalid characters (NaN is valid!).

I want to delete those rows, but I don't know how.

I found the following solutions, but none of them matches my case:

df[df["col"].str.contains("this string")==False]

In that case I need to know which column contains (or does not contain) that string. But I have no idea how many columns there are. I could iterate over them all, but that doesn't feel like the right way.
But contains might work with regex.
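To illustrate that last point: `str.contains` does accept a regular expression, and `str.fullmatch` is the stricter variant that requires the whole value to match. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["3", "2-4", "x", "1-2-5"])
# fullmatch requires the entire string to match the pattern,
# so "x" fails while digit/dash chains pass
valid = s.str.fullmatch(r"\d([-]\d)*")
print(valid.tolist())  # [True, True, False, True]
```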

df[df.isin(...)]

This needs specific items that are contained. But I don’t have a list of all valid possibilities.

df.eq(r'\d([-]\d)*')

I tried to do something like:

df = df[df.ne(0).any(axis=1)]

for regexes but it’s not supported.

Merging all columns to a new one

…and use the first approach to delete the rows. It is pretty easy to combine a fixed number of columns:

df['new-col'] = df['col1'] + df['col2'] + ...

I tried

df['new-col'] = ''.join(df.columns[2:])

but that doesn’t work.
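For what it's worth, a row-wise concatenation over an unknown number of columns is possible; a minimal sketch (column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Datum": ["08.06.2022", "09.06.2022"],
    "Zeit": ["06:40", "09:52"],
    "User1": [pd.NA, "2-4"],
    "User2": ["x", "6"],
})

# Concatenate every column after the first two into one string column;
# NA cells are replaced with "" so the join does not choke on them.
df["new-col"] = df.iloc[:, 2:].fillna("").astype(str).agg("".join, axis=1)
print(df["new-col"].tolist())  # ['x', '2-46']
```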

What I have

The regex that should match for each of columns[2:] is: \d([-]\d)* (or the value should be NaN)

Are there any ideas on how to delete rows where any column matches (or does not match, since inverting the regex is easy) a regex?

sample-data as CSV

Datum,Zeit,User1,User2,User3,User4,User5,User6,User7,User8,User9,User10,User11,User12,User13,User14
03.06.2022,10:45,,,,,,6,,5,,6,,,,
07.06.2022,06:55,,,3,,,5,5,,,,6,,,
07.06.2022,08:20,,6,,,6,,,,,,1,,,
07.06.2022,13:00,,,2,,4,6,5,1,5,,2,,,
08.06.2022,06:40,,,3,,,x,4,,,,2,,,
08.06.2022,12:55,,,3,5,,5,6,5,4,,4,,,
09.06.2022,07:00,4,,4,,,2,,,,,x,,,
09.06.2022,09:52,,2-4,,2-6,,2-4,,,,,2-1,,,
09.06.2022,13:00,,4,,2,6,1-2,,1-3,3,,1-5,3,,
10.06.2022,07:00,2,,3,,,3,,,5,,6,,,
10.06.2022,09:50,,4,,,,,,,,,1,,,
10.06.2022,11:00,,6,,,,2,,,,,,,,
13.06.2022,6:30,1,,,,,4,6,,,,,,,
13.06.2022,10:00,,6,,1,,,,,4,,,,,
13.06.2022,13:00,,6,,,,1,4,,5,,,,2,
14.06.2022,8:00,5,,,,,5,4,,,,,,,
15.06.2022,8:00,1,,,,,4,2,,3,,,,,

online runnable sample

This runnable sample contains all the data as a string. But I actually read an xlsx file containing the data, and I don't want to do tricks on the source data. A solution might be to export the xlsx to CSV, load it into memory as a string, run a regex to filter all lines, and only then hand it to pandas, which sounds weird.

Asked By: Sebastian Schumann

Answers:

You want to delete rows that contain 'x' in columns other than the first two.

Use boolean indexing:

df[~df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains('x')).any(axis=1)]

NB: for an exact match, use df.iloc[:, 2:].eq('x') in place of df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains('x'))

output:

         Datum   Zeit  User1 User2  User3 User4  User5 User6  User7 User8  User9  User10 User11  User12  User13  User14
0   03.06.2022  10:45    NaN   NaN    NaN   NaN    NaN     6    NaN     5    NaN     6.0    NaN     NaN     NaN     NaN
1   07.06.2022  06:55    NaN   NaN    3.0   NaN    NaN     5    5.0   NaN    NaN     NaN      6     NaN     NaN     NaN
2   07.06.2022  08:20    NaN     6    NaN   NaN    6.0   NaN    NaN   NaN    NaN     NaN      1     NaN     NaN     NaN
3   07.06.2022  13:00    NaN   NaN    2.0   NaN    4.0     6    5.0     1    5.0     NaN      2     NaN     NaN     NaN
5   08.06.2022  12:55    NaN   NaN    3.0     5    NaN     5    6.0     5    4.0     NaN      4     NaN     NaN     NaN
7   09.06.2022  09:52    NaN   2-4    NaN   2-6    NaN   2-4    NaN   NaN    NaN     NaN    2-1     NaN     NaN     NaN
8   09.06.2022  13:00    NaN     4    NaN     2    6.0   1-2    NaN   1-3    3.0     NaN    1-5     3.0     NaN     NaN
9   10.06.2022  07:00    2.0   NaN    3.0   NaN    NaN     3    NaN   NaN    5.0     NaN      6     NaN     NaN     NaN
10  10.06.2022  09:50    NaN     4    NaN   NaN    NaN   NaN    NaN   NaN    NaN     NaN      1     NaN     NaN     NaN
11  10.06.2022  11:00    NaN     6    NaN   NaN    NaN     2    NaN   NaN    NaN     NaN    NaN     NaN     NaN     NaN
12  13.06.2022   6:30    1.0   NaN    NaN   NaN    NaN     4    6.0   NaN    NaN     NaN    NaN     NaN     NaN     NaN
13  13.06.2022  10:00    NaN     6    NaN     1    NaN   NaN    NaN   NaN    4.0     NaN    NaN     NaN     NaN     NaN
14  13.06.2022  13:00    NaN     6    NaN   NaN    NaN     1    4.0   NaN    5.0     NaN    NaN     NaN     2.0     NaN
15  14.06.2022   8:00    5.0   NaN    NaN   NaN    NaN     5    4.0   NaN    NaN     NaN    NaN     NaN     NaN     NaN
16  15.06.2022   8:00    1.0   NaN    NaN   NaN    NaN     4    2.0   NaN    3.0     NaN    NaN     NaN     NaN     NaN

How it works:

# invert condition
~

# select all but first 2 columns
df.iloc[:, 2:]

# check if the values contain "x"
.apply(lambda c: c.astype(str).str.contains('x'))

# True if any value in the row is True
.any(axis=1)

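The steps above can be run end to end on a small cut-down version of the sample data:

```python
import pandas as pd
from io import StringIO

csv = """Datum,Zeit,User1,User2
08.06.2022,06:40,x,4
08.06.2022,12:55,3,5
09.06.2022,07:00,4,x
"""
df = pd.read_csv(StringIO(csv))

# True where any column after the first two contains "x"
mask = df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains("x")).any(axis=1)
# Invert to keep only the clean rows
out = df[~mask]
print(out)
```

Only the middle row (index 1) survives, since both of the others have an 'x' somewhere after the second column.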
updated answer

keep rows where every cell is either an NA, a number, or a regex match, and nothing else:

# which cells match ^\d(?:[-]\d)*$ ?
m1 = df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains(r'^\d(?:[-]\d)*$'))
# which cells are NA?
m2 = df.iloc[:, 2:].isna()
# which cells are numeric?
m3 = df.iloc[:, 2:].apply(pd.to_numeric, errors='coerce').notna()

# do ALL cells in a row match at least one of above?
mask = (m1|m2|m3).all(axis=1)

# if True, keep the row
out = df[mask]
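Putting the three masks together on a small slice of the sample data (with the backslashes in the pattern written out):

```python
import pandas as pd
from io import StringIO

csv = """Datum,Zeit,User1,User2
07.06.2022,06:55,,3
08.06.2022,06:40,x,4
09.06.2022,09:52,2-4,2-6
"""
df = pd.read_csv(StringIO(csv))
sub = df.iloc[:, 2:]

# cells matching the digit/dash pattern
m1 = sub.apply(lambda c: c.astype(str).str.contains(r'^\d(?:[-]\d)*$'))
# cells that are NA
m2 = sub.isna()
# cells that are plain numbers
m3 = sub.apply(pd.to_numeric, errors='coerce').notna()

# keep rows where every cell satisfies at least one condition
mask = (m1 | m2 | m3).all(axis=1)
out = df[mask]
print(out["Datum"].tolist())  # ['07.06.2022', '09.06.2022']
```

The row with the lone 'x' is dropped; the NA cell and the '2-4'/'2-6' values all pass one of the three masks.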
Answered By: mozway