Sorting rows by the number of list elements the row contains

Question:

Taking as example the following table:

index column_1 column_2
0 bli bli d e
1 bla bla a b c d e
2 ble ble a b c

If I give a token_list = ['c', 'e'] I want to order the table by the number of times the tokens each row contains in column number 2.

By ordering the table I should get the following:

index column_1 column_2 score_tmp
1 bla bla a b c d e 2
0 bli bli d e 1
2 ble ble a b c 1

Currently, I have reached the following way of doing this, but it is taking a lot of time. How could I improve the time? Thank you in advance.

df['score_tmp'] = df[['column_2']].apply(
            lambda x: len([True for token in token_list if
            token in str(x['column_2'])]), axis=1)
results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
Asked By: Pedro Azevedo

||

Answers:

You can split column_2 based on whitespaces, convert each row into a set and then use df.apply with set intersection with sort_values:

In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

In [204]: df.sort_values('matches', ascending=False).drop('matches', 1)
Out[204]: 
   index column_1   column_2
1      1  bla bla  a b c d e
0      0  bli bli        d e
2      2  ble ble      a b c

Timings:

In [208]: def f1():
     ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
     ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
     ...: 

In [209]: def f2():
     ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
     ...:     df.sort_values('matches', ascending=False).drop('matches', 1)
     ...: 

In [210]: %timeit f1() # solution provided in question
2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [211]: %timeit f2() # my solution
1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answered By: Mayank Porwal

Here is another way using str.count()

df.sort_values('column_2',
               key = lambda x: x.str.count('|'.join(token_list)),
               ascending=False)

Using the key parameter of sort_values(), we do not have to make a temporary column to do the sorting.

Output:

   index column_1   column_2
1      1  bla bla  a b c d e
0      0  bli bli        d e
2      2  ble ble      a b c
Answered By: rhug123

Another possible solution:

(df.assign(
     score_tmp =
     [len(np.intersect1d(x,token_list)) for x in df['column_2'].str.split(' ')])
 .sort_values('score_tmp', ascending=False))

Output:

  column_1   column_2  score_tmp
1  bla bla  a b c d e          2
0  bli bli        d e          1
2  ble ble      a b c          1
Answered By: PaulS
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.