What is the difference between pandas str.extractall() and pandas str.extract()?

Question:

I am trying to find all matched words in a column of strings, given a word list. With pandas str.extract() I only get the first matched word. Since I need all the matched words, I thought pandas str.extractall() would work, but I only got a ValueError.

What is the problem here?

 df['findWord'] = df['text'].str.extractall(f"({'|'.join(wordlist)})").fillna('')
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
Asked By: newlearner101

||

Answers:

extract returns the first match. extractall generates one row per match.

For example, let's match A and the letter that follows it.

df = pd.DataFrame({'col': ['ABC', 'ADAE']})
#     col
# 0   ABC
# 1  ADAE

df['col'].str.extractall('(A.)')

This creates a new index level named "match" that holds the match number within each row. Matches from the same row share the same value in the first index level.

Output:

          0
  match    
0 0      AB
1 0      AD
  1      AE
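
To make the two index levels more concrete, here is a small sketch that inspects the result of extractall on the same example frame. The variable name `out` is only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col': ['ABC', 'ADAE']})

# One row per match; the index has two levels: (original row, match number).
out = df['col'].str.extractall('(A.)')

# The first level is unnamed, the second is named "match".
print(list(out.index.names))

# All matches found in original row 1.
print(out[0].loc[1].tolist())

# The first match (match number 0) of every row that matched.
print(out[0].xs(0, level='match').tolist())
```

Selecting on the first level gives everything found in one input row, while selecting on the "match" level gives the n-th match across rows.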

With extract:

df['col'].str.extract('(A.)')

Output:

    0
0  AB
1  AD

Aggregating the output of extractall (here grouped by match number):

(df['col']
 .str.extractall('(A.)')
 .groupby(level='match').agg(','.join)
)

Output:

           0
match       
0      AB,AD
1         AE
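
For the situation in the question, where every matched word should end up in a new column of the original frame, one option is to group on the first index level (the original row) instead. The ValueError in the question most likely comes from assigning the two-level extractall result directly to a column, although that is an assumption. The following is a minimal sketch: the sample `df` and `wordlist` are made up for illustration, and `re.escape` is an added precaution in case the words contain regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical data standing in for the question's df and wordlist.
df = pd.DataFrame({'text': ['apple and banana', 'no fruit here', 'cherry, apple']})
wordlist = ['apple', 'banana', 'cherry']

# One capture group containing all words as alternatives.
pattern = f"({'|'.join(map(re.escape, wordlist))})"

# One row per match, indexed by (original row, match number).
matches = df['text'].str.extractall(pattern)

# Collapse back to one value per original row by grouping on the first level.
joined = matches[0].groupby(level=0).agg(','.join)

# Assignment aligns on the index; rows without any match become NaN, then ''.
df['findWord'] = joined
df['findWord'] = df['findWord'].fillna('')

print(df)
```

Because the grouped result has a plain single-level index again, it aligns with `df` on assignment. Rows with no match are simply absent from `joined`, which is why the fillna('') step is applied after the assignment.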
Answered By: mozway