How to decompose Twitter hashtags into words?

Question:

I’m trying to decompose Twitter hashtags in order to extract the words that compose them. I’m having trouble finding a regular expression that does this satisfactorily, mainly because of the authors’ "excessive creativity" with capitalization.

Some examples:

#itsAHashtag -> ['its', 'a', 'hashtag']
#GlazersOutNOW -> ['glazers', 'out', 'now']
#COVIDIsNotOver -> ['covid', 'is', 'not', 'over']

Is there any library that does this kind of decomposition?

Asked By: marciel.deg

||

Answers:

You could combine splitting on capital letters with a lookup in a set of English words. The english-words module looks promising.
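A minimal sketch of that idea, under stated assumptions: VOCAB is a tiny hypothetical word set standing in for a real English word list (for example one loaded from a package such as english-words), and the names CAMEL, segment and decompose are made up for illustration. The hashtag is first split on capitalization; any chunk that is not a known word is then split again by dictionary lookup, preferring the split with the fewest words.

```python
import re

# Hypothetical mini vocabulary; in practice, load a full English word list.
VOCAB = {"its", "a", "hashtag", "glazers", "out", "now",
         "covid", "is", "not", "over"}

# Capitalization split: acronym before a capitalized word, then ordinary
# (optionally capitalized) words, then leftover capital runs, then digits.
CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def segment(chunk, vocab):
    """Split a lowercase chunk into vocabulary words, or return None."""
    # best[i] holds a split of chunk[:i] with the fewest words, or None.
    best = [None] * (len(chunk) + 1)
    best[0] = []
    for end in range(1, len(chunk) + 1):
        for start in range(end):
            piece = chunk[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(chunk)]


def decompose(hashtag, vocab=VOCAB):
    """Return the lowercase words of a hashtag."""
    words = []
    for chunk in CAMEL.findall(hashtag.lstrip("#")):
        chunk = chunk.lower()
        if chunk in vocab or chunk.isdigit():
            words.append(chunk)
        else:
            # Keep the chunk unchanged when no dictionary split exists.
            words.extend(segment(chunk, vocab) or [chunk])
    return words


print(decompose("#COVIDIsNotOver"))
print(decompose("#itsahashtag"))
```

The dictionary step is what lets this handle tags with no helpful capitalization at all, such as "#itsahashtag"; the quality of the result depends entirely on how complete the word list is.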

Answered By: scenox

Based on the samples you provided, this regex should work for you:

(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)

Check this demo

And let me know if this works. I will add explanation if it works well.
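For reference, one way to apply the pattern above in Python is sketched here. The function name split_hashtag is hypothetical; the pattern itself is copied unchanged from this answer, and lowercasing is added only to match the output format asked for in the question.

```python
import re

# Pattern from the answer: either a run of capitals, or one letter followed
# by as few lowercase letters as possible, up to the next capital or the end.
PATTERN = re.compile(r"(?:[A-Z]+|[a-zA-Z][a-z]+?)(?=[A-Z]|$)")


def split_hashtag(hashtag):
    """Split a hashtag on capitalization and lowercase each piece."""
    # The leading '#' never matches the pattern, so findall skips it.
    return [word.lower() for word in PATTERN.findall(hashtag)]


for tag in ("#itsAHashtag", "#GlazersOutNOW", "#COVIDIsNotOver"):
    print(tag, "->", split_hashtag(tag))
```

The pattern only looks at letters, so hashtags containing digits or underscores are outside what it was written for and may lose pieces.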
