How to split a string on commas or periods in nltk

Question

I want to split a string on commas and/or periods using nltk. I tried sent_tokenize(), but it splits only on periods.

I've also tried this code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

ex_sent = "This is an example showing sentence filtration.This is how it is done, in the case of Python I want to learn more. So, that I can have some experience over it, by it I mean Python."

class CommaPoint(PunktLanguageVars):
    # Treat commas as sentence-ending characters, in addition to the defaults
    sent_end_chars = ('.', '?', '!', ',')

tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
n_w = tokenizer.tokenize(ex_sent)
print(n_w)

The output of the code above is:

['This is an example showing sentence filtration.This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.n']

When a '.' is not followed by a space, the tokenizer treats it as part of a word instead of as a sentence boundary.

The output I want is:

['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']

Asked By: Rejo Varghese

Answer 1

How about something simpler with re:

>>> import re
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
>>> re.split(r'[.,]', sent)
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', '']

To keep the delimiters, wrap the pattern in a capturing group:

>>> re.split(r'([.,])', sent)
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', '']
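
If the goal is the exact list from the question, with each delimiter attached to the chunk before it, one possible sketch uses re.findall instead of re.split. The helper name split_keep_delims is hypothetical, and the pattern assumes that every '.' or ',' is a boundary, so it would also cut decimals such as "3.14" and abbreviations such as "e.g.":

```python
import re

def split_keep_delims(text):
    # Each match is a run of non-delimiter characters,
    # followed by at most one trailing '.' or ','.
    pieces = re.findall(r'[^.,]+[.,]?', text)
    # Drop surrounding whitespace and skip whitespace-only pieces.
    return [p.strip() for p in pieces if p.strip()]

sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
print(split_keep_delims(sent))
```

This does not depend on whitespace after the delimiter, so "filtration.This" is split as well.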

Answered By: alvas

Answer 2

In this case, you could replace all commas with periods in the string and then tokenize it:

from nltk.tokenize import sent_tokenize
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."

# Turn every comma into a period so the sentence tokenizer can split on it
ex_sent = ex_sent.replace(",", ".")
n_w = sent_tokenize(ex_sent, 'english')
print(n_w)
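
As the output in the question shows, a period with no space after it ("filtration.This") may not be treated as a boundary. A minimal sketch of a preprocessing step that inserts the missing space before tokenizing, using only re; the helper name add_missing_spaces is hypothetical, and it assumes the text contains no decimals or dotted abbreviations, which it would also pull apart:

```python
import re

def add_missing_spaces(text):
    # Insert a space after '.' or ',' when the next character is not whitespace.
    # The lookahead leaves a delimiter at the very end of the string untouched.
    return re.sub(r'([.,])(?=\S)', r'\1 ', text)

ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more."
print(add_missing_spaces(ex_sent))
```

The normalized string can then be passed to the tokenizer in place of the original.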

Answered By: Antares
