Replace multiple fullstops with single fullstop

Question:

How do I replace multiple full stops with a single full stop so that the NLTK sentence tokenizer can treat them as separate sentences?

For example:

a = "the food was good...what about the bread huh..Awesome"

If I use

nltk.sent_tokenize(a)

It gives me

['the food was good...what about the bread huh..Awesome']

But what I want is

['the food was good.', 'what about the bread huh.', 'Awesome']

How do I do this?

Asked By: pd176


Answers:

You can do this by using a regex and substitute the occurrences of multiple dots by only a single one as shown below:

#!/usr/bin/env python3
# coding: utf-8

import re

a = "the food was good...what about the bread huh..Awesome"
a_replaced = re.sub(r'\.+', ".", a)  # collapse each run of literal dots into one

Giving you:

'the food was good.what about the bread huh.Awesome'

In addition I’ll give you a small explanation of how this works. re.sub() accepts a regex pattern describing what should be replaced. In our case, this is r'\.+'.

So let’s have a deeper look at this pattern. Since you’re looking for dots, we need to match them. However, in a regex the dot . normally matches any character (except a newline), which is not what we want here. To match a literal dot instead, we escape it by putting a backslash in front of it, giving \. as the pattern.

Since we want to find every run of dots and we don’t know how many dots there will be, we look for ‘one or more’, which we achieve by appending the + quantifier to the escaped dot \. from above.

And there we are, having a working regex: \.+ which we pass as the raw string r'\.+' so that Python leaves the backslash alone and hands it to the regex engine unchanged. Next, as stated in the re.sub() docs, we need to specify the string we want to put in place of each match. This is just a single dot ".", since you want to replace several dots with a single one. The third parameter is your string a, in which the replacements are made.
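To see why the backslash matters, here is a small side-by-side sketch using the same string a from the question. The variable names unescaped and escaped are only for illustration.

```python
import re

a = "the food was good...what about the bread huh..Awesome"

# Unescaped: '.' matches any character, so '.+' matches the entire string
# and the whole thing is replaced by a single dot.
unescaped = re.sub(r'.+', ".", a)

# Escaped: '\.' matches only a literal dot, so each run of dots
# is collapsed into one dot and the rest of the text is untouched.
escaped = re.sub(r'\.+', ".", a)

print(unescaped)  # .
print(escaped)    # the food was good.what about the bread huh.Awesome
```

So without the escape you lose the whole sentence, which is an easy mistake to make when the backslash gets dropped.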

I do not want to advertise anything, but for a quick overview on regex in Python I can suggest this cheat sheet.

Answered By: albert

You could also use re.split for this purpose. It gives you a list directly.

import re

a = "the food was good...what about the bread huh..Awesome"
sr = re.split(r"\.+", a)  # split on every run of one or more literal dots
print(sr)

You get

['the food was good', 'what about the bread huh', 'Awesome']
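Note that splitting this way drops the full stops. If you want to keep a single full stop at the end of each piece, as in the list the question asks for, one possible sketch is to collapse the dots first and then split on the whitespace that follows a dot. This uses only the re module rather than NLTK, and the names collapsed and sentences are just illustrative.

```python
import re

a = "the food was good...what about the bread huh..Awesome"

# Collapse each run of dots into ". " so the pieces are separated by a space.
# strip() removes the trailing space left behind if the text ends with dots.
collapsed = re.sub(r"\.+", ". ", a).strip()

# Split on whitespace that directly follows a full stop; the lookbehind
# keeps the full stop attached to the preceding piece.
sentences = re.split(r"(?<=\.)\s+", collapsed)

print(sentences)
# ['the food was good.', 'what about the bread huh.', 'Awesome']
```

The collapsed string (with a space after each full stop) may also be a friendlier input for a sentence tokenizer than the original, though I have not checked every tokenizer's behaviour on it.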

Cheers!

Answered By: sameera sy