Replace spaces with non-breaking spaces according to a specific criterion

Question:

I want to clean up badly formatted files. More precisely, I want to replace "normal" spaces with non-breaking spaces according to a given criterion.

For example:

If in a sentence, I have:

"You need to walk 5 km."

I need to replace the space between 5 and km with a non-breaking space.

So far, I have managed to do this:

import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# iterate and read all files in the directory
for file in os.listdir():
    # check if the file is a file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')

But this replaces all the spaces in the document and not just the ones that I want.
Can you help me?


EDIT

After UlfR’s answer, which was very useful and relevant, I would like to push my criteria further and make my "search/replace" more complex.

Now I would like to look at the characters before/after a word in order to replace spaces with non-breaking spaces. For example:

  • I want to search for the phrase "Can the search be hypothetical?"
    I would like the space between hypothetical and ? to be replaced by a non-breaking space.
  • Likewise for "In the search it is necessary to refer to the {figure 1.12}":
    I would like the spaces between {, figure and } to be non-breaking spaces, and also the space between figure and 1.12 (so all spaces in this case).

I’ve tried this:

units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']

nbsp = '\u00A0'

rgx = re.sub(r'(\b\d+)(%s) (%s)\b'%(units, units_before_after),r'\1%s\2'%nbsp,text))

print(rgx)

But I’m having some trouble. Do you have any ideas to share?

Asked By: Satanas

||

Answers:

You should use the re module to do the replacement, like so:

import re

text = "You need to walk 5 km or 500000 cm."
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
nbsp = '\u00A0'

print(re.sub(r'(\b\d+) (%s)\b' % '|'.join(units), r'\1%s\2' % nbsp, text))

Both the search and replace patterns are dynamically built, but basically you have a pattern that matches:

  1. A word boundary at the start: \b
  2. One or more digits: \d+
  3. One space
  4. One of the units: km|m|cm|...
  5. A word boundary at the end: \b

Then we replace the whole match with the two captured groups (\1 and \2), with the nbsp string between them.
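
The same pattern can be wrapped in a small reusable helper. This is only a sketch: the name add_nbsp_before_units is hypothetical, and escaping the units with re.escape and sorting them longest-first are extra precautions that this particular unit list does not strictly need.

```python
import re

NBSP = '\u00A0'
UNITS = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# Build the alternation once: longest units first, each one escaped in case
# a unit ever contains a regex metacharacter.
_alternatives = '|'.join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
UNIT_PATTERN = re.compile(r'(\b\d+) (%s)\b' % _alternatives)


def add_nbsp_before_units(text):
    """Replace the space between a number and a unit with a non-breaking space."""
    # A function replacement avoids having the NBSP parsed as part of a template.
    return UNIT_PATTERN.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


result = add_nbsp_before_units("You need to walk 5 km or 500000 cm in a minute.")
print(repr(result))
```

Only a space that sits between digits and a whole unit is touched, so the word "in" in "cm in a minute" is left alone. To apply this to files, read each file's content as in the question, pass it through the helper, and write the result back; the code in the question never writes the modified content to disk.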

See the re module documentation for more info on how to use regular expressions in Python. It's well worth the time to learn the basics, since it's a very powerful and useful tool!
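
For the extra criteria in the EDIT, one possible approach is a separate re.sub call per rule. This is a rough sketch under some assumptions: the function names and the marks parameter are hypothetical, the input is assumed to contain literal spaces in those positions (for example "hypothetical ?" and "{ figure 1.12 }"), and braces are assumed not to be nested.

```python
import re

NBSP = '\u00A0'


def nbsp_before_marks(text, marks='?'):
    """Turn a space that directly precedes one of `marks` into a non-breaking space."""
    # The lookahead checks for the mark without consuming it.
    return re.sub(r' (?=[%s])' % re.escape(marks), NBSP, text)


def nbsp_inside_braces(text):
    """Turn every space inside a {...} group into a non-breaking space."""
    # [^{}]* keeps each match within a single, non-nested pair of braces.
    return re.sub(r'\{[^{}]*\}', lambda m: m.group(0).replace(' ', NBSP), text)


print(repr(nbsp_before_marks("Can the search be hypothetical ?")))
print(repr(nbsp_inside_braces("refer to the { figure 1.12 } here")))
```

Keeping each rule in its own function makes it easy to chain them, together with the unit rule, over the same text.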

Have fun 🙂

Answered By: UlfR